cs.CL

[1] Unsupervised Cycle Detection in Agentic Applications

Felix George,Harshit Kumar,Divya Pathak,Kaustabha Ray,Mudit Verma,Pratibha Moogi

Main category: cs.CL

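TL;DR: Proposes an unsupervised cycle detection framework that combines structural and semantic analysis to find implicit execution cycles in LLM-powered agentic applications.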

Details

Motivation: LLM-powered agentic applications behave non-deterministically, which can produce hidden execution cycles, and traditional observability platforms struggle to detect this kind of resource consumption.

Method: Temporal call stack analysis first identifies explicit loops; semantic similarity analysis then uncovers implicit cycles marked by redundant content generation.

Result: On 1575 trajectories from a LangGraph-based stock market application, the hybrid approach reaches an F1 score of 0.72 (precision: 0.62, recall: 0.86), well above the structural (F1: 0.08) and semantic (F1: 0.28) methods used alone.

Conclusion: The hybrid approach detects hidden execution cycles in agentic applications, but substantial room for improvement remains and further work is needed to refine it and address its current limitations.

Abstract: Agentic applications powered by Large Language Models exhibit non-deterministic behaviors that can form hidden execution cycles, silently consuming resources without triggering explicit errors. Traditional observability platforms fail to detect these costly inefficiencies. We present an unsupervised cycle detection framework that combines structural and semantic analysis. Our approach first applies computationally efficient temporal call stack analysis to identify explicit loops and then leverages semantic similarity analysis to uncover subtle cycles characterized by redundant content generation. Evaluated on 1575 trajectories from a LangGraph-based stock market application, our hybrid approach achieves an F1 score of 0.72 (precision: 0.62, recall: 0.86), significantly outperforming individual structural (F1: 0.08) and semantic methods (F1: 0.28). While these results are encouraging, there remains substantial scope for improvement, and future work is needed to refine the approach and address its current limitations.
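
As a rough illustration of the numbers and of what an "explicit loop" might look like, the sketch below recomputes F1 from the reported precision and recall and flags a trajectory whose calls repeat back-to-back. The function names, the `min_repeats` threshold, and the repeated-block heuristic are assumptions made for this example; the abstract does not describe the paper's temporal call stack analysis in enough detail to reproduce it.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def has_structural_cycle(calls: list[str], min_repeats: int = 3) -> bool:
    """Hypothetical heuristic: True if some contiguous block of calls
    occurs at least `min_repeats` times in a row."""
    n = len(calls)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            block = calls[start:start + size]
            if all(
                calls[start + k * size:start + (k + 1) * size] == block
                for k in range(1, min_repeats)
            ):
                return True
    return False


# F1 implied by the reported precision and recall, rounded to two places.
reported_f1 = round(f1_score(0.62, 0.86), 2)

looping = ["plan", "fetch", "plan", "fetch", "plan", "fetch"]
linear = ["plan", "fetch", "summarize"]
```

A purely structural check like this one cannot see cycles in which the call pattern varies but the generated content repeats, which is the gap the semantic stage is meant to cover.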

[2] Data Analysis and Performance Evaluation of Simulation Deduction Based on LLMs

Shansi Zhang,Min Li

Main category: cs.CL

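TL;DR: Proposes an LLM-based multi-round interaction method that uses task decomposition, self-check and reflection, and custom tools to generate high-quality, well-structured analysis reports for military simulations.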

Details

Motivation: Manual analysis is time-consuming and error-prone, and a single instruction is not enough to get a high-quality, well-structured analysis report out of an LLM.

Method: The complex task is decomposed into sub-tasks, each with its own system and user prompts. Multi-round interactions with self-check and reflection perform structured data extraction and multi-step analysis, custom tools generate figures and compute metrics, and multiple report templates provide adaptability.

Result: Experiments show that the method produces higher-quality reports that score above the baseline method.

Conclusion: The method improves the efficiency and accuracy of military simulation data analysis and suits automated report generation across a range of scenarios.

Abstract: Data analysis and performance evaluation of simulation deduction play a pivotal role in modern warfare, enabling military personnel to gain invaluable insights into the potential effectiveness of different strategies, tactics, and operational plans. The traditional manual analysis approach is time-consuming and limited by human errors. To enhance efficiency and accuracy, large language models (LLMs) with strong analytical and inferencing capabilities can be employed. However, high-quality analysis reports with well-structured formatting cannot be obtained through a single instruction input to the LLM. To tackle this issue, we propose a method that first decomposes the complex task into several sub-tasks and designs effective system prompts and user prompts for each sub-task. Multi-round interactions with the LLM incorporating self-check and reflection are then conducted to enable structured data extraction as well as multi-step analysis and evaluation. Furthermore, custom tools are defined and invoked to generate figures and compute metrics. We also design multiple report templates, each tailored to a specific application and input data type, ensuring their adaptability across a variety of scenarios. Extensive evaluation results demonstrate that the reports generated by our method exhibit higher quality and therefore obtain higher scores than the baseline method.

[3] Cognitively-Inspired Episodic Memory Architectures for Accurate and Efficient Character AI

Rafael Arias Gonzalez,Steve DiPaola

Main category: cs.CL

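TL;DR: Proposes an architecture built on offline data augmentation and efficient parallel retrieval that gives dialogue systems deep, low-latency interaction with historical figures. Using Van Gogh as a test case, it shows an advantage on resource-constrained models and also supports visualization tools for biographical analysis.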

Details

Motivation: Existing approaches to historical-character dialogue trade shallow responses against high latency; a system needs both response depth and generation efficiency.

Method: Biographical data is converted into first-person memories with affective-semantic metadata to build a structured episodic memory store, and two-stage retrieval yields prompt generation in 0.52 s.

Result: The approach matches traditional RAG on GPT-4 and significantly outperforms it on smaller models (GPT-3.5, GPT-3). The structured memory also supports visual analyses such as spatiotemporal heatmaps and emotional trajectories.

Conclusion: The architecture improves efficiency while preserving accuracy, fits educational, museum, and research settings, and generalizes to other historical figures.

Abstract: Large language models show promise for embodying historical characters in dialogue systems, but existing approaches face a critical trade-off: simple retrieval-augmented generation produces shallow responses, while multi-stage reflection achieves depth at prohibitive latency. We present an architecture that resolves this tension through offline data augmentation and efficient parallel retrieval from structured episodic memory. Our system transforms biographical data into 1,774 enriched first-person memories with affective-semantic metadata, then employs two-stage retrieval achieving 0.52s prompt generation. Evaluation using LLM-as-judge and RAGAs metrics shows our approach achieves parity with traditional RAG on GPT-4 while significantly outperforming it on smaller models (GPT-3.5, GPT-3), suggesting particular value for resource-constrained deployments. Beyond dialogue, the structured memory enables novel visualization tools: spatiotemporal heatmaps, emotional trajectory analysis, and interactive path tracking, positioning the system as both a dialogue interface and research tool for biographical analysis. We use Van Gogh as a test case, but the architecture is generalizable to any historical figure with substantial textual records, offering a practical framework for educational, museum, and research applications requiring both accuracy and efficiency.

[4] Hybrid Quantum Transformer for Language Generation

Desheng Kong,Xiangshuo Cui,Jiaying Jin,Jing Xu,Donglin Wang

Main category: cs.CL

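TL;DR: Presents HyQuT, the first hybrid quantum-classical large language model for natural language generation. It integrates variational quantum circuits (VQCs) into the Transformer framework and produces coherent, context-aware dialogue at 8M and 150M parameter scales. Experiments show that just 10 qubits and 80 quantum gates can replace about 10% of the classical parameters in the 150M model while keeping comparable convergence stability and generation quality, supporting the feasibility of integrating quantum computing into large-scale generative language models.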

Details

Motivation: Quantum computing is increasingly used in place of classical computation, but most existing quantum or hybrid models are confined to simple tasks and none has been applied successfully to large-scale natural language generation. Integrating quantum computing into large language models is therefore worth exploring.

Method: HyQuT embeds variational quantum circuits (VQCs) in the Transformer architecture and is tested at 8M and 150M parameter scales, with a small number of qubits and quantum gates replacing part of the classical parameters.

Result: 10 qubits and 80 quantum gates can replace about 10% of the classical parameters in the 150M-parameter model, with convergence stability and generation quality comparable to the purely classical model.

Conclusion: The study is the first to show that a hybrid quantum-classical LLM is feasible for natural language generation, providing early evidence for using quantum computing in large-scale generative models.

Abstract: Although quantum computing has been increasingly applied to replace classical computation, most existing quantum or hybrid models remain confined to simple tasks, with no successful application to large-scale natural language generation to date. In this work, we present the first hybrid quantum-classical large language model (LLM) for natural language generation, HyQuT, capable of performing coherent and context-aware dialogue. The proposed architecture integrates variational quantum circuits (VQCs) into the Transformer framework at both 8M and 150M parameter scales. Experimental results show that a minimal number of qubits (10 qubits with 80 quantum gates) can replace about 10% of the classical parameters in the 150M-parameter model, while achieving comparable convergence stability and generation quality. This study provides an early demonstration of the feasibility of integrating quantum computing into large-scale generative language models.

[5] Empirical Characterization of Temporal Constraint Processing in LLMs

Javier Marín

Main category: cs.CL

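TL;DR: Finds that current LLMs have systematic deficiencies on temporal constraint tasks, including bimodal performance, prompt brittleness, and action bias, and that parameter count does not predict capability. Existing autoregressive architectures lack continuous temporal state representation and explicit constraint checking, so symbolic reasoning modules are needed to reduce deployment risk in time-sensitive applications.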

Details

Motivation: Agentic deployments usually assume that an LLM can reliably judge whether a time window is still open, but this assumption has not been tested. Safe and reliable real-time decision systems require evaluating models under temporal constraints.

Method: Eight production-scale LLMs (2.8-8B parameters) are evaluated on deadline detection tasks, and the effects of prompt format and of fine-tuning on 200 synthetic examples are measured.

Result: Performance is bimodal (95% or 50% accuracy), prompt format changes cause swings of 30-60 percentage points, and failing models show 100% false positive rates. Parameter count does not correlate with capability. Fine-tuning improves models with partial capability by 12-37 percentage points but does not fix the underlying deficiency.

Conclusion: Current autoregressive architectures cannot reliably learn temporal constraint satisfaction. Hybrid architectures with continuous temporal state representation, explicit constraint checking, and compositional reasoning are required; deploying without them in time-critical settings carries unacceptable risk.

Abstract: When deploying LLMs in agentic architectures requiring real-time decisions under temporal constraints, we assume they reliably determine whether action windows remain open or have closed. This assumption is untested. We characterize temporal constraint processing across eight production-scale models (2.8-8B parameters) using deadline detection tasks, revealing systematic deployment risks: bimodal performance distribution (models achieve either 95% or 50% accuracy), extreme prompt brittleness (30-60 percentage point swings from formatting changes alone), and systematic action bias (100% false positive rates in failing models). Parameter count shows no correlation with capability in this range: a 3.8B model matches 7B models while other 7B models fail completely. Fine-tuning on 200 synthetic examples improves models with partial capability by 12-37 percentage points. We demonstrate that temporal constraint satisfaction cannot be reliably learned through next-token prediction on natural language, even with targeted fine-tuning. This capability requires architectural mechanisms for: (1) continuous temporal state representation, (2) explicit constraint checking separate from linguistic pattern matching, (3) systematic compositional reasoning over temporal relations. Current autoregressive architectures lack these mechanisms. Deploying such systems in time-critical applications without hybrid architectures incorporating symbolic reasoning modules represents unacceptable risk.
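
To make "explicit constraint checking separate from linguistic pattern matching" concrete, here is a minimal sketch of a symbolic deadline check that an agent could run before asking a model to act. The function name `window_open` and the half-open interval convention are assumptions for illustration; the abstract does not specify any particular module design.

```python
from datetime import datetime, timedelta


def window_open(now: datetime, opened_at: datetime, duration: timedelta) -> bool:
    """True while `now` lies in the half-open interval
    [opened_at, opened_at + duration)."""
    return opened_at <= now < opened_at + duration


# Hypothetical action window: opens at 09:00 and lasts 30 minutes.
opened = datetime(2025, 1, 1, 9, 0, 0)
length = timedelta(minutes=30)

inside = window_open(datetime(2025, 1, 1, 9, 29, 59), opened, length)
at_deadline = window_open(datetime(2025, 1, 1, 9, 30, 0), opened, length)
before = window_open(datetime(2025, 1, 1, 8, 59, 59), opened, length)
```

Because the check is plain arithmetic over timestamps, its answer does not change with prompt formatting, which is the failure mode the paper reports for the models themselves.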

[6] Spectral Neuro-Symbolic Reasoning II: Semantic Node Merging, Entailment Filtering, and Knowledge Graph Alignment

Andrew Kiruluta,Priscilla Burity

Main category: cs.CL

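TL;DR: Proposes three semantic enhancements that extend the Spectral Neuro-Symbolic Reasoning (Spectral NSR) framework, improving graph quality and reasoning performance without changing the core inference engine.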

Details

Motivation: To improve graph fidelity and reasoning accuracy in neuro-symbolic reasoning while keeping it efficient and interpretable, without relying on computationally expensive quadratic attention.

Method: Three preprocessing modules are added (transformer-based node merging, sentence-level entailment validation, and alignment with external knowledge graphs), all performing semantic and symbolic refinement before spectral inference.

Result: Accuracy improves by up to 3.8% on the ProofWriter, EntailmentBank, and CLUTRR benchmarks, with better generalization to adversarial cases and less inference noise.

Conclusion: The upstream semantic enhancement modules make Spectral NSR more robust, interpretable, and scalable, suiting it to open-domain and real-world reasoning.

Abstract: This report extends the Spectral Neuro-Symbolic Reasoning (Spectral NSR) framework by introducing three semantically grounded enhancements: (1) transformer-based node merging using contextual embeddings (e.g., Sentence-BERT, SimCSE) to reduce redundancy, (2) sentence-level entailment validation with pretrained NLI classifiers (e.g., RoBERTa, DeBERTa) to improve edge quality, and (3) alignment with external knowledge graphs (e.g., ConceptNet, Wikidata) to augment missing context. These modifications enhance graph fidelity while preserving the core spectral reasoning pipeline. Experimental results on ProofWriter, EntailmentBank, and CLUTRR benchmarks show consistent accuracy gains (up to +3.8%), improved generalization to adversarial cases, and reduced inference noise. The novelty lies in performing semantic and symbolic refinement entirely upstream of the spectral inference stage, enabling efficient, interpretable, and scalable reasoning without relying on quadratic attention mechanisms. In summary, this work extends the Spectral NSR framework with modular, semantically grounded preprocessing steps that improve graph quality without altering the core spectral reasoning engine. The result is a more robust, interpretable, and scalable reasoning system suitable for deployment in open-domain and real-world settings.

[7] Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models

Biao Liu,Ning Xu,Junming Yang,Xin Geng

Main category: cs.CL

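TL;DR: Proposes PRO (Preference Orchestrator), a framework whose lightweight preference adapter automatically infers prompt-specific preference weights, improving the training efficiency and performance of multi-objective alignment for LLMs.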

Details

Motivation: Existing methods rely on manually specified preference weights, which burdens users, lowers training efficiency, and makes it hard to balance different human preferences in multi-objective alignment.

Method: A lightweight preference adapter learns appropriate preference weights for each prompt from the normalized scores of multiple reward models and infers them during both training and deployment. A theoretical analysis supports its advantage over fixed weights.

Result: Experiments across multiple tasks show that PRO improves on existing multi-objective alignment methods in both performance and training efficiency.

Conclusion: By automating prompt-aware preference weighting, PRO removes the need for manually specified weights in multi-objective alignment and achieves more efficient and better model alignment.

Abstract: While Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, aligning these models with varying human preferences across multiple objectives remains a significant challenge in practical deployments. Existing multi-objective alignment methods rely on manually specified preference weights, which not only burden users with difficult preference specification tasks but also lead to suboptimal training efficiency due to exploration of irrelevant preference combinations. To alleviate these issues, we propose a novel framework named PRO, i.e., PReference Orchestrator, which features a lightweight preference adapter that automatically infers prompt-specific preference weights during both training and deployment phases. Specifically, the adapter automatically learns appropriate preference weights for each prompt by training on normalized reward scores from multiple reward models for preferred responses, which inherently reflect effective preference balances across objectives. Additionally, we provide theoretical analysis proving that our prompt-aware preference mechanism achieves superior performance compared to fixed preference weights in multi-objective alignment scenarios. Extensive experiments across multiple tasks demonstrate the effectiveness of our method over existing multi-objective alignment approaches.

[8] Patent Representation Learning via Self-supervision

You Zuo,Kim Gerdes,Eric Villemonte de La Clergerie,Benoît Sagot

Main category: cs.CL

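TL;DR: Proposes a contrastive learning framework based on multiple views within a patent document. Section-based augmentation overcomes the overly uniform embeddings caused by SimCSE-style dropout, and without external annotations the method matches or surpasses supervised approaches on retrieval and classification.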

Details

Motivation: SimCSE-style dropout augmentation produces overly uniform embeddings on patent data and loses semantic cohesion, so a self-supervised method better suited to the structure of patent text is needed.

Method: Different sections of a patent (e.g., abstract, claims, background) serve as complementary views for contrastive learning, using the patent's inherent discourse structure to provide diverse augmentation.

Result: On large-scale benchmarks, the fully self-supervised method matches or surpasses supervised baselines that rely on citations and IPC codes in prior-art retrieval and classification, and different sections turn out to specialize for different tasks.

Conclusion: Contrastive learning over the natural multi-view structure inside patent documents improves embedding quality and supports scalable, generalizable patent understanding.

Abstract: This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks (claims and summaries benefit retrieval, while background sections aid classification), highlighting the value of patents' inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.

[9] Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

Douwe J. Spaanderman,Karthik Prathaban,Petr Zelina,Kaouther Mouheb,Lukáš Hejtmánek,Matthew Marzetti,Antonius W. Schurink,Damian Chan,Ruben Niemantsverdriet,Frederik Hartmann,Zhen Qian,Maarten G. J. Thomeer,Petr Holub,Farhan Akram,Frank J. Wolters,Meike W. Vernooij,Cornelis Verhoef,Esther E. Bron,Vít Nováček,Dirk J. Grünhagen,Wiro J. Niessen,Martijn P. A. Starmans,Stefan Klein

Main category: cs.CL

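TL;DR: Evaluates 15 open-weight LLMs on structured information extraction from clinical text across multiple diseases, institutions, and languages, and finds that small general-purpose models perform on par with large ones, pointing to scalable use in clinical data curation.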

Details

Motivation: Prior work mostly covers single tasks, specific models, and English text, leaving no systematic evaluation of how open-weight LLMs extract information across diverse clinical settings.

Method: 15 open-weight LLMs, general-purpose and medical-specialised and of various sizes, are evaluated on six clinical use cases (including colorectal liver metastases, liver tumours, and neurodegenerative diseases) with six prompting strategies: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. The evaluation covers pathology and radiology reports from the Netherlands, UK, and Czech Republic, and uses consensus rank aggregation and linear mixed-effects models to analyse performance differences.

Result: Top-ranked models reach macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models perform comparably to large models, while tiny and specialised models perform worse. Prompt graph and few-shot prompting improve performance by about 13%. Task-specific factors such as complexity and annotation variability influence results more than model size or prompting strategy.

Conclusion: Open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach to clinical data curation.

Abstract: Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases (colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas) at three institutes in the Netherlands, UK, and Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.

[10] Information Extraction From Fiscal Documents Using LLMs

Vikram Aggarwal,Jay Kulkarni,Aditi Mascarenhas,Aakriti Narang,Siddarth Raman,Ajay Shah,Susan Thomas

Main category: cs.CL

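TL;DR: Proposes a multi-stage LLM-based method for extracting structured data from multi-page fiscal documents of the State of Karnataka, India, using hierarchical validation to improve accuracy.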

Details

Motivation: To explore how well LLMs handle complex, hierarchical tabular data, and to address the inability of traditional OCR methods to verify that numbers were extracted accurately.

Method: A multi-stage pipeline combines domain knowledge, sequential context, and algorithmic validation, using the hierarchical structure of fiscal tables for multi-level validation checks.

Result: The method achieves high-accuracy extraction on fiscal documents of more than 200 pages and validates the extracted data internally through hierarchical totals.

Conclusion: LLMs can read tables and process document-specific structural hierarchies, offering a scalable way to convert PDF-based fiscal disclosures into research-ready databases, with potential for broad use in developing countries.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A major challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. When applied to fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
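
The idea of validating against totals at each level of the hierarchy can be sketched as a recursive check that every parent amount equals the sum of its children. The `Node` type, the `tolerance` parameter, and the budget figures below are hypothetical; the abstract does not give the paper's actual validation rules.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One extracted line item; `children` are its sub-items (may be empty)."""
    name: str
    amount: float
    children: list["Node"] = field(default_factory=list)


def find_mismatches(node: Node, tolerance: float = 0.5) -> list[str]:
    """Return names of nodes whose stated amount differs from the sum of
    their children by more than `tolerance` (assumed rounding slack)."""
    mismatches: list[str] = []
    if node.children:
        total = sum(child.amount for child in node.children)
        if abs(total - node.amount) > tolerance:
            mismatches.append(node.name)
        for child in node.children:
            mismatches.extend(find_mismatches(child, tolerance))
    return mismatches


# Made-up figures: "Health" states 60 but its sub-items sum to 50.
budget = Node("Total", 100.0, [
    Node("Health", 60.0, [Node("Hospitals", 30.0), Node("Clinics", 20.0)]),
    Node("Education", 40.0),
])
flagged = find_mismatches(budget)
```

A flagged node only says that some number in its subtree was misread; deciding which one still needs a second look at the source page.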

[11] Test-Time Steering for Lossless Text Compression via Weighted Product of Experts

Qihang Zhang,Muchen Li,Ziao Wang,Renjie Liao,Lele Wang

Main category: cs.CL

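TL;DR: Proposes a lossless text compression framework based on test-time steering with a Weighted Product of Experts (wPoE), which adaptively combines a universal compression model with a pretrained language model to improve compression rates without fine-tuning.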

Details

Motivation: Traditional universal compressors achieve worse compression rates than neural compressors, while neural compressors generalize poorly, so better adaptation to unseen data is needed.

Method: At inference, a Weighted Product of Experts (wPoE) adaptively combines a universal compression model with a pretrained language model, ensuring the compression rate is at least as good as that of the best individual model.

Result: Experiments show that the method improves text compression without fine-tuning and integrates with any autoregressive language model.

Conclusion: The wPoE framework improves text compression across diverse data distributions and is practical to deploy.

Abstract: Lossless compression techniques are crucial in an era of rapidly growing data. Traditional universal compressors like gzip offer low computational overhead, high speed, and broad applicability across data distributions. However, they often lead to worse compression rates than modern neural compressors, which leverage large-scale training data to model data distributions more effectively. Despite their advantages, neural compressors struggle to generalize to unseen data. To address this limitation, we propose a novel framework that performs Test-Time Steering via a Weighted Product of Experts (wPoE). At inference, our method adaptively combines a universal compression model with a pretrained neural language model, ensuring the compression rate is at least as good as that of the best individual model. Extensive experiments demonstrate that our approach improves the performance of text compression without requiring fine-tuning. Furthermore, it seamlessly integrates with any autoregressive language model, providing a practical solution for enhancing text compression across diverse data distributions.
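
A weighted product of two experts over a small alphabet can be sketched as below: each expert's next-symbol probability is raised to its weight, the products are renormalized, and the ideal code length of a symbol is its negative log2 probability. The two-expert form with weights `w` and `1 - w`, the toy distributions, and the function names are assumptions for illustration; the abstract does not say how the paper parameterizes or selects the weights.

```python
import math


def weighted_product(p: list[float], q: list[float], w: float) -> list[float]:
    """Combine two next-symbol distributions as p^w * q^(1-w), renormalized."""
    unnormalized = [a ** w * b ** (1.0 - w) for a, b in zip(p, q)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def code_length_bits(dist: list[float], symbol: int) -> float:
    """Ideal code length, in bits, of `symbol` under `dist`."""
    return -math.log2(dist[symbol])


# Toy distributions over a 3-symbol alphabet (made-up numbers).
universal = [0.5, 0.25, 0.25]    # stands in for the universal compression model
neural = [0.7, 0.2, 0.1]         # stands in for the pretrained language model

mixed = weighted_product(universal, neural, 0.5)
only_universal = weighted_product(universal, neural, 1.0)
only_neural = weighted_product(universal, neural, 0.0)
```

Setting `w` to 1 or 0 recovers each expert exactly, so the family of combined models always contains both individual models. That is one way to read the "at least as good as the best individual model" property, provided the weight is chosen well.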

[12] Bayesian Evaluation of Large Language Model Behavior

Rachel Longjohn,Shang Wu,Saatvik Kher,Catarina Belém,Padhraic Smyth

Main category: cs.CL

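TL;DR: Presents a Bayesian approach for quantifying the statistical uncertainty of binary evaluation metrics for LLM text generation, applied to adversarial inputs and pairwise dialogue preferences.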

Details

Motivation: Existing LLM evaluation methods often neglect statistical uncertainty quantification, and this uncertainty is especially pronounced in systems that use probabilistic generation strategies. A method that reflects the uncertainty of evaluation results is needed.

Method: A Bayesian approach models the binary evaluation outcomes of LLM generations and quantifies the uncertainty induced by probabilistic text generation strategies; two case studies test the approach.

Result: In both case studies, the Bayesian approach gives useful uncertainty estimates about LLM behavior, for refusal rates and for pairwise dialogue preferences.

Conclusion: The Bayesian approach quantifies statistical uncertainty in LLM evaluation and supports more reliable and interpretable model assessment.

Abstract: It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts provided to the LLM, where the output for each prompt may be assessed in a binary fashion (e.g., harmful/non-harmful or does not leak/leaks sensitive information), and the aggregation of binary scores is used to evaluate the LLM. However, existing approaches to evaluation often neglect statistical uncertainty quantification. With an applied statistics audience in mind, we provide background on LLM text generation and evaluation, and then describe a Bayesian approach for quantifying uncertainty in binary evaluation metrics. We focus in particular on uncertainty that is induced by the probabilistic text generation strategies typically deployed in LLM-based systems. We present two case studies applying this approach: 1) evaluating refusal rates on a benchmark of adversarial inputs designed to elicit harmful responses, and 2) evaluating pairwise preferences of one LLM over another on a benchmark of open-ended interactive dialogue examples. We demonstrate how the Bayesian approach can provide useful uncertainty quantification about the behavior of LLM-based systems.
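
The simplest Bayesian treatment of a binary metric is a Beta prior updated with observed counts, shown here for a refusal rate. This Beta-Bernoulli model, the uniform prior, and the counts are assumptions chosen for illustration; the abstract does not state which model the paper uses, and this sketch ignores the per-prompt generation randomness the paper focuses on.

```python
import random


def beta_posterior(successes: int, trials: int,
                   prior_a: float = 1.0, prior_b: float = 1.0) -> tuple[float, float]:
    """Posterior Beta(a, b) parameters after observing binary outcomes."""
    return prior_a + successes, prior_b + (trials - successes)


def credible_interval(a: float, b: float, level: float = 0.95,
                      draws: int = 20000, seed: int = 0) -> tuple[float, float]:
    """Equal-tailed credible interval estimated from Monte Carlo draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    low = samples[int((1.0 - level) / 2.0 * draws)]
    high = samples[int((1.0 + level) / 2.0 * draws) - 1]
    return low, high


# Made-up counts: 45 refusals observed on 50 adversarial prompts.
a, b = beta_posterior(successes=45, trials=50)
posterior_mean = a / (a + b)
low, high = credible_interval(a, b)
```

Reporting the interval alongside the point estimate shows how much a benchmark of this size can actually say about the underlying rate.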

[13] Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages:A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish

Chengxuan Xia,Qianye Wu,Hongbin Guan,Sixuan Tian,Yilun Hao,Xiaoyu Wu

Main category: cs.CL

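TL;DR: Evaluates seven cutting-edge LLMs on Cantonese, Japanese, and Turkish, three low-resource and morphologically rich languages, across four tasks (question answering, summarization, translation, and culturally grounded dialogue) with both human and automated evaluation. Large models such as GPT-4o and Claude 3.5 lead overall but still show gaps in cultural understanding and morphological generalization; smaller open-source models perform worse, and all models struggle with language-specific challenges.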

Details

Motivation: To examine how LLMs perform on low-resource, morphologically rich languages and to fill the research gap around their cross-lingual understanding and cultural adaptability.

Method: A new cross-lingual benchmark covering Cantonese, Japanese, and Turkish spans four tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. Seven mainstream LLMs are assessed with human evaluation (fluency, factual accuracy, cultural appropriateness) and automated metrics (e.g., BLEU, ROUGE).

Result: GPT-4o, GPT-4, and Claude 3.5 perform best on most tasks; GPT-4o is robust on cross-lingual tasks and Claude 3.5 is accurate on knowledge and reasoning tasks. All models fall short on cultural nuance and morphological complexity (such as Turkish agglutinative structure and Cantonese colloquial expressions), and smaller open-source models (such as LLaMA-2 13B and Mistral 7B) lag clearly in fluency and accuracy.

Conclusion: Current LLMs still face challenges on low-resource and morphologically rich languages, with notable limits in cultural adaptability and in understanding linguistic structure. Cross-lingual and cross-cultural generalization needs to improve, and the resource gap matters. The authors release the benchmark data to support further research.

Abstract: Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs (GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct) on a new cross-lingual benchmark covering Cantonese, Japanese, and Turkish. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine human evaluations (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results, qualitative error analysis, and discuss implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.

[14] Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models

Cristina Pinneri,Christos Louizos

Main category: cs.CL

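TL;DR: Proposes a self-supervised framework that uses paraphrase sets and a skew-aware aggregation strategy to improve the semantic robustness and prediction consistency of LLM guard models, substantially reducing semantic variability and improving calibration.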

Details

Motivation: Existing LLM guard models are sensitive to surface-level linguistic changes and lack semantic stability: even meaning-preserving paraphrases can cause large swings in safety scores, which undermines reliability.

Method: A self-supervised framework trains on paraphrase sets and uses a skew-aware aggregation strategy to compute robust targets, strengthening prediction consistency across different phrasings.

Result: Across six open-source guard models, semantic variability falls by about 58%, benchmark accuracy rises by about 2.5% on average, and the gains generalize to unseen stylistic variations. The method also improves calibration by up to 40%.

Conclusion: Treating semantic consistency as a first-class training objective makes guard models markedly more robust and reliable, and the method offers a scalable recipe for building more stable safety systems.

Abstract: Guard models are a critical component of LLM safety, but their sensitivity to superficial linguistic variations remains a key vulnerability. We show that even meaning-preserving paraphrases can cause large fluctuations in safety scores, revealing a lack of semantic grounding. To address this, we introduce a practical, self-supervised framework for improving the semantic robustness of guard models. Our method leverages paraphrase sets to enforce prediction consistency using a novel, skew-aware aggregation strategy for robust target computation. Notably, we find that standard aggregation methods like mean and median can degrade safety, underscoring the need for skew-aware alternatives. We analyze six open-source guard models and show that our approach reduces semantic variability across paraphrases by ~58%, improves benchmark accuracy by ~2.5% on average, and generalizes to unseen stylistic variations. Intriguingly, we discover a bidirectional relationship between model calibration and consistency: our robustness training improves calibration by up to 40%, revealing a fundamental connection between these properties. These results highlight the value of treating semantic consistency as a first-class training objective and provide a scalable recipe for building more reliable guard models.

[15] Evaluating LLM Understanding via Structured Tabular Decision Simulations

Sichao Li,Xinyue Xu,Xiaomeng Li

Main category: cs.CL

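TL;DR: Introduces Structured Tabular Decision Simulations (STaDS) to evaluate whether LLMs actually understand multi-domain decision tasks rather than merely predicting accurately. Models that score well on accuracy often fail to reason from the correct decision factors, and their stated rationales often do not match what actually drives their predictions.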

Details

Motivation: Predictive accuracy alone cannot show whether an LLM has real understanding; an evaluation is needed that checks whether the model relies on correct, domain-relevant decision factors.

Method: The STaDS framework builds several expert-like structured decision settings and assesses 9 frontier LLMs across 15 domains on three aspects: instruction comprehension, knowledge-based prediction, and reliance on key decision factors.

Result: Most models struggle to stay accurate across domains. Many produce correct outputs without relying on the correct factors, and their generated rationales do not match the factors that actually drive predictions.

Conclusion: Evaluation needs to go beyond accuracy to measure real understanding, and new frameworks should promote global, consistent understanding.

Abstract: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. True LLM understanding, analogous to human expertise, requires making consistent, well-founded decisions across multiple instances and diverse domains, relying on relevant and domain-grounded decision factors. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings that evaluate LLMs as if they were professionals undertaking structured decision "exams". In this context, understanding is defined as the ability to identify and rely on the correct decision factors, features that determine outcomes within a domain. STaDS jointly assesses understanding through: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that (a) most models struggle to achieve consistently strong accuracy across diverse domains; (b) models can be accurate yet globally unfaithful, and there are frequent mismatches between stated rationales and factors driving predictions. Our findings highlight the need for global-level understanding evaluation protocols and advocate for novel frameworks that go beyond accuracy to enhance LLMs' understanding ability.

[16] Forecasting Spoken Language Development in Children with Cochlear Implants Using Preimplantation MRI

Yanlin Wang,Di Yuan,Shani Dettman,Dawn Choo,Emily Shimeng Xu,Denise Thomas,Maura E Ryan,Patrick C M Wong,Nancy M Young

Main category: cs.CL

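TL;DR: Compares traditional machine learning (ML) with deep transfer learning (DTL) for predicting spoken language development in children with cochlear implants. DTL models based on brain neuroanatomic features, in particular with a bilinear attention-based fusion strategy, clearly outperform traditional ML in accuracy, sensitivity, and specificity, suggesting strong potential for clinical use.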

Details

Motivation: Language outcomes in children with cochlear implants vary widely between individuals, and known factors such as age at implantation and residual hearing do not predict them accurately, so more reliable prediction models are needed to guide clinical decisions.

Method: Using data from 278 implanted children at three centers, traditional ML and DTL models are built on brain neuroanatomic features to classify children as high or low language improvers, and their performance is compared.

Result: The DTL model with a bilinear attention-based fusion strategy reaches 92.39% accuracy, 91.22% sensitivity, 93.56% specificity, and an AUC of 0.977, outperforming traditional ML on every measure. DTL captures task-relevant features better, which improves prediction.

Conclusion: Deep transfer learning can use brain neuroanatomic features to predict language development in children with cochlear implants and could serve as a single prediction tool for CI programs worldwide.

Abstract: Cochlear implants (CI) significantly improve spoken language in children with severe-to-profound sensorineural hearing loss (SNHL), yet outcomes remain more variable than in children with normal hearing. This variability cannot be reliably predicted for individual children using age at implantation or residual hearing. This study aims to compare the accuracy of traditional machine learning (ML) to deep transfer learning (DTL) algorithms to predict post-CI spoken language development of children with bilateral SNHL using a binary classification model of high versus low language improvers. A total of 278 implanted children were enrolled from three centers. The accuracy, sensitivity, and specificity of prediction models based upon brain neuroanatomic features were compared between traditional ML and DTL. DTL prediction models using a bilinear attention-based fusion strategy achieved: accuracy of 92.39% (95% CI, 90.70%-94.07%), sensitivity of 91.22% (95% CI, 89.98%-92.47%), specificity of 93.56% (95% CI, 90.91%-96.21%), and area under the curve (AUC) of 0.977 (95% CI, 0.969-0.986). DTL outperformed traditional ML models in all outcome measures. DTL benefited from directly capturing discriminative, task-specific information, an advantage of the representation learning that this approach enables over ML. The results support the feasibility of a single DTL prediction model for language prediction of children served by CI programs worldwide.
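
For readers less familiar with the reported measures, the sketch below computes sensitivity, specificity, and accuracy from a binary confusion matrix. The counts are invented for illustration and are not the study's data; the function name is likewise hypothetical.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Standard metrics for a high-vs-low binary classifier.

    tp/fn count the positive class (e.g. high improvers),
    tn/fp count the negative class (e.g. low improvers).
    """
    return {
        "sensitivity": tp / (tp + fn),   # share of positives correctly identified
        "specificity": tn / (tn + fp),   # share of negatives correctly identified
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Invented counts with two equally sized classes of 100 children each.
metrics = binary_metrics(tp=90, fn=10, tn=95, fp=5)
```

When the two classes are the same size, accuracy equals the mean of sensitivity and specificity, as in this example (0.925). The reported accuracy of 92.39% likewise equals the mean of the reported 91.22% and 93.56%.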

[17] Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

Yan Gao,Yazheng Yang,Zhibin Lan,Yidong Chen,Min Zhang,Daimeng Wei,Hui Huang,Jinsong Su

Main category: cs.CL

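TL;DR: Proposes a multi-stage training method built on a large language model (LLM) and a Mixture of Experts (MoE) speech projector to improve code-switching (CS) speech translation, addressing both the complexity of semantic modeling and the scarcity of data.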

Details

Motivation: Existing code-switching speech translation methods rely on the model to learn semantics implicitly and need extensive manual annotation, which is inefficient and costly; cross-lingual semantics are hard to model and training data is scarce.

Method: An MoE speech projector assigns each expert to the semantic subspace of a specific language for fine-grained modeling of speech features. A multi-stage training paradigm uses monolingual ASR and ST data for speech-text alignment. A language-specific loss, an intra-group load balancing loss, and a transition loss guide expert allocation and smooth the data transition between stages.

Result: Extensive experiments on widely used datasets show that the method clearly outperforms baseline models on code-switching speech translation and generalizes well.

Conclusion: Enhancing an LLM with an MoE speech projector and multi-stage training improves semantic modeling for code-switching speech translation and eases data scarcity, offering a workable approach for low-resource multilingual speech translation.

Abstract: Code-switching (CS) speech translation (ST) refers to translating speech that alternates between two or more languages into a target language text, which poses significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotations for these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. Additionally, we introduce a multi-stage training paradigm that utilizes readily available monolingual automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation capabilities. During training, we leverage a combination of language-specific loss and intra-group load balancing loss to guide the MoE speech projector in efficiently allocating tokens to the appropriate experts, across expert groups and within each group, respectively. To bridge the data gap across different training stages and improve adaptation to the CS scenario, we further employ a transition loss, enabling smooth transitions of data between stages, to effectively address the scarcity of high-quality CS speech translation data. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach.

[18] Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

Filippo Morbiato,Luca Romano,Alessandro Persona

Main category: cs.CL

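TL;DR: Proposes Grounded Visual Factualization (GVF) Finetuning, which combines structured factual anchors and counter-factual prompts, fact-aware instruction tuning, and a factual consistency loss to systematically improve the factual consistency of multimodal large language models on visual content and reduce visual hallucination.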

Details

Motivation: Visual hallucination seriously undermines the reliability of multimodal large language models (MLLMs), and existing fine-tuning methods intervene too little in factual reasoning to improve visual factual consistency effectively. Method: Proposes the GVF fine-tuning framework with three core mechanisms: Factual Anchor Data Augmentation, Fact-Aware Instruction Tuning, and a Factual Consistency Loss, which together inject explicit factual signals during training and penalize factual errors. Result: Evaluated on LLaVA-1.5-13B, GVF significantly outperforms standard fine-tuning on the VHTest benchmark in both open-ended and yes/no question formats, while maintaining or slightly improving performance on general multimodal benchmarks such as MME and POPE. Conclusion: GVF effectively mitigates visual hallucination in MLLMs and improves visual factual consistency without harming general multimodal understanding and reasoning. Abstract: Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended Question (OEQ) and Yes/No Question (YNQ) formats. Crucially, GVF maintains or even slightly improves performance on general multimodal benchmarks like MME and POPE, demonstrating effective mitigation of visual hallucinations without compromising general understanding and reasoning abilities.

[19] Large language models in materials science and the need for open-source approaches

Fengxu Yang,Weitong Chen,Jack D. Evans

Main category: cs.CL

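TL;DR: Reviews applications of large language models (LLMs) in materials science, covering literature mining, predictive modelling, and multi-agent experimental systems, and advocates open-source models as the basis for accessible, flexible, community-driven AI platforms for research.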

Details

Motivation: To advance the use of AI in materials science, in particular by applying LLMs across the materials discovery pipeline to improve efficiency and transparency. Method: Reviews three main application areas of LLMs in materials science (literature mining, predictive modelling, and multi-agent experimental systems) and benchmarks open-source models against closed-source ones. Result: Open-source LLMs can match closed-source models in performance while offering advantages in transparency, reproducibility, cost-effectiveness, and data privacy. Conclusion: Open-source LLMs should be adopted more broadly to build open, flexible, community-driven AI platforms for scientific discovery. Abstract: Large language models (LLMs) are rapidly transforming materials science. This review examines recent LLM applications across the materials discovery pipeline, focusing on three key areas: mining scientific literature, predictive modelling, and multi-agent experimental systems. We highlight how LLMs extract valuable information such as synthesis conditions from text, learn structure-property relationships, and can coordinate agentic systems integrating computational tools and laboratory automation. While progress has been largely dependent on closed-source commercial models, our benchmark results demonstrate that open-source alternatives can match performance while offering greater transparency, reproducibility, cost-effectiveness, and data privacy. As open-source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community-driven AI platforms for scientific discovery.

[20] Continual Learning of Domain Knowledge from Human Feedback in Text-to-SQL

Thomas Cook,Kelly Patel,Sivapriya Vellaichamy,Saba Rahimi,Zhen Zeng,Sumitra Ganesh

Main category: cs.CL

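TL;DR: Proposes a text-to-SQL framework that learns continually from human feedback, storing distilled knowledge in a structured memory to improve query accuracy.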

Details

Motivation: Large language models struggle with database-specific schemas and tacit domain knowledge when generating SQL queries, so human feedback is needed to improve them. Method: Designs a learning-agent architecture that receives natural language feedback to refine SQL queries, distills the knowledge gained, and stores it in a structured memory for later tasks; evaluates several agent variants that differ in how they capture and retrieve past experiences. Result: Experiments on the BIRD benchmark Dev set show that memory-augmented agents, particularly the Procedural Agent, significantly improve execution accuracy and reduce errors by leveraging human-in-the-loop feedback. Conclusion: Transforming tacit human expertise into reusable knowledge is essential for building adaptive, domain-aware text-to-SQL systems. Abstract: Large Language Models (LLMs) can generate SQL queries from natural language questions but struggle with database-specific schemas and tacit domain knowledge. We introduce a framework for continual learning from human feedback in text-to-SQL, where a learning agent receives natural language feedback to refine queries and distills the revealed knowledge for reuse on future tasks. This distilled knowledge is stored in a structured memory, enabling the agent to improve execution accuracy over time. We design and evaluate multiple variations of a learning agent architecture that vary in how they capture and retrieve past experiences. Experiments on the BIRD benchmark Dev set show that memory-augmented agents, particularly the Procedural Agent, achieve significant accuracy gains and error reduction by leveraging human-in-the-loop feedback. Our results highlight the importance of transforming tacit human expertise into reusable knowledge, paving the way for more adaptive, domain-aware text-to-SQL systems that continually learn from a human-in-the-loop.

[21] Learn to Select: Exploring Label Distribution Divergence for In-Context Demonstration Selection in Text Classification

Ye Jiang,Taihang Wang,Youzheng Liu,Yimin Wang,Yuhan Xia,Yunfei Long

Main category: cs.CL

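TL;DR: Proposes TopK + L2D, a two-stage in-context demonstration selection method that combines semantic similarity with label distribution alignment to improve large language model performance on text classification.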

Details

Motivation: Existing demonstration selection methods focus mainly on semantic similarity and overlook label distribution alignment between test inputs and demonstrations, which limits performance. Method: Uses a fine-tuned BERT-like small language model (SLM) to generate label distributions for candidate demonstrations and test inputs, computes the divergence between them (L2D), and, among semantically similar candidates, selects the demonstrations whose label distributions align best with the test input. Result: Experiments on seven text classification benchmarks show that the method consistently outperforms previous demonstration selection strategies, and that LLM performance correlates positively with the accuracy of the SLM's label distribution estimates. Conclusion: Label distribution alignment is an important factor in in-context learning, and a two-stage selection strategy that combines semantic similarity with label distribution effectively improves LLM classification performance. Abstract: In-context learning (ICL) for text classification, which uses a few input-label demonstrations to describe a task, has demonstrated impressive performance on large language models (LLMs). However, the selection of in-context demonstrations plays a crucial role and can significantly affect LLMs' performance. Most existing demonstration selection methods primarily focus on semantic similarity between test inputs and demonstrations, often overlooking the importance of label distribution alignment. To address this limitation, we propose a two-stage demonstration selection method, TopK + Label Distribution Divergence (L2D), which leverages a fine-tuned BERT-like small language model (SLM) to generate label distributions and calculate their divergence for both test inputs and candidate demonstrations. This enables the selection of demonstrations that are not only semantically similar but also aligned in label distribution with the test input. Extensive experiments across seven text classification benchmarks show that our method consistently outperforms previous demonstration selection strategies. Further analysis reveals a positive correlation between the performance of LLMs and the accuracy of the underlying SLMs used for label distribution estimation.
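
The two-stage selection described in this entry can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which divergence is used, so KL divergence is an assumption here, and the function names, the dictionary layout of `candidates`, and the `top_k` / `n_final` parameters are all hypothetical. Embeddings and label distributions are taken as precomputed inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two label distributions (assumed divergence)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select_demonstrations(test_emb, test_dist, candidates, top_k=4, n_final=2):
    """Two-stage selection over candidates with 'emb', 'dist' and 'name' keys.

    Stage 1 keeps the top_k candidates by semantic similarity.
    Stage 2 reranks those by label-distribution divergence (smallest first).
    """
    by_similarity = sorted(
        candidates, key=lambda c: cosine(test_emb, c["emb"]), reverse=True
    )[:top_k]
    by_divergence = sorted(
        by_similarity, key=lambda c: kl_divergence(test_dist, c["dist"])
    )
    return by_divergence[:n_final]
```

In this sketch a candidate that is semantically close but whose predicted label distribution disagrees with the test input is pushed down in stage 2, which is the behaviour the entry attributes to L2D.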

[22] Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models

Shien Zhu,Samuel Bohl,Robin Oester,Gustavo Alonso

Main category: cs.CL

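TL;DR: Proposes an expert prediction method based on pre-attention activations for efficient expert prefetching in Mixture-of-Experts (MoE) large language models, substantially improving prediction accuracy while keeping computational overhead low.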

Details

Motivation: Existing expert prediction methods rely on activations from the previous layer, which yields low accuracy and leaves the first layer unoptimized, while more complex predictors add high computation overhead; a more accurate yet lightweight expert prefetching method is needed. Method: Uses the activations before the attention block in the same layer, together with two linear functions and a ranking-aware loss, to predict experts before attention; this also supports prefetching in the first layer and keeps computation cost low. Result: Reaches prediction accuracies of 93.03%, 94.69%, and 97.62% on DeepSeek V2 Lite, Qwen3-30B, and Phi-mini-MoE respectively, an absolute improvement of about 15% over existing methods. Conclusion: The proposed lightweight pre-attention expert routers achieve high-accuracy expert prediction across several MoE models and improve inference efficiency, making them suitable for optimizing the deployment of large language models. Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) efficiently scale up the model while keeping relatively low inference cost. As MoE models only activate part of the experts, related work has proposed expert prediction and caching methods to prefetch the experts for faster inference. However, existing approaches utilize the activations from the previous layer for prediction, incurring low accuracy and leaving the first layer unoptimized. Applying complex layers or even training standalone networks for better prediction introduces high computation overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are ranking-preserving, indicating that matching the ranking of selected experts using simple linear functions is possible. Therefore, we utilize the activations before the attention block in the same layer with 2 linear functions and ranking-aware loss to achieve accurate prediction, which also supports prefetching in the first layer. Our lightweight, pre-attention expert routers achieve 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, showing about 15% improvement on absolute accuracy over the state-of-the-art methods.

[23] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI

Anupama Sitaraman,Bharathan Balaji,Yuvraj Agarwal

Main category: cs.CL

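TL;DR: SpiderGen is an LLM-based workflow that combines traditional Life Cycle Assessment (LCA) methodology with the reasoning capabilities of LLMs to generate the process information needed to assess a product's environmental impact automatically, for under 1 USD and in under 10 minutes, far below the cost and time of traditional LCA.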

Details

Motivation: The production, use, and disposal of consumer products are a major source of greenhouse gas emissions, so assessing their environmental impact accurately is essential. Traditional Life Cycle Assessment (LCA) is slow and expensive, and automated tools are needed to reduce the labor and cost involved. Method: Proposes SpiderGen, an automated workflow that integrates the taxonomy and methodology of traditional LCA with the world knowledge and reasoning capabilities of LLMs to generate the detailed process information required for LCA, and evaluates it against real-world LCA documents. Result: SpiderGen reaches an F1 score of 62% across 10 samples, with generated information that is mostly correct or has only minor errors; it outperforms baselines such as chain-of-thought and one-shot prompting; the remaining errors stem mainly from differences in detail and scope between LCA documents. Conclusion: SpiderGen can generate reliable LCA process information quickly and cheaply, with considerable potential to cut the cost and time of carbon footprint assessment and to support sustainable consumption and green manufacturing. Abstract: Investigating the effects of climate change and global warming caused by GHG emissions has been a primary concern worldwide. These emissions are largely contributed to by the production, use and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate the procedural information used for LCA. We additionally evaluate the output of SpiderGen using real-world LCA documents as ground-truth. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 62% across 10 sample data points. We observe that the remaining missed processes and hallucinated errors occur primarily due to differences in detail between LCA documents, as well as differences in the "scope" of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen's potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than \$1 USD in under 10 minutes as compared to the status quo LCA, which can cost over \$25000 USD and take up to 21 person-days.

[24] A methodological analysis of prompt perturbations and their effect on attack success rates

Tiago Machado,Maysa Malfiza Garcia de Macedo,Rogerio Abreu de Paula,Marcelo Carpinette Grave,Aminat Adebiyi,Luan Soares de Souza,Enrico Santarelli,Claudio Pinhanez

Main category: cs.CL

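TL;DR: Systematically analyzes how sensitive large language models aligned with different methods (SFT, DPO, RLHF) are to prompt attacks, finding that small prompt modifications significantly change the Attack Success Rate (ASR), which indicates that existing attack benchmarks alone are not enough to assess model vulnerabilities fully.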

Details

Motivation: To examine how different alignment methods affect the behavior of large language models under prompt attacks and to expose the limitations of current evaluation practice. Method: Selects open-source models based on SFT, DPO, and RLHF and uses statistical methods to analyze systematically how prompt variations affect the Attack Success Rate (ASR). Result: Small prompt modifications significantly change the ASR, and different alignment methods show different degrees of sensitivity; existing attack benchmarks may not expose all potential vulnerabilities of models and alignment methods. Conclusion: The safety of alignment methods should be evaluated with systematic, statistically grounded analyses; relying on standard attack benchmarks alone is not sufficient. Abstract: This work aims to investigate how different Large Language Model (LLM) alignment methods affect the models' responses to prompt attacks. We selected open source models based on the most common alignment methods, namely, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the Attack Success Rate (ASR) according to the statistical tests we run, making the models more or less susceptible to types of attack. Critically, our results demonstrate that running existing 'attack benchmarks' alone may not be sufficient to elicit all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic and statistically-based analyses of the different alignment methods and how sensitive their ASR is to prompt variation.

[25] Modeling and Predicting Multi-Turn Answer Instability in Large Language Models

Jiahang He,Rishi Ramachandran,Neel Ramachandran,Aryan Katakam,Kevin Zhu,Sunishchal Dev,Ashwinee Panda,Aryan Shrivastava

Main category: cs.CL

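TL;DR: Studies the robustness of large language models in multi-turn interaction, finding that a simple "Think again" prompt causes a notable accuracy drop and that accuracy across turns can be modeled with Markov chains, which exposes model fragility under repeated questioning.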

Details

Motivation: As large language models are used in more and more applications, evaluating their robustness in multi-turn interaction becomes essential to ensuring their reliability on real-world tasks. Method: Uses simple multi-turn follow-up prompts to evaluate how model answers change, models the dynamics of accuracy across turns with Markov chains, and examines whether linear probes can predict these changes. Result: A "Think again" prompt causes an accuracy drop of about 10% for Gemini 1.5 Flash over nine turns, and combining it with a reworded question causes a 7.5% drop for Claude 3.5 Haiku; accuracy across turns can be modeled with Markov chains, and the stationary (long-run) accuracy is on average about 8% below first-turn accuracy for Gemini 1.5 Flash; analysis of hidden states shows that linear probes help predict answer changes. Conclusion: Proposes stationary accuracy as a principled robustness metric for interactive settings, exposes model instability under repeated questioning, and stresses the importance of addressing it for high-stakes and interactive applications. Abstract: As large language models (LLMs) are adopted in an increasingly wide range of applications, user-model interactions have grown in both frequency and scale. Consequently, research has focused on evaluating the robustness of LLMs, an essential quality for real-world tasks. In this paper, we employ simple multi-turn follow-up prompts to evaluate models' answer changes, model accuracy dynamics across turns with Markov chains, and examine whether linear probes can predict these changes. Our results show significant vulnerabilities in LLM robustness: a simple "Think again" prompt led to an approximate 10% accuracy drop for Gemini 1.5 Flash over nine turns, while combining this prompt with a semantically equivalent reworded question caused a 7.5% drop for Claude 3.5 Haiku. Additionally, we find that model accuracy across turns can be effectively modeled using Markov chains, enabling the prediction of accuracy probabilities over time. This allows for estimation of the model's stationary (long-run) accuracy, which we find to be on average approximately 8% lower than its first-turn accuracy for Gemini 1.5 Flash. Our results from a model's hidden states also reveal evidence that linear probes can help predict future answer changes. Together, these results establish stationary accuracy as a principled robustness metric for interactive settings and expose the fragility of models under repeated questioning. Addressing this instability will be essential for deploying LLMs in high-stakes and interactive settings where consistent reasoning is as important as initial accuracy.
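
To make the notion of stationary accuracy concrete, here is a minimal sketch under a simplifying assumption: a two-state Markov chain (answer correct / answer incorrect) with fixed per-turn transition probabilities. The abstract does not specify the paper's state space, so this chain, the function names, and the numbers in the usage note are illustrative only.

```python
def accuracy_over_turns(first_turn_acc, p_correct_to_wrong, p_wrong_to_correct, turns):
    """Expected accuracy at each turn of a two-state chain (assumed model).

    acc[t+1] = acc[t] * (1 - p_correct_to_wrong) + (1 - acc[t]) * p_wrong_to_correct
    """
    acc = [first_turn_acc]
    for _ in range(turns - 1):
        prev = acc[-1]
        acc.append(prev * (1 - p_correct_to_wrong) + (1 - prev) * p_wrong_to_correct)
    return acc


def stationary_accuracy(p_correct_to_wrong, p_wrong_to_correct):
    """Long-run probability of the 'correct' state for the two-state chain."""
    total = p_correct_to_wrong + p_wrong_to_correct
    if total == 0:
        raise ValueError("chain never changes state; stationary accuracy is undefined")
    return p_wrong_to_correct / total
```

With made-up values such as a first-turn accuracy of 0.80, a 0.10 chance of flipping a correct answer, and a 0.25 chance of repairing a wrong one, `stationary_accuracy(0.10, 0.25)` is 0.25 / 0.35, or about 0.714, and `accuracy_over_turns` decays from 0.80 toward that value. The gap between the first entry and the limit is the kind of first-turn versus long-run difference the entry reports.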

[26] Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data

Ashish Kattamuri,Arpita Vats,Harshwardhan Fartale,Rahul Raja,Akshata Kishore Moharir,Ishita Prasad

Main category: cs.CL

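TL;DR: Studies gender bias dynamics in recursive prompt-based generation, finding that bias converges toward an equilibrium at the model's inherent bias level rather than amplifying monotonically; contrastive augmentation substantially reduces downstream bias even though embedding-based similarity metrics report higher bias, which shows that the fairness of synthetic data needs multidimensional evaluation.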

Details

Motivation: Recursive prompting scales synthetic data generation but may amplify gender bias, so the way bias evolves needs to be understood and effective mitigation strategies developed. Method: Runs experiments over three generations of recursive text generation and analyzes bias dynamics under different initial bias levels and four mitigation strategies, using three evaluation frameworks: rule-based pattern matching, embedding-based semantic similarity, and downstream task performance. Result: Low initial bias rises by 36% toward the model's inherent bias level, while high initial bias falls by 26% toward it; contrastive augmentation reduces downstream bias by 91% on average but produces higher embedding-based bias scores. Conclusion: Bias evolves toward an equilibrium rather than amplifying continuously, and semantic similarity metrics can diverge from actual fairness outcomes, so multidimensional evaluation is needed to ensure the fairness of synthetic data. Abstract: Recursive prompting with large language models enables scalable synthetic dataset generation but introduces the risk of bias amplification. We investigate gender bias dynamics across three generations of recursive text generation using three complementary evaluation frameworks: rule-based pattern matching, embedding-based semantic similarity, and downstream task performance. Experiments with three initial bias levels (0.1, 0.3, 0.6) and four mitigation strategies reveal equilibrium dynamics rather than monotonic amplification. The low initial bias amplifies toward the model's inherent bias level (+36%), whereas the high initial bias decays toward it (-26%). Among mitigation methods, contrastive augmentation, which introduces gender-swapped variants, achieves significant downstream bias reduction (98.8% for low initial bias and 91% on average) despite producing higher embedding-based bias scores. This paradox demonstrates that semantic similarity metrics may diverge from behavioral fairness outcomes, highlighting the need for multidimensional evaluation in responsible synthetic data generation.

[27] Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

Juntu Zhao,Jialing Zhang,Chongxuan Li,Dequan Wang

Main category: cs.CL

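TL;DR: Proposes studying the "hidden language" of multimodal systems through multi-round "telephone games", using the systems' preference bias during image compression and reconstruction to quantify the connection strength between concepts, and contributes Telescope, a dataset of more than ten thousand concept pairs.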

Details

Motivation: Current closed-source multimodal systems are black boxes, so the internal language they use to understand the world is opaque; an interpretable method is needed to probe their hidden semantic structure. Method: Runs multi-round "telephone games" that exploit the systems' preference bias in the image-to-text-to-image conversion, observes how concept co-occurrence frequencies change, and infers concept connection strength from those changes; uses reasoning LLMs to uncover deeper concept relationships that go beyond surface similarity. Result: Builds the Telescope dataset (more than 10,000 concept pairs) and a global map of concept connections in multimodal systems, identifying preference bias inherited from training, assessing advances in generalization capability, and finding more stable pathways for fragile concept connections. Conclusion: The study offers a new perspective on the internal language of multimodal systems and advances research on their interpretability and controllability. Abstract: Recent closed-source multimodal systems have made great advances, but their hidden language for understanding the world remains opaque because of their black-box architectures. In this paper, we use the systems' preference bias to study their hidden language: During the process of compressing the input images (typically containing multiple concepts) into texts and then reconstructing them into images, the systems' inherent preference bias introduces specific shifts in the outputs, disrupting the original input concept co-occurrence. We employ the multi-round "telephone game" to strategically leverage this bias. By observing the co-occurrence frequencies of concepts in telephone games, we quantitatively investigate the concept connection strength in the understanding of multimodal systems, i.e., "hidden language." We also contribute Telescope, a dataset of 10,000+ concept pairs, as the database of our telephone game framework. Our telephone game is test-time scalable: By iteratively running telephone games, we can construct a global map of concept connections in multimodal systems' understanding. Here we can identify preference bias inherited from training, assess generalization capability advancement, and discover more stable pathways for fragile concept connections. Furthermore, we use Reasoning-LLMs to uncover unexpected concept relationships that transcend textual and visual similarities, inferring how multimodal systems understand and simulate the world. This study offers a new perspective on the hidden language of multimodal systems and lays the foundation for future research on the interpretability and controllability of multimodal systems.

[28] Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

Zijian Chen,Wenjun Zhang,Guangtao Zhai

Main category: cs.CL

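TL;DR: Introduces Squid Game, a dynamic adversarial evaluation environment that assesses the multi-faceted abilities of large language models (LLMs) under resource-constrained, asymmetric-information settings, and points to possible evaluation contamination in existing static benchmarks.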

Details

Motivation: Existing benchmarks struggle to keep pace with LLM development, carry a risk of data contamination, and do not evaluate LLM behavior under pressure, so a more trustworthy, dynamic, and adversarial evaluation framework is needed. Method: Designs Squid Game, a dynamic adversarial environment with six elimination-style levels in which LLMs play interactively against each other to test instruction following, coding, reasoning, planning, and safety alignment, and evaluates more than 50 LLMs on it. Result: Across more than 50 LLMs, the study observes a clear generational jump in performance within the same model lineage and finds that some models win through speculative shortcuts, suggesting possible higher-level evaluation paradigm contamination in static benchmarks; correlation analyses show that dynamic evaluation can complement static evaluation. Conclusion: Squid Game offers a new dynamic, adversarial evaluation paradigm that reflects LLM behavior in complex environments more realistically and helps build more trustworthy evaluation. Abstract: Contemporary benchmarks are struggling to keep pace with the development of large language models (LLMs). Although they are indispensable to evaluate model performance on various tasks, it is uncertain whether the models trained on Internet data have genuinely learned how to solve problems or merely seen the questions before. This potential data contamination issue presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, existing benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings elaborated to evaluate LLMs through interactive gameplay against other LLM opponents. Notably, Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, such as instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational phase transition on performance in the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating the possibility of higher-level evaluation paradigm contamination in static benchmarks. Furthermore, we compare prominent LLM benchmarks and Squid Game with correlation analyses, highlighting that dynamic evaluation can serve as a complementary part for static evaluations. The code and data will be released in the future.

Eyal Rabin,Zohar Elyoseph,Rotem Israel-Fishelson,Adi Dali,Ravit Nussinson

Main category: cs.CL

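TL;DR: Finds that advanced text-to-speech systems slow their speech rate when prompted to be polite and formal, which shows that AI has implicitly learned subtle social cues in human language.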

Details

Motivation: To examine whether AI can learn human social conventions that are not explicitly programmed, in particular the use of speech rate to convey politeness. Method: Prompts 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under "polite and formal" and "casual and informal" conditions, then measures and compares the resulting speech rate. Result: On both platforms the polite prompt produces significantly slower speech than the casual prompt, with very large effect sizes; the difference is statistically significant for all AI Studio voices and for most OpenAI voices. Conclusion: AI can implicitly acquire and reproduce psychological nuances of human communication, which points to its emerging role as a social actor capable of reinforcing human social norms. Abstract: Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness - a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both "polite and formal" and "casual and informal" conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio's voices and for a large majority of OpenAI's voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.

[30] Where does an LLM begin computing an instruction?

Aditya Pola,Vineeth N. Balasubramanian

Main category: cs.CL

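TL;DR: Studies where in the layer stack instruction following begins, that is, the point where reading gives way to doing. Using activation patching and simple datasets on Llama-family models, the authors find an inflection point, termed "onset": interventions before it change predictions, interventions after it are largely ineffective, and multi-hop composite tasks show a similar onset location.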

Details

Motivation: Identifying the layer at which a model moves from understanding its input to executing the instruction helps explain the internal mechanism of instruction following. Method: Builds three simple datasets (Key-Value, Quote Attribution, Letter Selection) and their two-hop compositions, applies activation patching to minimal-contrast prompt pairs on Llama-family models, and measures a layer-wise flip rate to locate the onset layer. Result: Llama-family models show a clear onset inflection point: substituting residual activations before it changes the predicted answer, while doing so after it has little effect; multi-hop tasks have an onset location similar to single-hop tasks. Conclusion: Onset is an effective indicator of where instruction following begins; the method is simple and replicable and supports comparison across tasks and model sizes. Abstract: Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins, the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset, where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
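
The flip-rate measurement in this entry can be illustrated with a small sketch. It covers only the bookkeeping after patching has been run: the answers produced with and without patching are assumed to be given, and the onset rule used here (the first layer after the last one whose flip rate is at or above a threshold) is a hypothetical operationalization, since the abstract does not state the exact criterion. The function names and the 0.5 default threshold are also assumptions.

```python
def flip_rate(original_answers, patched_answers):
    """Fraction of prompt pairs whose predicted answer changes after patching."""
    if len(original_answers) != len(patched_answers) or not original_answers:
        raise ValueError("answer lists must be non-empty and of equal length")
    flips = sum(o != p for o, p in zip(original_answers, patched_answers))
    return flips / len(original_answers)


def find_onset(flip_rates, threshold=0.5):
    """Index of the first layer after which patching is largely ineffective.

    flip_rates[i] is the flip rate when patching at layer i. Returns None if
    no layer reaches the threshold (assumed rule, not the paper's definition).
    """
    last_effective = None
    for layer, rate in enumerate(flip_rates):
        if rate >= threshold:
            last_effective = layer
    return None if last_effective is None else last_effective + 1
```

For a profile such as `[0.9, 0.85, 0.8, 0.3, 0.1, 0.05]`, `find_onset` returns layer 3, the point where the flip rate drops and stays low. Comparing that index across tasks or model sizes mirrors the comparison the entry describes.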

[31] "As Eastern Powers, I will veto." : An Investigation of Nation-level Bias of Large Language Models in International Relations

Jonghyeon Choi,Yeonjun Choi,Hyun-chul Kim,Beakcheol Jang

Main category: cs.CL

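TL;DR: Systematically studies nation-level bias of large language models (LLMs) in International Relations, building a three-test evaluation framework on historical United Nations Security Council data. Models are generally favorable toward western nations and unfavorable toward Russia, and the bias is multidimensional, varying with model and task. Models with stronger reasoning show less bias, and a debiasing framework combining Retrieval-Augmented Generation with self-reflection reduces nation-level bias and improves performance, underscoring the need to assess bias alongside performance in this domain.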

Details

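Motivation: To identify and quantify nation-level bias of LLMs in International Relations contexts systematically, especially their stance toward the permanent members of the UN Security Council, so that model outputs are less affected by implicit geopolitical bias and more fair and reliable in sensitive domains. Method: Builds a nation-level bias evaluation framework of three tests from the UN Security Council's historical voting records and statements, and runs it on several mainstream LLMs (such as GPT-4o-mini and Llama-3.3-70B); analyzes each model's attitude toward the five permanent members; and proposes a debiasing framework that combines Retrieval-Augmented Generation (RAG) with Reflexion-based self-reflection to strengthen factual reasoning. Result: LLMs generally favor western nations and show negative bias toward Russia, but the direction and magnitude of bias vary with model and context, which indicates that the bias is multidimensional; models with stronger reasoning abilities show less bias; the proposed debiasing framework significantly reduces nation-level bias, particularly for GPT-4o-mini and Llama-3.3-70B, while also improving task performance. Conclusion: LLMs show significant, multidimensional nation-level bias on International Relations tasks, shaped by both the model and the task context; strengthening factual reasoning (for example by combining RAG with self-reflection) mitigates it; bias must therefore be assessed together with performance when LLMs are applied in highly sensitive domains such as International Relations. Abstract: This paper systematically examines nation-level biases exhibited by Large Language Models (LLMs) within the domain of International Relations (IR). Leveraging historical records from the United Nations Security Council (UNSC), we developed a bias evaluation framework comprising three distinct tests to explore nation-level bias in various LLMs, with a particular focus on the five permanent members of the UNSC. Experimental results show that, even with the general bias patterns across models (e.g., favorable biases toward the western nations, and unfavorable biases toward Russia), these still vary based on the LLM. Notably, even within the same LLM, the direction and magnitude of bias for a nation change depending on the evaluation context. This observation suggests that LLM biases are fundamentally multidimensional, varying across models and tasks. We also observe that models with stronger reasoning abilities show reduced bias and better performance. Building on this finding, we introduce a debiasing framework that improves LLMs' factual reasoning combining Retrieval-Augmented Generation with Reflexion-based self-reflection techniques. Experiments show it effectively reduces nation-level bias, and improves performance, particularly in GPT-4o-mini and Llama-3.3-70B. Our findings emphasize the need to assess nation-level bias alongside performance when applying LLMs in the IR domain.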

[32] $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

Dong Liu,Yanxuan Yu

Main category: cs.CL

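TL;DR: Proposes $π$-Attention, a periodic sparse Transformer that uses ring-local neighborhoods, deterministic stride skips, and an adaptive fusion gate to model long-range dependencies efficiently at linear complexity, outperforming conventional sparse attention mechanisms by a clear margin.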

Details

Motivation: To address the quadratic complexity of attention in standard Transformers with respect to sequence length, and to overcome the limited receptive field and lack of adaptability of existing sparse attention methods such as RingAttention. Method: $π$-Attention factorizes attention into ring-local neighborhoods, periodic skips with stride $π$, and a learnable adaptive fusion gate, forming a sparse attention pattern with periodic structure; the authors prove that it has a better receptive field growth rate. Result: $π$-Attention performs strongly on language modeling, retrieval, and vision-language tasks, with 8.3% lower perplexity than RingAttention and 50% fewer GPUs for the same context length; the theoretical analysis shows that its receptive field grows faster with sequence length. Conclusion: By introducing periodic sparse structure and adaptive fusion, $π$-Attention substantially improves long-sequence modeling while keeping computation linear in context length, offering a new direction for efficient Transformer design. Abstract: Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present $π$-Attention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $π$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that $π$-Attention achieves $\mathcal{O}(kL + π\log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $π$ is the skip period, and $L$ is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that $π$-Attention matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
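
A sparsity pattern of the kind this entry describes, a local ring window plus periodic skips, can be sketched as a boolean attention mask. This is only an illustration of the general idea under stated assumptions: the abstract does not give the exact mask definition, so the wrap-around ring distance, the bidirectional skips, and the cap of `num_skips` skip targets per direction (chosen so each row stays a fixed size) are hypothetical, as are the function and parameter names. The adaptive fusion gate is not modeled.

```python
def periodic_sparse_mask(seq_len, window, period, num_skips):
    """Build a seq_len x seq_len boolean mask (True = may attend).

    Token i attends to:
      * ring-local neighbours within `window` positions (with wrap-around), and
      * up to `num_skips` tokens in each direction at multiples of `period`.
    Each row has at most 2 * window + 1 + 2 * num_skips True entries, so the
    total number of attended pairs grows linearly with seq_len.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for offset in range(-window, window + 1):
            mask[i][(i + offset) % seq_len] = True
        for m in range(1, num_skips + 1):
            mask[i][(i + m * period) % seq_len] = True
            mask[i][(i - m * period) % seq_len] = True
    return mask
```

With `seq_len=16, window=1, period=4, num_skips=2`, token 0 attends to positions 15, 0, 1 (local) and 4, 8, 12 (skips), six entries in total, whereas a dense mask would have sixteen. In practice such a mask would be applied to the attention scores before the softmax.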

[33] Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs

Ajwad Abrar,Nafisa Tabassum Oeshy,Prianka Maheru,Farzana Tabassum,Tareque Mohmud Chowdhury

Main category: cs.CL

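TL;DR: Proposes a framework that combines TextRank, medical named entity recognition, and large language models to improve the faithfulness of medical text summarization; it outperforms baseline models on multiple metrics, and more than 80% of its summaries preserve critical medical information.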

Details

Motivation: Unfaithful summaries can misrepresent medical details and pose serious risks, so the faithfulness of medical text summarization needs to improve. Method: Combines TextRank-based sentence extraction and medical named entity recognition with a large language model (LLaMA-2-7B), fine-tuned on the MeQSum and BanglaCHQ-Summ datasets. Result: Outperforms zero-shot baselines and prior systems on ROUGE, BERTScore, readability, SummaC, and AlignScore; human evaluation shows that more than 80% of the summaries preserve critical medical information. Conclusion: Faithfulness is an essential dimension of reliable medical summarization, and the proposed approach supports safer deployment of LLMs in healthcare settings. Abstract: Summarizing consumer health questions (CHQs) can ease communication in healthcare, but unfaithful summaries that misrepresent medical details pose serious risks. We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition with large language models (LLMs) to enhance faithfulness in medical text summarization. In our experiments, we fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, achieving consistent improvements across quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics, and outperforming zero-shot baselines and prior systems. Human evaluation further shows that over 80% of generated summaries preserve critical medical information. These results highlight faithfulness as an essential dimension for reliable medical summarization and demonstrate the potential of our approach for safer deployment of LLMs in healthcare contexts.

[34] TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English

Fethi Bougares,Salima Mdhaffar,Haroun Elleuch,Yannick Estève

Main category: cs.CL

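TL;DR: Introduces TEDxTN, the first publicly available Tunisian Arabic to English speech translation dataset, intended to ease data scarcity for Arabic dialects.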

Details

Motivation: To address the data scarcity that Arabic dialects such as Tunisian Arabic face in natural language processing and to encourage further research. Method: Collects, segments, transcribes, and translates 108 TEDx talks, totaling 25 hours of code-switched speech from speakers with accents from more than 11 regions of Tunisia, following internally developed annotation guidelines. Result: Releases the TEDxTN dataset together with its annotation guidelines, and reports baseline speech recognition and speech translation results using multiple pre-trained and fine-tuned end-to-end models. Conclusion: TEDxTN is the first open-source speech translation corpus for the Tunisian dialect and a valuable resource for natural language processing research on Tunisian Arabic. Abstract: In this paper, we introduce TEDxTN, the first publicly available Tunisian Arabic to English speech translation dataset. This work is in line with the ongoing effort to mitigate the data scarcity obstacle for a number of Arabic dialects. We collected, segmented, transcribed and translated 108 TEDx talks following our internally developed annotation guidelines. The collected talks represent 25 hours of speech with code-switching that cover speakers with various accents from over 11 different regions of Tunisia. We make the annotation guidelines and corpus publicly available. This will enable the extension of TEDxTN to new talks as they become available. We also report results for strong baseline systems of Speech Recognition and Speech Translation using multiple pre-trained and fine-tuned end-to-end models. This corpus is the first open source and publicly available speech translation corpus of Code-Switching Tunisian dialect. We believe that this is a valuable resource that can motivate and facilitate further research on the natural language processing of Tunisian Dialect.

[35] Sabiá: Um Chatbot de Inteligência Artificial Generativa para Suporte no Dia a Dia do Ensino Superior

Guilherme Biava Rodrigues,Franciele Beal,Marlon Marcon,Alinne Cristinne Corrêa Souza,André Roberto Ortoncelli,Francisco Carlos Monteiro Souza,Rodolfo Adamshuk Silva

Main category: cs.CL

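TL;DR: This project proposes a chatbot built on Generative Artificial Intelligence (GenAI) and Retrieval-Augmented Generation (RAG) to simplify students' access to day-to-day academic information.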

Details

Motivation: Students often struggle to access day-to-day academic information that is scattered across many institutional documents and websites, which leads to unclear and confusing information. Method: Several GenAI models were tested and compared using quality metrics and the LLM-as-a-Judge approach; Gemini 2.0 Flash and Gemma 3n were selected. Result: Gemini 2.0 Flash stood out for quality and speed, while Gemma 3n is recommended for its good performance and open-source nature. Conclusion: The proposed GenAI- and RAG-based chatbot can consolidate fragmented information and improve the efficiency and experience of students looking up routine university information. Abstract: Students often report difficulties in accessing day-to-day academic information, which is usually spread across numerous institutional documents and websites. This fragmentation results in a lack of clarity and causes confusion about routine university information. This project proposes the development of a chatbot using Generative Artificial Intelligence (GenAI) and Retrieval-Augmented Generation (RAG) to simplify access to such information. Several GenAI models were tested and evaluated based on quality metrics and the LLM-as-a-Judge approach. Among them, Gemini 2.0 Flash stood out for its quality and speed, and Gemma 3n for its good performance and open-source nature.

[36] LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Grace Byun,Swati Rajwal,Jinho D. Choi

Main category: cs.CL

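TL;DR: This study examines the feasibility of using GPT-4o to grade short-answer quizzes and project reports in an undergraduate Computational Linguistics course. Its scores correlate strongly with teaching-assistant grades (up to 0.98) and match exactly in 55% of quiz cases, with some variability on technical, open-ended questions.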

Details

Motivation: LLMs are increasingly applied to educational tasks, but their agreement with human grading in real classroom settings remains underexamined. Method: Responses from approximately 50 students across five quizzes and project reports from 14 teams were collected, graded automatically with GPT-4o, and compared against grades assigned independently by the course teaching assistants. Result: GPT-4o reaches a correlation of up to 0.98 with human graders on quizzes and matches their scores exactly in 55% of cases; on project reports it aligns well overall but shows some variability on technical, open-ended questions. Conclusion: LLMs show considerable potential for automated grading, especially on structured questions, but remain limited on complex, open-ended tasks and need further refinement and validation. Abstract: Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
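The two agreement figures quoted for [36] (correlation and exact score agreement) can be computed as in the following minimal sketch. It assumes Pearson correlation, which the abstract does not specify, and uses made-up scores; the function names `pearson` and `exact_agreement` are illustrative and not taken from the paper's released code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def exact_agreement(xs, ys):
    """Fraction of items on which both graders gave the identical score."""
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)

# Made-up scores for six answers (illustrative only, not the study's data).
human_scores = [5, 3, 4, 2, 5, 1]
llm_scores = [5, 3, 3, 2, 4, 1]

r = pearson(human_scores, llm_scores)
agreement = exact_agreement(human_scores, llm_scores)
print(f"correlation={r:.2f}, exact agreement={agreement:.0%}")
```

The two numbers answer different questions: correlation rewards consistent ranking even when scores are offset, while exact agreement only counts identical scores, which is why a grader can correlate highly and still match exactly on only about half of the items.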

[37] Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

Abir Harrasse,Florent Draye,Zhijing Jin,Bernhard Schölkopf

Main category: cs.CL

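TL;DR: This study examines how multilingual LLMs internally represent different languages. It finds that the models use nearly identical representations across languages, with language-specific decoding emerging in later layers, where a small set of high-frequency language features in the final layers reads out language identity. Intervening on these features changes the output language, and the dominant training language shapes the decoding pathways.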

Details

Motivation: To understand why multilingual LLMs still favor the dominant training language despite shared representations, and to uncover how they represent languages internally. Method: Train a series of LLMs on different mixtures of multilingual data and analyze their internal mechanisms using cross-layer transcoders (CLT) and attribution graphs. Result: The models show a pivot-language mechanism: representations are nearly identical across languages, and language-specific decoding emerges in later layers. Attribution analyses show that decoding relies in part on a small set of high-frequency language features in the final layers that read out language identity from the first layers, and intervening on these features can substitute one output language for another. Conclusion: Multilingual LLMs use pivot-language representations, the dominant training language influences the decoding pathways, and understanding this mechanism is crucial for improving multilingual alignment. Abstract: Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? To address this, we train a series of LLMs on different mixtures of multilingual data and analyze their internal mechanisms using cross-layer transcoders (CLT) and attribution graphs. Our results provide strong evidence for pivot language representations: the model employs nearly identical representations across languages, while language-specific decoding emerges in later layers. Attribution analyses reveal that decoding relies in part on a small set of high-frequency language features in the final layers, which linearly read out language identity from the first layers in the model. By intervening on these features, we can suppress one language and substitute another in the model's outputs. Finally, we study how the dominant training language influences these mechanisms across attribution graphs and decoding pathways. We argue that understanding this pivot-language mechanism is crucial for improving multilingual alignment in LLMs.

[38] Reinforcing Stereotypes of Anger: Emotion AI on African American Vernacular English

Rebecca Dorn,Christina Chance,Casandra Rusti,Charles Bickham,Kai-Wei Chang,Fred Morstatter,Kristina Lerman

Main category: cs.CL

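TL;DR: This study finds that current emotion recognition models are markedly biased on African American Vernacular English (AAVE): they misclassify AAVE text as angry far more often than General American English (GAE), which risks reinforcing racial stereotypes.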

Details

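Motivation: Emotion recognition models rely mainly on annotations that reflect dominant cultural norms, so they may fail to recognize emotional expression in dialects such as AAVE, with unfair outcomes in high-stakes domains like mental health and hiring. The study therefore evaluates how these models perform on AAVE compared to GAE. Method: The study analyzes 2.7 million tweets geo-tagged within Los Angeles, scores the strength of AAVE features computationally, and builds a dataset of 875 tweets with high and low AAVE densities. African American, AAVE-fluent (ingroup) annotators provide "silver" labels for AAVE-dense tweets, which are used to evaluate GPT, BERT-based, and SpanEmo models; linear regressions relate model predictions to dialect features and neighborhood demographics. Result: GPT and BERT-based models produce false positive anger predictions on AAVE at more than double the rate on GAE; SpanEmo's false positive rate for anger rises from 25% on GAE to 60% on AAVE. Model predictions and non-ingroup annotations correlate more strongly with profanity-based AAVE features than ingroup annotations do. Neighborhoods with higher proportions of African American residents are associated with higher predicted anger (r = 0.27) and lower predicted joy (r = -0.10). Conclusion: Emotion AI systems show systematic bias on AAVE that may reinforce racial stereotypes, exposing a safety issue in affective computing. The study calls for culturally and dialect-informed affective computing systems. Abstract: Automated emotion detection is widely used in applications ranging from well-being monitoring to high-stakes domains like mental health and hiring. However, models often rely on annotations that reflect dominant cultural norms, limiting model ability to recognize emotional expression in dialects often excluded from training data distributions, such as African American Vernacular English (AAVE). This study examines emotion recognition model performance on AAVE compared to General American English (GAE). We analyze 2.7 million tweets geo-tagged within Los Angeles. Texts are scored for strength of AAVE using computational approximations of dialect features. Annotations of emotion presence and intensity are collected on a dataset of 875 tweets with both high and low AAVE densities. To assess model accuracy on a task as subjective as emotion perception, we calculate community-informed "silver" labels where AAVE-dense tweets are labeled by African American, AAVE-fluent (ingroup) annotators. On our labeled sample, GPT and BERT-based models exhibit false positive prediction rates of anger on AAVE more than double those on GAE. SpanEmo, a popular text-based emotion model, increases false positive rates of anger from 25 percent on GAE to 60 percent on AAVE. 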
Additionally, a series of linear regressions reveals that models and non-ingroup annotations are significantly more correlated with profanity-based AAVE features than ingroup annotations. Linking Census tract demographics, we observe that neighborhoods with higher proportions of African American residents are associated with higher predictions of anger (Pearson's correlation r = 0.27) and lower joy (r = -0.10). These results reveal an emergent safety issue of emotion AI reinforcing racial stereotypes through biased emotion classification. We emphasize the need for culturally and dialect-informed affective computing systems.
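The per-dialect comparison in [38] rests on the false positive rate for the anger label, FP / (FP + TN). A minimal sketch of that computation follows; the labels are toy values chosen only to echo the 25% versus 60% contrast in the abstract, not the study's data, and `false_positive_rate` is a hypothetical helper name.

```python
def false_positive_rate(gold, pred):
    """FP / (FP + TN) for binary labels, where 1 means 'anger present'."""
    preds_on_negatives = [p for g, p in zip(gold, pred) if g == 0]
    if not preds_on_negatives:
        raise ValueError("need at least one gold-negative example")
    return sum(preds_on_negatives) / len(preds_on_negatives)

# Toy gold ("silver") labels and model predictions per dialect group.
toy = {
    "GAE": {"gold": [0, 0, 0, 0, 1], "pred": [1, 0, 0, 0, 1]},
    "AAVE": {"gold": [0, 0, 0, 0, 0, 1], "pred": [1, 1, 1, 0, 0, 1]},
}

rates = {
    dialect: false_positive_rate(labels["gold"], labels["pred"])
    for dialect, labels in toy.items()
}
for dialect, rate in rates.items():
    print(f"{dialect}: anger false positive rate = {rate:.0%}")
```

Only gold-negative tweets enter the denominator, so the metric isolates cases where the model predicts anger that the ingroup annotators did not perceive.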

[39] Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs

Stefan Horoi,Sangwoo Cho,Supriyo Chakraborty,Shi-Xiong Zhang,Sambit Sahu,Guy Wolf,Genta Indra Winata

Main category: cs.CL

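TL;DR: Proposes an alignment-first strategy that uses the symmetries of the Transformer architecture to align model parameter spaces, which improves task arithmetic for skill transfer between large language models, particularly for reasoning skills.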

Details

Motivation: Task arithmetic suffers from negative interference when models have diverged during training, which limits skill transfer between large language models, so the mismatch between parameter spaces must be resolved. Method: Use the permutation, rotation, and scaling symmetries of Transformer architectures to align parameter spaces for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based alignment, and then apply task arithmetic for skill transfer. Result: On several challenging reasoning benchmarks the method consistently outperforms standard task arithmetic and successfully transfers advanced reasoning skills to a non-reasoning model. Conclusion: The alignment-first strategy offers an effective way to merge and transfer skills across evolving LLM families, reducing redundant fine-tuning and improving model adaptability. Abstract: Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs), but it often suffers from negative interference when models have diverged during training. We address this limitation by first aligning the models' parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We adapt parameter space alignment for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based approaches. Using this alignment-first strategy, we successfully transfer advanced reasoning skills to a non-reasoning model. Experiments on challenging reasoning benchmarks show that our method consistently outperforms standard task arithmetic. This work provides an effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.
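The two ingredients of [39], permutation symmetry and task arithmetic, can be illustrated on a toy two-layer ReLU network. This is a minimal sketch under simplifying assumptions: it covers only the permutation symmetry of a plain MLP (not the GQA or SwiGLU handling, nor rotation and scaling symmetries), and all function names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Two-layer network: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def permute_hidden(W1, W2, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2."""
    W1_perm = [W1[i] for i in perm]
    W2_perm = [[row[i] for i in perm] for row in W2]
    return W1_perm, W2_perm

def task_vector(finetuned, base):
    """Task vector: element-wise difference of flattened parameters."""
    return [f - b for f, b in zip(finetuned, base)]

def apply_task_vector(target, tau, scale=1.0):
    """Add a scaled task vector to another model's flattened parameters."""
    return [t + scale * d for t, d in zip(target, tau)]

W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]  # 3 hidden units, 2 inputs
W2 = [[1.0, 2.0, -1.0]]                      # 1 output
x = [1.0, 2.0]

W1_perm, W2_perm = permute_hidden(W1, W2, perm=[2, 0, 1])
print(mlp(W1, W2, x), mlp(W1_perm, W2_perm, x))  # same function, different weights

tau = task_vector(finetuned=[1.5, 1.0], base=[1.0, 2.0])
print(apply_task_vector(target=[0.0, 0.0], tau=tau, scale=0.5))
```

The permuted network computes the same function while its weights sit at different coordinates, so subtracting or adding parameters element-wise across two such models mixes unrelated units. That mismatch is the motivation for aligning parameter spaces before applying task arithmetic.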

[40] From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

Parisa Rabbani,Nimet Beyza Bozdag,Dilek Hakkani-Tür

Main category: cs.CL

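TL;DR: This study examines how consistently large language models (LLMs) judge the same content when it is posed as a factual query versus within a social dialogue. Conversational framing significantly affects model judgment, with an average performance change of 9.24%, revealing sycophantic or overly critical tendencies in social contexts.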

Details

Motivation: To assess how reliable LLMs are as judges in tasks involving social interaction, in particular whether they hold stable convictions when a direct factual judgment is reframed as a conversation. Method: An evaluation framework contrasts a model's judgment of a statement's correctness under a direct factual query with its judgment when the same information appears in a minimal dialogue, and applies pressure through a simple rebuttal to measure how firmly the model maintains its position. Result: GPT-4o-mini shows sycophantic tendencies under social framing, whereas Llama-8B-Instruct becomes overly critical; the average performance change across all models is 9.24%, showing that conversational framing significantly affects judgment. Conclusion: Even minimal dialogue context can significantly alter LLM judgment, so conversational framing is a key factor in LLM-based evaluation and deserves attention when building trustworthy dialogue systems. Abstract: LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.

[41] ICX360: In-Context eXplainability 360 Toolkit

Dennis Wei,Ronny Luss,Xiaomeng Hu,Lucas Monteiro Paes,Pin-Yu Chen,Karthikeyan Natesan Ramamurthy,Erik Miehling,Inge Vejsbjerg,Hendrik Strobelt

Main category: cs.CL

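TL;DR: This paper introduces ICX360, an open-source Python toolkit for explaining the outputs of large language models (LLMs), with a focus on the user-provided context or prompt.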

Details

Motivation: As large language models spread into higher-stakes applications, tools that can explain their outputs become crucial. Method: ICX360 implements three recent explanation methods that combine black-box and white-box techniques (via perturbations and gradients, respectively). Result: The toolkit provides quick-start guidance and detailed tutorials covering use cases such as retrieval-augmented generation, natural language generation, and jailbreaking. Conclusion: ICX360 offers a comprehensive and easy-to-use open-source solution for understanding and explaining LLM behavior. Abstract: Large Language Models (LLMs) have become ubiquitous in everyday life and are entering higher-stakes applications ranging from summarizing meeting transcripts to answering doctors' questions. As was the case with earlier predictive models, it is crucial that we develop tools for explaining the output of LLMs, be it a summary, list, response to a question, etc. With these needs in mind, we introduce In-Context Explainability 360 (ICX360), an open-source Python toolkit for explaining LLMs with a focus on the user-provided context (or prompts in general) that are fed to the LLMs. ICX360 contains implementations for three recent tools that explain LLMs using both black-box and white-box methods (via perturbations and gradients respectively). The toolkit, available at https://github.com/IBM/ICX360, contains quick-start guidance materials as well as detailed tutorials covering use cases such as retrieval augmented generation, natural language generation, and jailbreaking.

[42] A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge

Jongyoon Song,Sangwon Yu,Sungroh Yoon

Main category: cs.CL

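TL;DR: This paper studies the format-level negative bias that large language models show in binary decision tasks, finds that models tend to answer negatively when they lack sufficient knowledge, and proposes a pipeline for constructing an evaluation set. Providing relevant context or adding an "I don't know" option reduces the bias, whereas chain-of-thought prompting amplifies it.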

Details

Motivation: Prior work focuses on detecting and mitigating negative bias in large language models, but the underlying factors remain unclear, and the interplay between prompt format and semantics in particular is underexplored. Method: A pipeline systematically constructs an evaluation set that splits the data into three subsets according to the model's parametric knowledge (correct, incorrect, and insufficient relevant knowledge), and model responses are analyzed under different prompting scenarios. Result: Models take a shortcut and answer negatively when they lack sufficient knowledge; prompt format influences negative bias more than the semantics of the negative response; adding an "I don't know" option or providing context reduces the bias, whereas chain-of-thought prompting amplifies it. Conclusion: Negative bias in LLMs depends strongly on prompt format and knowledge state, and different prompting strategies can modulate its degree, which offers important guidance for mitigating it. Abstract: Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying detailed factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning that the prompt format influences their responses more than the semantics of the negative response. For the fine-grained study of the negative bias, we introduce a pipeline for constructing the evaluation set, which systematically categorizes the dataset into three subsets based on the model's parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an "I don't know" option generally reduces negative bias, whereas chain-of-thought prompting tends to amplify the bias. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.

[43] MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking

Nishant Mishra,Wilker Aziz,Iacer Calixto

Main category: cs.CL

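TL;DR: MedPath is a large-scale, multi-domain biomedical entity linking (EL) dataset that integrates nine existing expert-annotated EL datasets, normalizes entities with the latest UMLS, and maps them to 62 other biomedical vocabularies. Crucially, it provides full ontological paths, from general to specific, in up to 11 vocabularies, supporting research on semantically rich and interpretable EL systems.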

Details

Motivation: Progress in biomedical named entity recognition and entity linking is held back by fragmented data, a lack of resources for building explainable models, and the limitations of semantically blind evaluation metrics. Method: Build the MedPath dataset by integrating nine existing expert-annotated EL datasets, normalizing entities with the latest UMLS, extending mappings to 62 biomedical vocabularies, and providing full ontological paths in up to 11 vocabularies. Result: MedPath supports training and evaluation of semantically rich and interpretable EL systems and the development of the next generation of interoperable and explainable clinical NLP models. Conclusion: MedPath is an important resource for biomedical NLP that advances entity linking research across domains, vocabularies, and interpretability. Abstract: Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths -- i.e., from general to specific -- in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantic-rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.
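A "full ontological path, from general to specific" as described for [43] can be pictured as a walk from a concept up its parent links, reversed. The sketch below assumes a single-parent hierarchy stored as a dictionary; the three-node hierarchy and the name `ontological_path` are hypothetical illustrations and are not taken from UMLS or from the MedPath data format.

```python
def ontological_path(concept, parent_of):
    """Return the path from the root (most general) down to `concept`."""
    path = []
    seen = set()
    node = concept
    while node is not None:
        if node in seen:  # guard against malformed, cyclic hierarchies
            raise ValueError(f"cycle detected at {node!r}")
        seen.add(node)
        path.append(node)
        node = parent_of.get(node)
    return list(reversed(path))

# Hypothetical toy hierarchy: child -> parent (None marks the root).
TOY_PARENTS = {
    "Myocardial infarction": "Heart disease",
    "Heart disease": "Cardiovascular disease",
    "Cardiovascular disease": None,
}

print(" > ".join(ontological_path("Myocardial infarction", TOY_PARENTS)))
```

Real biomedical vocabularies often allow several parents per concept, in which case a concept has multiple paths and the walk would need to branch; the single-parent case is used here only to keep the idea visible.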

[44] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

Farima Fatahi Bayat,Pouya Pezeshkpour,Estevam Hruschka

Main category: cs.CL

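TL;DR: This paper studies Tool-Induced Myopia (TIM) in tool-augmented language models (TaLMs): when using external tools such as a code interpreter, models over-rely on tool outputs and neglect reasoning, which degrades reasoning quality. The authors introduce the PYMATH benchmark and a multi-dimensional evaluation suite, and find that tool use raises answer accuracy while reasoning deteriorates markedly, with errors shifting from arithmetic mistakes to failures of logic and creativity. They also propose a preference-optimization-based framework to mitigate the problem.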

Details

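Motivation: Tool-augmented language models improve task performance, but it is unclear whether their reasoning is trustworthy. The study aims to expose reasoning degradation that occurs even when tools are invoked correctly, in particular the use of tool outputs as a substitute for reasoning. Method: Build PYMATH, a benchmark of 1,679 competition-level mathematical problems, to analyze TaLM behavior with the Code Interpreter tool; design a multi-dimensional evaluation suite that compares the reasoning of models with and without tools, including pairwise comparisons of reasoning quality; and relate tool-call frequency to reasoning coherence. Result: TaLMs gain up to 19.3 percentage points in final-answer accuracy, but their reasoning consistently deteriorates, with non-tool models winning up to 41.5% more often in pairwise comparisons of the reasoning process. The more often tools are invoked, the less coherent the reasoning; errors shift from arithmetic mistakes to global reasoning failures, and TIM is present in about 55% of high-risk cases. Conclusion: Tool use improves answer accuracy but can lead models to neglect deeper reasoning, producing Tool-Induced Myopia (TIM). The proposed preference-optimization framework makes TaLMs treat tool outputs as assistive evidence rather than a substitute for reasoning, improving both answer accuracy and reasoning depth. Abstract: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. 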
Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: https://github.com/megagonlabs/TIM.

[45] Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering

Xueren Ge,Sahil Murtaza,Anthony Cortez,Homa Alemzadeh

Main category: cs.CL

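TL;DR: Proposes Expert-CoT and ExpertRAG, which use structured context about clinical subject area and certification level to improve LLM performance on emergency medical service question answering.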

Details

Motivation: Existing LLMs overlook the structured domain knowledge that professionals rely on in medical question answering, such as clinical subject area and certification level, which limits performance in high-stakes settings. Method: Build EMSQA, a dataset of 24.3K questions with knowledge bases of 40K documents, and propose Expert-CoT (chain-of-thought prompting conditioned on subject area and certification level) and ExpertRAG (a generation pipeline that combines subject area-aligned retrieval with real-world patient data). Result: In experiments on 4 LLMs, Expert-CoT improves accuracy by up to 2.05% over vanilla CoT; Expert-CoT combined with ExpertRAG improves by up to 4.59% over standard RAG; the 32B models pass all EMS certification simulation exams. Conclusion: Incorporating structured context about subject area and certification level markedly improves the accuracy and practicality of LLMs for professional medical question answering, with particular promise for high-stakes emergency settings. Abstract: Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.

[46] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Mengze Hong,Di Jiang,Weiwei Zhao,Yawen Li,Yihang Wang,Xinyuan Luo,Yanjie Sun,Chen Jason Zhang

Main category: cs.CL

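TL;DR: Proposes an interactive peer review simulation system built on multimodal LLMs and retrieval-augmented generation. It combines textual and visual information and uses OpenReview data to generate actionable revision suggestions, improving review quality and the transparency of scholarly collaboration.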

Details

Motivation: Existing academic review systems are limited by text-only inputs, weak contextual grounding, and a lack of actionable feedback, so they support manuscript revision poorly. Method: Build a multimodal review framework that integrates textual and visual information, uses retrieval-augmented generation (RAG) grounded in large-scale OpenReview data to improve review quality, and converts reviews into structured to-do lists in the Action:Objective[#] format. Result: Experiments show that the system generates more comprehensive and useful reviews, outperforming ablated baselines in review quality and alignment with expert standards. Conclusion: Through multimodal, community-aware, and actionable feedback, the system improves the efficiency and transparency of pre-submission manuscript revision and advances human-centered scholarly assistance. Abstract: While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.

[47] Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy

Ramya Kumar,Dhruv Gulwani,Sonit Singh

Main category: cs.CL

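TL;DR: This study compares machine learning and deep learning models for automatic Bloom's Taxonomy classification on a small dataset of 600 sentences. Support Vector Machines (SVM) with data augmentation perform best, reaching 94% accuracy, recall, and F1, whereas complex models such as RNNs and BERT overfit; among LLMs, OpenAI and Gemini do best in zero-shot evaluation (about 72-73% accuracy).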

Details

Motivation: To automate the classification of exam questions and learning outcomes according to Bloom's cognitive taxonomy, reducing teachers' manual labeling effort and making course design more efficient. Method: Classify 600 labeled sentences with traditional machine learning models (such as SVM and Naive Bayes), recurrent neural networks (LSTM and others), transformer models (BERT, RoBERTa), and large language models (OpenAI, Gemini, and others), and compare performance under different preprocessing and data augmentation strategies (such as synonym replacement and word embeddings). Result: SVM with data augmentation performs best, reaching 94% accuracy, recall, and F1; the RNN models and BERT overfit severely, and RoBERTa starts well but shows overfitting as training progresses; among LLMs, OpenAI and Gemini reach about 72-73% zero-shot accuracy, ahead of the other LLMs tested. Conclusion: With little data, simple models combined with data augmentation outperform complex deep models; SVM is the more robust and efficient choice for Bloom's Taxonomy classification, while large models should be applied with care to avoid overfitting or wasted resources. Abstract: This paper explores the automatic classification of exam questions and learning outcomes according to Bloom's Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories - Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation - was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (for example, synonym replacement, word embeddings, etc.). Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 scores with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially overcame it but began to show signs as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom's Taxonomy classification.

[48] Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D

Arsh Gupta,Ajay Narayanan Sridhar,Bonam Mingole,Amulya Yadav

Main category: cs.CL

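TL;DR: This study introduces a dataset of 176 rare disease cases drawn from the television series House M.D. to evaluate LLMs on narrative medical diagnosis. Model accuracy ranges from 16.48% to 38.64%, and newer model generations show a 2.3 times improvement.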

Details

Motivation: To explore how well LLMs handle rare disease diagnosis, a difficult and low-resource setting, and to fill the gap in research on their performance in narrative clinical reasoning. Method: Build a symptom-diagnosis dataset of rare diseases from a source validated for medical education, and evaluate diagnostic accuracy on four advanced LLMs: GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro. Result: Performance varies substantially across models, with accuracy from 16.48% to 38.64%; the newest model generation improves 2.3 times over the previous one, but diagnosis remains challenging overall. Conclusion: Current LLMs are still limited on rare disease diagnosis, although model progress shows a positive trend; the dataset provides a publicly accessible benchmark for AI-assisted diagnosis research. Abstract: Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
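The accuracy figures reported for [48] can be sanity-checked with a little arithmetic. On 176 cases, 16.48% and 38.64% are consistent with 29 and 68 correct diagnoses respectively; these counts are inferred here from the percentages and are not stated in the abstract.

```python
TOTAL_CASES = 176  # dataset size stated in the abstract

def accuracy_pct(correct, total=TOTAL_CASES):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)

# Inferred counts (an assumption): the only integers that reproduce
# the reported percentages on 176 cases.
lowest_correct = 29
highest_correct = 68

print(accuracy_pct(lowest_correct))                  # lowest reported accuracy
print(accuracy_pct(highest_correct))                 # highest reported accuracy
print(round(highest_correct / lowest_correct, 1))    # improvement factor
```

The ratio 68 / 29 is about 2.34, which matches the "2.3 times improvement" quoted in the abstract under the assumption that the factor compares the best and worst models.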

[49] CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology

Richard J. Young,Alice M. Matthews

Main category: cs.CL

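TL;DR: This study presents CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B and trained with contrastive learning on seven cardiology textbooks. It markedly improves semantic retrieval for clinical cardiology text, reaching 99.60% accuracy on cardiac-specific tasks, 15.94 percentage points above MedTE, the current state-of-the-art model.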

Details

Motivation: Existing biomedical text embedding models are built mainly on PubMed literature and struggle to capture the procedural knowledge and specialized terminology that clinical cardiology practice relies on, which limits their usefulness in clinical applications. Method: Train CardioEmbed with contrastive learning on the Qwen3-Embedding-8B architecture, using approximately 150,000 deduplicated sentences from seven cardiology textbooks and an InfoNCE loss with in-batch negatives. Result: The model reaches 99.60% accuracy on cardiac-specific semantic retrieval (+15.94 percentage points over MedTE), a Spearman correlation of 0.77 on BIOSSES, and an NDCG@10 of 0.61 on SciFact, indicating competitive performance on related biomedical domains. Conclusion: Domain-specialized training on comprehensive clinical textbooks markedly improves retrieval for cardiology text embeddings, and CardioEmbed models specialized clinical knowledge better than existing general medical embedding models. Abstract: Biomedical text embeddings have primarily been developed using research literature from PubMed, yet clinical cardiology practice relies heavily on procedural knowledge and specialized terminology found in comprehensive textbooks rather than research abstracts. This research-practice gap limits the effectiveness of existing embedding models for clinical applications in cardiology. This study trained CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B, using contrastive learning on a curated corpus of seven comprehensive cardiology textbooks totaling approximately 150,000 sentences after deduplication. The model employs InfoNCE loss with in-batch negatives and achieves 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks, a +15.94 percentage point improvement over MedTE, the current state-of-the-art medical embedding model. On MTEB medical benchmarks, the model obtained BIOSSES 0.77 Spearman and SciFact 0.61 NDCG@10, indicating competitive performance on related biomedical domains. Domain-specialized training on comprehensive clinical textbooks yields near-perfect cardiology retrieval (99.60% Acc@1), improving over MedTE by +15.94 percentage points.
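The training objective named for [49], InfoNCE with in-batch negatives, can be sketched in plain Python. For each anchor in a batch, its paired positive is the target and every other positive in the batch serves as a negative. The cosine similarity and the temperature of 0.05 are assumptions of this sketch; the abstract does not give the model's actual similarity function or hyperparameters.

```python
from math import exp, log, sqrt

def normalize(v):
    norm = sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss; positives[j] with j != i act as negatives for anchors[i]."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, candidate) / temperature for candidate in p]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + log(sum(exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the true pair
    return total / len(a)

# Toy 2-d embeddings (illustrative only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, mismatched_loss)
```

The loss is near zero when each anchor is closest to its own positive and grows large when the pairs are swapped, which is the pressure that pulls matching sentence pairs together and pushes the rest of the batch apart.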

[50] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Xiying Zhao,Zhoufutu Wen,Zhixuan Chen,Jingzhe Ding,Jianpeng Jiao,Shuai Li,Xi Li,Danni Liang,Shengda Long,Qianqian Liu,Xianbo Wu,Hongwan Gao,Xiang Gao,Liang Hu,Jiashuo Liu,Mengyun Liu,Weiran Shi,Chenghao Yang,Qianyu Yang,Xuanliang Zhang,Ge Zhang,Wenhao Huang

Main category: cs.CL

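TL;DR: This paper presents DiscoX, a benchmark for Chinese-English translation in expert domains, and Metric-S, a reference-free automatic evaluation system. Experiments show that current LLMs still trail human experts by a wide margin on this task.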

Details

Motivation: Existing translation evaluation focuses on segment-level accuracy and fluency and cannot adequately assess discourse-level translation quality in expert domains, so a more suitable benchmark and method are needed. Method: Build DiscoX, a Chinese-English translation benchmark of 200 professional texts from 7 domains, and develop Metric-S, a reference-free automatic scoring system that assesses accuracy, fluency, and appropriateness. Result: Metric-S agrees strongly with human judgments and significantly outperforms existing metrics; experiments show that the most advanced LLMs still fall well short of human translators on DiscoX, confirming the benchmark's difficulty. Conclusion: DiscoX and Metric-S provide a more rigorous evaluation framework for expert-domain machine translation, expose the limits of current LLMs on professional translation, and support future research. Abstract: The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.

[51] When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets

Aladin Djuhera,Farhan Ahmed,Swanand Ravindra Kadhe,Syed Zawad,Heiko Ludwig,Holger Boche

Main category: cs.CL

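TL;DR: This paper presents the first comprehensive, data-centric analysis of popular open-source DPO datasets, using the Magpie framework to annotate each sample for task category, input quality, and preference reward, and reveals structural and qualitative differences between datasets. Building on these findings, the authors curate UltraMix, a smaller but better-performing DPO mixture, and release all annotations, metadata, and the mixture.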

Details

Motivation: LLM alignment research lacks systematic comparisons of open-source DPO datasets, mainly because of high computational cost and missing fine-grained quality annotations, which makes it hard to understand how samples were selected and how well they reflect human judgment. Method: Use the Magpie framework to annotate five mainstream open-source DPO datasets (TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs) automatically, analyze task-type distribution, input quality, and a reward-model-based preference signal (preference reward), and design a selection strategy to build the high-quality mixture UltraMix. Result: The analysis reveals structural and qualitative discrepancies in reward margins across DPO datasets; UltraMix is 30% smaller than the best-performing individual dataset yet outperforms it across key benchmarks. Conclusion: Fine-grained, data-centric analysis can effectively improve DPO data quality, and sensible mixing and denoising strategies markedly improve alignment, providing reusable resources and a methodological basis for future data-driven preference optimization research. Abstract: Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. 
UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
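
The idea of a reward margin and of dropping noisy or redundant pairs can be illustrated with a minimal sketch. It assumes each sample already carries reward-model scores for both completions; the field names `prompt`, `chosen_reward`, and `rejected_reward`, the `min_margin` threshold, and the duplicate-prompt rule are all hypothetical and are not taken from the paper's actual curation recipe.

```python
def reward_margin(sample):
    """Reward of the preferred completion minus reward of the rejected one."""
    return sample["chosen_reward"] - sample["rejected_reward"]


def filter_pairs(samples, min_margin=0.0):
    """Keep pairs whose margin exceeds min_margin and whose prompt is new.

    A non-positive margin means the reward model disagrees with the
    labelled preference order, so the pair is treated as noisy here.
    """
    seen_prompts = set()
    kept = []
    for sample in samples:
        if reward_margin(sample) <= min_margin:
            continue  # noisy: reward model does not confirm the preference
        if sample["prompt"] in seen_prompts:
            continue  # redundant: prompt already represented
        seen_prompts.add(sample["prompt"])
        kept.append(sample)
    return kept
```

A real pipeline would likely also balance task categories and input quality, which this sketch leaves out.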

[52] Automata-Based Steering of Large Language Models for Diverse Structured Generation

Xiaokun Luan,Zeming Wei,Yihao Zhang,Meng Sun

Main category: cs.CL

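TL;DR: Proposes a new method that uses automata traversal history to increase the output diversity of LLMs in structured generation; experiments show it significantly improves structural and content diversity while maintaining generation efficiency.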

Details

Motivation: Existing structured generation methods guarantee valid outputs but often lack output diversity, which limits their usefulness in practice. Method: Automata traversal history is used to steer the LLM towards novel structural patterns, increasing the diversity of generated outputs. Result: Evaluations show significant gains in both structural and content diversity at generation efficiency comparable to existing methods; a case study shows the method is effective for generating test cases for open-source libraries. Conclusion: The method addresses the lack of diversity in structured generation and offers a better balance between validity and diversity for LLM structured-output tasks. Abstract: Large language models (LLMs) are increasingly tasked with generating structured outputs. While structured generation methods ensure validity, they often lack output diversity, a critical limitation that we confirm in our preliminary study. We propose a novel method to enhance diversity in automaton-based structured generation. Our approach utilizes automata traversal history to steer LLMs towards novel structural patterns. Evaluations show our method significantly improves structural and content diversity while maintaining comparable generation efficiency. Furthermore, we conduct a case study showcasing the effectiveness of our method in generating diverse test cases for testing open-source libraries.
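
One way to picture history-based steering is sketched below. This is not the authors' algorithm: it assumes a DFA given as a `transitions` dictionary, a `visits` counter of how often each (state, token) transition was taken in earlier generations, and a hypothetical damping exponent `alpha`. Tokens the automaton forbids are masked out, and frequently used transitions are down-weighted before renormalizing.

```python
from collections import Counter


def steer(probs, state, transitions, visits, alpha=1.0):
    """Reweight next-token probabilities at a DFA state.

    probs:       token -> model probability
    transitions: state -> {token: next_state}, the tokens the DFA allows
    visits:      Counter keyed by (state, token), the traversal history
    """
    allowed = transitions[state]
    weights = {
        tok: p / (1 + visits[(state, tok)]) ** alpha
        for tok, p in probs.items()
        if tok in allowed  # validity mask from the automaton
    }
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def record(visits, state, token):
    """Update the traversal history after a token is emitted."""
    visits[(state, token)] += 1
```

With `alpha=0` the function reduces to plain constrained decoding, which makes the effect of the history term easy to ablate.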

[53] Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Xingyu Ren,Youran Sun,Haoyu Liang

Main category: cs.CL

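TL;DR: This paper finds that the outputs of current text embedding models carry a consistent bias and proposes Renormalization, a plug-and-play, training-free, lightweight method to remove it, validated on the Massive Multilingual Text Embedding Benchmark (MMTEB).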

Details

Motivation: The outputs of existing text embedding models contain a nearly constant mean bias μ that hurts performance, so a general and efficient correction is needed. Method: Two renormalization variants are proposed: subtracting the mean μ directly from the embedding e, or subtracting the projection of e onto μ; neither requires retraining, and both apply directly to existing models. Result: Across 38 models, renormalization improves MMTEB performance by 9.7σ on retrieval tasks, 3.1σ on classification tasks, and 0.8σ on other tasks, with the projection-subtraction variant outperforming direct subtraction. Conclusion: The mean bias in embedding vectors is an important factor in model performance, and renormalization is a simple, effective, and broadly applicable fix that improves text embedding models without retraining. Abstract: We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + μ$, where $μ$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by 9.7 $σ$ on retrieval tasks, 3.1 $σ$ on classification tasks, and 0.8 $σ$ on other types of tasks. Renormalization has two variants: directly subtracting $μ$ from $e$, or subtracting the projection of $e$ onto $μ$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
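
The two variants named in the abstract can be written down directly. The sketch below uses plain Python lists; estimating μ as the component-wise mean over a sample of embeddings and rescaling the result to unit length afterwards are assumptions of this sketch, not details given in the abstract.

```python
import math


def mean_vector(embeddings):
    """Component-wise mean of equal-length vectors (assumed estimate of mu)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def subtract_mean(e, mu):
    """Variant 1: e - mu."""
    return [ei - mi for ei, mi in zip(e, mu)]


def subtract_projection(e, mu):
    """Variant 2: e - (e.mu / mu.mu) * mu, removing the part of e along mu."""
    scale = sum(ei * mi for ei, mi in zip(e, mu)) / sum(mi * mi for mi in mu)
    return [ei - scale * mi for ei, mi in zip(e, mu)]


def l2_normalize(v):
    """Rescale to unit length (an assumed post-processing step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```

After variant 2 the corrected vector is orthogonal to μ, so cosine similarities between corrected vectors no longer receive a shared contribution from the bias direction.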

[54] Can LLMs Detect Their Own Hallucinations?

Sora Kadotani,Kosuke Nishida,Kyosuke Nishida

Main category: cs.CL

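TL;DR: This paper studies whether large language models (LLMs) can detect their own factual hallucinations and proposes a Chain-of-Thought (CoT) classification method that extracts knowledge from the model's parameters. In experiments, GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations, suggesting that a model can detect hallucinations with CoT when its parameters contain sufficient knowledge.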

Details

Motivation: LLMs generate fluent text but often produce factual errors (hallucinations), which undermines trust, so their ability to detect their own hallucinations is worth examining. Method: Hallucination detection is formulated as a sentence-level classification task; the paper builds a framework for estimating an LLM's hallucination detection capability and uses Chain-of-Thought (CoT) prompting to extract knowledge from the model's own parameters. Result: GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations. Conclusion: When an LLM's parameters contain sufficient knowledge, CoT reasoning enables a degree of self-detection of hallucinations. Abstract: Large language models (LLMs) can generate fluent responses, but sometimes hallucinate facts. In this paper, we investigate whether LLMs can detect their own hallucinations. We formulate hallucination detection as a classification task of a sentence. We propose a framework for estimating LLMs' capability of hallucination detection and a classification method using Chain-of-Thought (CoT) to extract knowledge from their parameters. The experimental results indicated that GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations. We concluded that LLMs with CoT can detect hallucinations if sufficient knowledge is contained in their parameters.

[55] Analysing Personal Attacks in U.S. Presidential Debates

Ruban Goyal,Rohitash Chandra,Sonit Singh

Main category: cs.CL

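TL;DR: This study proposes a framework for analysing personal attacks in U.S. presidential debates, combining manual annotation with language-model-based methods, and examines how well fine-tuned transformer models and general-purpose LLMs detect personal attacks in formal political speech.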

Details

Motivation: Personal attacks are increasingly common in presidential debates and shape public perception; automated detection can improve transparency in political discourse. Method: Debate transcripts from 2016, 2020, and 2024 are manually annotated and then analysed with statistical methods and with BERT-based and LLM-based techniques. Result: The study shows the potential of task-specific fine-tuned language models for identifying personal attacks in political speech. Conclusion: Task-specific adaptation of modern language models can deepen understanding of political communication and give journalists, analysts, and the public a useful analytical tool. Abstract: Personal attacks have become a notable feature of U.S. presidential debates and play an important role in shaping public perception during elections. Detecting such attacks can improve transparency in political discourse and provide insights for journalists, analysts and the public. Advances in deep learning and transformer-based models, particularly BERT and large language models (LLMs), have created new opportunities for automated detection of harmful language. Motivated by these developments, we present a framework for analysing personal attacks in U.S. presidential debates. Our work involves manual annotation of debate transcripts across the 2016, 2020 and 2024 election cycles, followed by statistical and language-model based analysis. We investigate the potential of fine-tuned transformer models alongside general-purpose LLMs to detect personal attacks in formal political speech. This study demonstrates how task-specific adaptation of modern language models can contribute to a deeper understanding of political communication.

[56] AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

Tuochao Chen,Bandhav Veluri,Hongyu Gong,Shyamnath Gollakota

Main category: cs.CL

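TL;DR: AV-Dialog is the first multimodal dialogue framework to combine audio and visual cues; through multi-task, multi-stage training it achieves robust speech transcription, accurate turn detection, and coherent response generation in noisy environments.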

Details

Motivation: Existing dialogue models perform poorly in noisy, multi-speaker environments, often producing irrelevant responses and unnatural turn-taking, so a robust dialogue system that is aware of the target speaker is needed. Method: The AV-Dialog framework combines acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, using audio and visual cues to track the target speaker, predict turn-taking, and generate responses. Result: Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and improving human-rated dialogue quality. Conclusion: Audio-visual fusion improves dialogue performance in noisy multi-speaker settings and offers a viable path towards robust spoken dialogue agents in real-world conditions. Abstract: Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.

Yi Shi,Wenlong Meng,Zhenyuan Guo,Chengkun Wei,Wenzhi Chen

Main category: cs.CL

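TL;DR: This paper proposes MemoDetector, a new framework for Meme Emotion Understanding (MEU) that introduces a four-step textual enhancement module and a dual-stage modal fusion strategy, and consistently outperforms existing methods on the MET-MEME and MOOD datasets.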

Details

Motivation: Existing methods fall short in fine-grained multimodal fusion and in mining memes' implicit meanings and background knowledge, which limits emotion understanding performance. Method: The MemoDetector framework (1) uses Multimodal Large Language Models (MLLMs) in a four-step textual enhancement module that progressively infers and extracts implicit information from memes, and (2) applies a dual-stage modal fusion strategy in which the first stage performs shallow fusion of the raw image and text and the second stage deeply integrates the enhanced visual and textual features. Result: On the MET-MEME and MOOD datasets, the method improves F1 scores by 4.3% and 3.4% respectively, and ablation studies confirm the contribution of each component. Conclusion: By enhancing textual representations and fusing modalities hierarchically, MemoDetector improves meme emotion understanding and shows strong potential for the MEU task. Abstract: With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes' implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.

[58] Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

Yiming Rong,Yixin Zhang,Ziyi Wang,Deyang Jiang,Yunlong Zhao,Haoran Wu,Shiyu Zhou,Bo Xu

Main category: cs.CL

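TL;DR: Proposes SAP², a method that dynamically prunes and integrates relevant contextual keywords in two stages using a speech-driven attention-based pooling mechanism; it achieves state-of-the-art performance on SlideSpeech and LibriSpeech, lowers word error rates, and scales robustly to long contextual inputs.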

Details

Motivation: Existing ASR systems perform poorly in long-context scenarios that require domain-specific knowledge, such as conference presentations, mainly because of limited model context windows and the sparsity of relevant information within contextual noise. Method: The SAP² framework has two stages, each using a Speech-Driven Attention-based Pooling mechanism to dynamically prune and integrate key contextual information, compressing context embeddings while preserving speech-salient information. Result: SAP² reaches word error rates of 7.71% on SlideSpeech and 1.12% on LibriSpeech; on SlideSpeech it reduces the biased keyword error rate (B-WER) by 41.1% relative to non-contextual baselines, and it maintains performance under long contextual inputs. Conclusion: SAP² makes effective use of the key information in long contexts, improving the accuracy and robustness of speech recognition in complex scenarios with good scalability. Abstract: Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP$^{2}$ also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.

[59] PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases

Udo Schlegel,Franziska Weeber,Jian Lan,Thomas Seidl

Main category: cs.CL

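TL;DR: This paper introduces PRSM, a new metric for evaluating the stability of CLIP under paraphrased text, and uses the Social Counterfactuals dataset to analyze robustness differences in gender-related queries, revealing inconsistent behavior across paraphrasing strategies and its implications for fairness.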

Details

Motivation: Although CLIP performs well on multimodal tasks, its robustness to linguistic variation, especially paraphrasing, is unclear, and this robustness matters most in socially sensitive settings where bias is a concern. Method: A new metric, PRSM, quantifies CLIP's sensitivity to paraphrased queries; experiments on the Social Counterfactuals dataset analyze robustness differences across paraphrasing strategies and gender-associated queries. Result: CLIP's robustness varies with the paraphrasing strategy, and there are subtle but consistent differences between male- and female-associated queries. Conclusion: CLIP's sensitivity to paraphrases may affect fairness in socially sensitive applications, and further work is needed for more equitable deployment of multimodal systems. Abstract: Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP's stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.

[60] Adverbs Revisited: Enhancing WordNet Coverage of Adverbs with a Supersense Taxonomy

Jooyoung Lee,Jader Martins Camboim de Sá

Main category: cs.CL

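TL;DR: This paper proposes a linguistically grounded supersense typology for adverbs, filling a gap in WordNet's classification of adverbs, and validates its coverage and reliability through an annotation study.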

Details

Motivation: WordNet offers rich supersense hierarchies for nouns and verbs, but adverbs lack a systematic semantic classification, which limits their use in NLP. Method: A supersense typology for adverbs is built on linguistic theory and validated empirically through a human annotation study. Result: The proposed categories cover most adverbs in natural text and can be assigned reliably by annotators. Conclusion: The typology extends WordNet's coverage, aligns it more closely with linguistic theory, and supports downstream NLP tasks such as word sense disambiguation, event extraction, and sentiment analysis. Abstract: WordNet offers rich supersense hierarchies for nouns and verbs, yet adverbs remain underdeveloped, lacking a systematic semantic classification. We introduce a linguistically grounded supersense typology for adverbs, empirically validated through annotation, that captures major semantic domains including manner, temporal, frequency, degree, domain, speaker-oriented, and subject-oriented functions. Results from a pilot annotation study demonstrate that these categories provide broad coverage of adverbs in natural text and can be reliably assigned by human annotators. Incorporating this typology extends WordNet's coverage, aligns it more closely with linguistic theory, and facilitates downstream NLP applications such as word sense disambiguation, event extraction, sentiment analysis, and discourse modeling. We present the proposed supersense categories, annotation outcomes, and directions for future work.

[61] LANE: Lexical Adversarial Negative Examples for Word Sense Disambiguation

Jader Martins Camboim de Sá,Jooyoung Lee,Cédric Pruski,Marcos Da Silveira

Main category: cs.CL

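TL;DR: Proposes LANE, an adversarial training strategy that selectively marks alternate words in the training set to shift the model's learning focus to the target word, yielding more discriminative word representations.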

Details

Motivation: Neural language models often overfit to global sentence representations and fail to capture local semantic details, which weakens fine-grained word sense resolution. Method: LANE generates negative examples by selectively marking alternate words, forcing the model to separate copies of the same sentence that differ only in which word is marked; this adversarial training strategy improves word representation learning. Result: LANE outperforms standard contrastive learning baselines on lexical semantic change detection and word sense disambiguation, produces more discriminative word representations, and qualitative analyses show it captures subtle meaning differences better. Conclusion: LANE is model-agnostic, improves the fine-grained semantic representations of neural language models, and can be integrated into existing representation learning frameworks. Abstract: Fine-grained word meaning resolution remains a critical challenge for neural language models (NLMs) as they often overfit to global sentence representations, failing to capture local semantic details. We propose a novel adversarial training strategy, called LANE, to address this limitation by deliberately shifting the model's learning focus to the target word. This method generates challenging negative training examples through the selective marking of alternate words in the training set. The goal is to force the model to create a greater separability between same sentences with different marked words. Experimental results on lexical semantic change detection and word sense disambiguation benchmarks demonstrate that our approach yields more discriminative word representations, improving performance over standard contrastive learning baselines. We further provide qualitative analyses showing that the proposed negatives lead to representations that better capture subtle meaning differences even in challenging environments. Our method is model-agnostic and can be integrated into existing representation learning frameworks.
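
The construction of negatives from the same sentence with a different marked word might look like the sketch below. The marker tokens `[TGT]` and `[/TGT]` are hypothetical, and marking every non-target word is a simplification: the abstract describes the marking as selective but does not say how the alternate words are chosen.

```python
def mark(tokens, index, left="[TGT]", right="[/TGT]"):
    """Return a copy of tokens with the word at index wrapped in markers."""
    return tokens[:index] + [left, tokens[index], right] + tokens[index + 1:]


def make_examples(tokens, target_index):
    """Build one positive (target marked) and negatives (other words marked).

    All examples share the same surface sentence, so a model can only
    tell them apart by attending to which word is marked.
    """
    positive = mark(tokens, target_index)
    negatives = [
        mark(tokens, i) for i in range(len(tokens)) if i != target_index
    ]
    return positive, negatives
```

In a contrastive setup, the positive would be pulled towards other usages of the target word's sense, while the negatives are pushed away.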

Sania Nayab,Marco Simoni,Giulio Rossolini,Andrea Saracino

Main category: cs.CL

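TL;DR: Proposes a scalable, deterministic pipeline for generating question-answer pairs from knowledge graphs, with an LLM refinement step to improve linguistic quality.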

Details

Motivation: Existing approaches struggle with scalability, linguistic quality, and factual consistency. Method: Knowledge graph triplets are first clustered by relation to produce natural-language templates; LLMs then refine the templates for clarity and coherence; finally, answer options are instantiated through a selection strategy that draws distractors from the knowledge graph. Result: Experiments show that this hybrid approach efficiently generates high-quality QA pairs with strong fluency and linguistic precision. Conclusion: The method balances scalability, linguistic quality, and factual accuracy, and suits the development and testing of educational platforms and large language models. Abstract: The generation of questions and answers (QA) from knowledge graphs (KG) plays a crucial role in the development and testing of educational platforms, dissemination tools, and large language models (LLM). However, existing approaches often struggle with scalability, linguistic quality, and factual consistency. This paper presents a scalable and deterministic pipeline for generating natural language QA from KGs, with an additional refinement step using LLMs to further enhance linguistic quality. The approach first clusters KG triplets based on their relations, creating reusable templates through natural language rules derived from the entity types of objects and relations. A module then leverages LLMs to refine these templates, improving clarity and coherence while preserving factual accuracy. Finally, the instantiation of answer options is achieved through a selection strategy that introduces distractors from the KG. Our experiments demonstrate that this hybrid approach efficiently generates high-quality QA pairs, combining scalability with fluency and linguistic precision.
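
The three steps (cluster by relation, fill a template, add distractors) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's pipeline: the `templates` dictionary stands in for the rule-derived and LLM-refined templates, distractors are assumed to be other objects of the same relation, and a seeded `random.Random` keeps the output reproducible.

```python
import random
from collections import defaultdict


def cluster_by_relation(triples):
    """Group (subject, relation, object) triples by their relation."""
    clusters = defaultdict(list)
    for subject, relation, obj in triples:
        clusters[relation].append((subject, obj))
    return clusters


def make_question(subject, relation, answer, clusters, templates,
                  n_distractors=2, rng=None):
    """Instantiate a multiple-choice question for one triple."""
    rng = rng or random.Random(0)  # fixed seed keeps generation deterministic
    # Candidate distractors: other objects seen with the same relation.
    pool = sorted({obj for _, obj in clusters[relation]} - {answer})
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "question": templates[relation].format(subject=subject),
        "options": options,
        "answer": answer,
    }
```

Drawing distractors from the same relation cluster keeps them type-compatible with the correct answer, which is one plausible reading of the selection strategy the abstract mentions.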

[63] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

Wei Fan,JinYi Yoon,Bo Ji

Main category: cs.CL

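TL;DR: Proposes the iMAD framework, which extracts features from a single agent's self-critique and uses a lightweight classifier to decide whether to trigger multi-agent debate, substantially reducing computational cost while improving accuracy.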

Details

Motivation: Multi-Agent Debate (MAD) can improve reasoning accuracy on complex tasks, but triggering it for every query is inefficient and may overturn correct answers, lowering accuracy and wasting compute. Method: A single LLM agent first produces a structured self-critique response, from which 41 linguistic and semantic features are extracted; a lightweight debate-decision classifier trained with the proposed FocusCal loss then decides whether to trigger MAD. Result: On six (visual) question answering datasets against five baselines, iMAD reduces token usage by up to 92% and improves final answer accuracy by up to 13.5%. Conclusion: iMAD is an efficient, generalizable mechanism for triggering multi-agent debate that cuts computational cost while maintaining or improving accuracy, suiting scenarios that need high reasoning quality at low resource cost. Abstract: Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
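
The selective-triggering control flow can be sketched as a simple gate. Everything below is a stand-in: a logistic score replaces the paper's classifier (the FocusCal loss is not reproduced), and `single_agent`, `debate`, and `featurize` are hypothetical callables supplied by the caller.

```python
import math


def debate_probability(features, weights, bias):
    """Logistic score that debate would help (stand-in classifier)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def answer(query, single_agent, debate, featurize, weights, bias,
           threshold=0.5):
    """Answer with one agent; escalate to debate only if the gate fires.

    Returns (final_answer, debated_flag).
    """
    draft = single_agent(query)          # self-critique response
    features = featurize(draft)          # e.g. hesitation cues
    if debate_probability(features, weights, bias) >= threshold:
        return debate(query), True       # costly multi-agent path
    return draft, False                  # cheap single-agent path
```

The token savings come from the second return path: queries the gate judges confident never incur the debate cost.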

[64] destroR: Attacking Transfer Models with Obfuscous Examples to Discard Perplexity

Saadat Rafid Ahmed,Rubayet Shareen,Radoan Sharkar,Nazia Hossain,Mansur Mahi,Farig Yousuf Sadeque

Main category: cs.CL

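TL;DR: This paper studies and experiments with existing adversarial attack methods and proposes a new attack strategy that generates ambiguous inputs to confound state-of-the-art machine learning models, and discusses a path towards improving model robustness.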

Details

Motivation: Machine learning models have been applied successfully in many fields in recent years but have also exposed a range of security vulnerabilities, so their weaknesses need study to improve system security. Method: A new adversarial attack method uses machine learning and deep learning techniques to generate adversarial examples with maximum perplexity, experiments on several datasets, and brings the Bangla language into adversarial attack research. Result: The paper reports developing adversarial examples intended to confuse models, pointing to the vulnerability of existing models to ambiguous inputs and to directions for improving robustness. Conclusion: Adversarial attacks pose a serious threat to current NLP models and defenses need further study; multilingual coverage, including Bangla, is also important for advancing the field. Abstract: Advancements in Machine Learning & Neural Networks in recent years have led to widespread implementations of Natural Language Processing across a variety of fields with remarkable success, solving a wide range of complicated problems. However, recent research has shown that machine learning models may be vulnerable in a number of ways, putting both the models and the systems they're used in at risk. In this paper, we intend to analyze and experiment with the best of existing adversarial attack recipes and create new ones. We concentrated on developing a novel adversarial attack strategy on current state-of-the-art machine learning models by producing ambiguous inputs for the models to confound them and then constructing the path to the future development of the robustness of the models. We will develop adversarial instances with maximum perplexity, utilizing machine learning and deep learning approaches in order to trick the models. In our attack recipe, we will analyze several datasets and focus on creating obfuscous adversary examples to put the models in a state of perplexity, and by including the Bangla Language in the field of adversarial attacks. We strictly uphold utility usage reduction and efficiency throughout our work.

[65] LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models

Jawad Ibn Ahad,Muhammad Rafsan Kabir,Robin Krambroeckers,Sifat Momen,Nabeel Mohammed,Shafin Rahman

Main category: cs.CL

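TL;DR: Proposes Layer-wise Adaptive Ensemble Tuning (LAET), which selectively fine-tunes the most effective layers of pre-trained LLMs, reducing computational overhead while improving performance on financial NLP tasks; even smaller models (about 3B parameters) outperform existing state-of-the-art models such as GPT-4.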

Details

Motivation: Existing financial LLMs such as BloombergGPT and FinMA perform well but are computationally expensive, which puts them out of reach for most organizations, so an efficient and scalable method is needed for practical deployment. Method: LAET analyzes hidden state representations to identify and selectively fine-tune the most effective layers of a pre-trained LLM while freezing less critical layers, reducing computational overhead and improving task-specific performance. Result: On financial NLP tasks, LAET significantly reduces computational requirements and outperforms existing state-of-the-art models, including GPT-4, on multiple benchmarks, even with small models of about 3B parameters. Conclusion: LAET bridges financial NLP research and real-world deployment with an efficient, scalable, high-performing model optimization approach. Abstract: Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit the accessibility of many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs ($\sim$3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.

[66] NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery

Anurag J. Vaidya,Felix Meissen,Daniel C. Castro,Shruthi Bannur,Tristan Lazard,Drew F. K. Williamson,Faisal Mahmood,Javier Alvarez-Valle,Stephanie L. Hyland,Kenza Bouzid

Main category: cs.CL

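TL;DR: NOVA is a new agentic framework that turns scientific queries into executable histopathology analysis pipelines, combining 49 domain-specific tools with the ability to create new tools on demand; on the SlideQuest benchmark it outperforms existing coding-agent baselines, showing potential for scalable discovery in digital pathology.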

Details

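Motivation: Digital pathology analysis workflows are complex, time-consuming, and dependent on specialized expertise, which limits their adoption, so an automated system is needed to lower the technical barrier and improve efficiency and accessibility. Method: The NOVA framework uses an agent architecture that iteratively generates and runs Python code, integrates 49 domain-specific tools built on open-source software, and can create new tools on demand; the accompanying SlideQuest benchmark contains 90 questions verified by pathologists and biomedical scientists, spanning data processing, quantitative analysis, and hypothesis testing, to evaluate multi-step reasoning and coding ability. Result: On SlideQuest, NOVA outperforms existing coding-agent baselines; a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating practical value. Conclusion: By combining code generation with domain-specific tools, NOVA automates answers to complex pathology questions, shows strong multi-step reasoning and computational problem solving, and offers a scalable, accessible analysis paradigm for digital pathology research. Abstract: Digitized histopathology analysis involves complex, time-intensive workflows and specialized expertise, limiting its accessibility. We introduce NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain-specific tools (e.g., nuclei segmentation, whole-slide encoding) built on open-source software, and can also create new tools ad hoc. To evaluate such systems, we present SlideQuest, a 90-question benchmark -- verified by pathologists and biomedical scientists -- spanning data processing, quantitative analysis, and hypothesis testing. Unlike prior biomedical benchmarks focused on knowledge recall or diagnostic QA, SlideQuest demands multi-step reasoning, iterative coding, and computational problem solving. Quantitative evaluation shows NOVA outperforms coding-agent baselines, and a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating its scalable discovery potential.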

[67] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

Jian Gao,Richeng Xuan,Zhaolu Kang,Dingshi Liao,Wenxin Huang,Zongmou Huang,Yangdi Xu,Bowen Qin,Zheqi He,Xi Yang,Changjin Li

Main category: cs.CL

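TL;DR: LaoBench is the first large-scale, high-quality, multidimensional benchmark dataset for Lao, designed to evaluate the understanding and reasoning abilities of large language models in a low-resource Southeast Asian language.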

Details

Motivation: Progress in LLMs has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages such as Lao, which lack suitable benchmarks. Method: The LaoBench dataset contains over 17,000 samples across three core dimensions (knowledge application, K12 foundational education, and bilingual translation), built with a pipeline that combines expert human curation and automated agent-assisted verification, and split into open-source and closed-source subsets to support fair evaluation. Result: Benchmarking multiple state-of-the-art LLMs shows that current models still face significant challenges on Lao tasks. Conclusion: LaoBench is expected to encourage research and development of AI technologies for underrepresented Southeast Asian languages. Abstract: The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce LaoBench, the first large-scale, high-quality, and multidimensional benchmark dataset dedicated to assessing LLMs' comprehensive language understanding and reasoning abilities in Lao. LaoBench comprises over 17,000 carefully curated samples spanning three core dimensions: knowledge application, K12 foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is divided into open-source and closed-source subsets, with the closed-source portion enabling black-box evaluation on an official platform to ensure fairness and data security. Our data construction pipeline integrates expert human curation with automated agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational value. Benchmarking multiple state-of-the-art LLMs on LaoBench reveals that current models still face significant challenges in mastering Lao across diverse tasks. We hope LaoBench will catalyze further research and development of AI technologies for underrepresented Southeast Asian languages.

[68] M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text

Salima Lamsiyah,Saad Ezzini,Abdelkader El Mahdaouy,Hamza Alami,Abdessamad Benlahbib,Samir El Amrany,Salmane Chafik,Hicham Hammouchi

Main category: cs.CL

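TL;DR: This paper introduces the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, with binary classification subtasks for news articles and academic writing, and releases a large-scale benchmark dataset of 30,000 samples; 46 teams registered and four submitted results.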

Details

Motivation: As LLM-generated text becomes more fluent, information integrity and academic research are at risk, and effective methods for identifying AI-generated content are needed. Method: A dataset balanced between human-written and AI-generated text is built, with AI text produced by a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies; two binary classification subtasks are defined (News Article Detection, NAD, and Academic Writing Detection, AWD), and the shared task collects participants' methods and results. Result: 46 teams registered and 4 submitted final results, all of them in both subtasks; the paper describes the methods each team used. Conclusion: M-DAIGT provides a benchmark and data for cross-domain detection of AI-generated text, shows the current state and challenges of detection techniques, and outlines future directions. Abstract: The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.

[69] Studies with impossible languages falsify LMs as models of human language

Jeffrey S. Bowers,Jeff Mitchell

Main category: cs.CL

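TL;DR: Language models learn attested and impossible languages about equally well and lack the inductive biases that support human language acquisition.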

Details

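Motivation: To examine whether language models, like infants, find attested languages easier to learn, and whether they have human-like inductive biases for language acquisition. Method: The paper reviews the literature, comparing how language models learn attested languages versus impossible languages with unnatural structures, and relating learning difficulty to language complexity. Result: Language models often learn attested languages and many impossible languages equally well; the impossible languages that are hard to learn are simply more complex or more random; language models lack the inductive biases that support human language acquisition. Conclusion: Current language models lack the built-in preferences of human language learning, so they do not distinguish natural from unnatural languages, and their learning is affected only by complexity. Abstract: According to Futrell and Mahowald [arXiv:2501.17047], both infants and language models (LMs) find attested languages easier to learn than impossible languages that have unnatural structures. We review the literature and show that LMs often learn attested and many impossible languages equally well. Difficult-to-learn impossible languages are simply more complex (or random). LMs are missing human inductive biases that support language acquisition.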

[70] MajinBook: An open catalogue of digital world literature with likes

Antoine Mazières,Thierry Poibeau

Main category: cs.CL

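TL;DR: This paper introduces MajinBook, an open catalogue for computational social science and cultural analytics that links data from shadow libraries and Goodreads into a high-precision corpus of over 539,000 English-language books, and discusses its legal permissibility.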

Details

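Motivation: To address biases in traditional corpora such as HathiTrust and to meet the need in computational social science and cultural analytics for large-scale, machine-readable text resources, the authors aim to build a high-precision, openly available collection of book metadata. Method: Metadata from shadow libraries (such as Library Genesis and Z-Library) is linked with structured bibliographic data from Goodreads, prioritizing natively digital EPUB files for machine-readable quality, with secondary datasets for French, German, and Spanish. Result: The resulting high-precision corpus holds over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics such as ratings and reviews; the linkage strategy is evaluated for accuracy and all underlying data is released openly. Conclusion: MajinBook provides researchers with an open, high-quality, multilingual bibliographic resource spanning several centuries, and the paper discusses its legal permissibility under EU and US frameworks for text and data mining in research. Abstract: This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.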

[71] Proactive Hearing Assistants that Isolate Egocentric Conversations

Guilin Hu,Malek Itani,Tuochao Chen,Shyamnath Gollakota

Main category: cs.CL

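TL;DR: Proposes a proactive hearing assistant that automatically identifies and separates the wearer's conversation partners without explicit prompts, using self-speech and dialogue dynamics to run in real time on device.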

Details

Motivation: Conventional hearing devices require the user to select the conversation partner manually and cannot adapt to complex multi-speaker environments, so a hearing assistant that proactively senses and adapts to conversational dynamics is needed. Method: With the wearer's self-speech as an anchor, the system uses binaural audio and turn-taking behavior in a dual-model architecture that pairs a lightweight streaming model (running every 12.5 ms) with a slower model that captures longer-range dynamics, enabling real-time on-device processing. Result: On real-world 2- and 3-speaker conversation test sets from 11 participants totaling 6.8 hours, the system identifies and isolates conversation partners in multi-conversation settings and generalizes well. Conclusion: The work moves hearing devices towards proactive adaptation to conversational dynamics and engagement, offering a new direction for intelligent hearing assistance. Abstract: We introduce proactive hearing assistants that automatically identify and separate the wearer's conversation partners, without requiring explicit prompts. Our system operates on egocentric binaural audio and uses the wearer's self-speech as an anchor, leveraging turn-taking behavior and dialogue dynamics to infer conversational partners and suppress others. To enable real-time, on-device operation, we propose a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model runs less frequently to capture longer-range conversational dynamics. Results on real-world 2- and 3-speaker conversation test sets, collected with binaural egocentric hardware from 11 participants totaling 6.8 hours, show generalization in identifying and isolating conversational partners in multi-conversation settings. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement. More information can be found on our website: https://proactivehearing.cs.washington.edu/
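
The dual-rate arrangement can be pictured as a loop in which a fast model handles every audio chunk and a slow model refreshes shared context only occasionally. The sketch below is schematic: `fast_model` and `slow_model` are hypothetical callables, and `slow_every=80` (once per second at 12.5 ms chunks) is an assumed value, since the abstract only says the slower model runs less frequently.

```python
def run_stream(chunks, fast_model, slow_model, slow_every=80):
    """Process audio chunks with a fast per-chunk model and a slow context model.

    fast_model(chunk, context) -> output for this chunk (low latency)
    slow_model(history)        -> context summarizing longer-range dynamics
    """
    context = None
    outputs = []
    for i, chunk in enumerate(chunks):
        if i % slow_every == 0:
            # Refresh long-range context from everything heard so far.
            context = slow_model(chunks[: i + 1])
        outputs.append(fast_model(chunk, context))
    return outputs
```

On a real device the slow model would run asynchronously so it never blocks the 12.5 ms path; the synchronous loop here only shows the data flow.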

[72] W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search

Zhenyu Ding,Yuhao Wang,Tengyue Xiao,Haoying Wang,Guojun Ma,Mingyang Wan,Caigui Jiang,Ning Ding

Main category: cs.CL

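TL;DR: Proposes W2S-AlignTree, an inference-time alignment framework that combines Monte Carlo Tree Search with the weak-to-strong generalization paradigm, giving fine-grained control over generation without modifying parameters and improving LLM performance on several tasks.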

Details

Motivation: Training-time alignment methods for LLMs, such as RLHF, are costly because they depend on expert annotation and offer little dynamic control at inference time, so scalable, adaptable alignment mechanisms are urgently needed. Method: LLM alignment is formulated as an optimal heuristic search problem in a generative search tree; real-time step-level signals from a weak model serve as alignment proxies and, combined with an entropy-aware exploration mechanism, guide the strong model's generation at inference time through MCTS. Result: The method outperforms strong baselines on sentiment generation, summarization, and instruction following; on summarization it raises the Llama3-8B score from 1.89 to 2.19, a relative improvement of 15.9%. Conclusion: As a plug-and-play inference-time alignment framework, W2S-AlignTree achieves fine-grained control without fine-tuning and shows good generality and application potential. Abstract: Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging a weak model's real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during the strong model's generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19, a relative improvement of 15.9% on the summarization task.

[73] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Afra Feyza Akyürek,Advait Gosai,Chen Bo Calvin Zhang,Vipul Gupta,Jaehwan Jeong,Anisha Gunjal,Tahseen Rabbani,Maria Mazzone,David Randolph,Mohammad Mahmoudi Meymand,Gurshaan Chattha,Paula Rodriguez,Diego Mares,Pavit Singh,Michael Liu,Subodh Chawla,Pete Cline,Lucy Ogaz,Ernesto Hernandez,Zihao Wang,Pavi Bhatter,Marcos Ayestaran,Bing Liu,Yunzhong He

Main category: cs.CL

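TL;DR: Introduces PRBench, a realistic, open-ended, and difficult professional reasoning benchmark for finance and law, with 1,100 expert-authored tasks and 19,356 rubric criteria contributed by 182 professionals. Evaluation shows that current models perform poorly on it, exposing critical gaps for professional use.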

Details

Motivation: Existing academic benchmarks do not adequately reflect model performance in real professional settings such as law and finance, and lack evaluation of high-stakes, economically consequential open-ended tasks. Method: The authors build PRBench, an open-source benchmark of expert-authored tasks and fine-grained rubrics covering 114 countries and 47 US jurisdictions, validate the rubrics through independent expert review, and evaluate 20 leading models. Result: The leading models score only 0.39 (Finance) and 0.37 (Legal) on PRBench's Hard subsets, leaving substantial room for improvement; common errors include inaccurate judgments, a lack of process transparency, and incomplete reasoning. Conclusion: PRBench fills a gap in high-quality evaluation for professional domains, shows that current large models are not yet reliable enough for real professional use, and underlines the need to optimize models for specific capabilities to support high-stakes decisions. Abstract: Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.

cs.CV

[74] A Mathematical Framework for AI Singularity: Conditions, Bounds, and Control of Recursive Improvement

Akbar Anbar Jafari,Cagri Ozcinar,Gholamreza Anbarjafari

Main category: cs.CV

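TL;DR: Presents an analytic framework for studying whether AI capability can grow out of control under recursive self-improvement. It models resource limits and deployment policies, defines a critical boundary separating superlinear from subcritical growth, and provides implementable safety controls and falsifiable tests based on observable engineering metrics.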

Details

Motivation: To determine under what measurable conditions AI capability could escalate without bound in finite time (an AI singularity) and how that risk can be ruled out theoretically, replacing the current speculative discussion of explosive AI growth. Method: An analytic framework links capability growth to resource build-out and deployment policies; physical and information-theoretic limits (power, bandwidth, memory) form a service envelope, and an endogenous growth model that incorporates capital yields decision rules for runaway versus non-runaway growth. Result: The simulation-free framework maps observable series such as facility power, IO bandwidth, training throughput, benchmark losses, and spending into yes/no certificates for runaway or nonsingular behavior, and gives directly implementable safety controls such as power caps, throughput throttling, and evaluation gates. Case studies show how growth behavior changes under different resource constraints. Conclusion: Whether rapid capability growth leads to a singularity depends on specific resource and investment dynamics; the framework offers testable conditions and practical controls, turning the discussion of an AI singularity from speculation into a verifiable, manageable question. Abstract: AI systems improve by drawing on more compute, data, energy, and better training methods. This paper asks a precise, testable version of the "runaway growth" question: under what measurable conditions could capability escalate without bound in finite time, and under what conditions can that be ruled out? We develop an analytic framework for recursive self-improvement that links capability growth to resource build-out and deployment policies. Physical and information-theoretic limits from power, bandwidth, and memory define a service envelope that caps instantaneous improvement. An endogenous growth model couples capital to compute, data, and energy and defines a critical boundary separating superlinear from subcritical regimes. We derive decision rules that map observable series (facility power, IO bandwidth, training throughput, benchmark losses, and spending) into yes/no certificates for runaway versus nonsingular behavior. The framework yields falsifiable tests based on how fast improvement accelerates relative to its current level, and it provides safety controls that are directly implementable in practice, such as power caps, throughput throttling, and evaluation gates. Analytical case studies cover capped-power, saturating-data, and investment-amplified settings, illustrating when the envelope binds and when it does not. The approach is simulation-free and grounded in measurements engineers already collect.
Limitations include dependence on the chosen capability metric and on regularity diagnostics; future work will address stochastic dynamics, multi-agent competition, and abrupt architectural shifts. Overall, the results replace speculation with testable conditions and deployable controls for certifying or precluding an AI singularity.
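
A textbook power-law growth law gives a minimal worked illustration of what "without bound in finite time" means. This is an assumption chosen for illustration, not the paper's model: the symbols $C$ (capability), $a$ (rate constant), and $p$ (feedback exponent) do not come from the abstract.

```latex
% Illustrative growth law (assumed, not taken from the paper):
\frac{dC}{dt} = a\,C^{p}, \qquad C(0) = C_0 > 0, \quad a > 0.

% Separating variables for p \neq 1 gives
C(t) = \bigl[\, C_0^{\,1-p} - a\,(p-1)\,t \,\bigr]^{-1/(p-1)}.

% For p > 1 the bracket reaches zero at a finite time, so C diverges there:
t^{*} = \frac{C_0^{\,1-p}}{a\,(p-1)} \qquad (p > 1).

% For p = 1 growth is exponential, C(t) = C_0 e^{a t}, and for p < 1 it is
% polynomial; neither diverges in finite time.
```

In this toy setting the boundary between finite-time divergence and bounded-in-finite-time growth is $p = 1$. Any hard cap $dC/dt \le M$ gives $C(t) \le C_0 + M t$, which is one way to read how a binding envelope rules out divergence.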

[75] Semantic VLM Dataset for Safe Autonomous Driving

Yuankai He,Weisong Shi

Main category: cs.CV

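TL;DR: CAR-Scenes is a frame-level vision-language dataset for autonomous driving with fine-grained annotations of 5,192 images, covering a 28-key knowledge base and more than 350 attributes, that supports interpretable scene understanding and risk-aware scenario mining.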

Details

Motivation: To improve the interpretability and fine-grained understanding that vision-language models bring to complex driving scenes; existing datasets fall short in semantic richness, annotation consistency, and task reproducibility. Method: Images from several mainstream datasets are annotated against a 28-key knowledge base using a GPT-4o-assisted vision-language pipeline with human verification; attribute co-occurrence graphs and JSONL records are built, and a standardized evaluation procedure and baselines such as a LoRA-tuned Qwen2-VL-2B are provided. Result: The release comprises a high-quality annotated dataset with 350+ leaf attributes, together with annotation scripts, prompts, post-processing rules, and evaluation tools; it provides reproducible baselines based on deterministic decoding and supports semantic retrieval, dataset triage, and risk-aware analysis. Conclusion: CAR-Scenes provides a data and tooling foundation for vision-language understanding in autonomous driving and advances data-centric, interpretable research on intelligent vehicles. Abstract: CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
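
The abstract names micro-averaged F1 for list attributes and MAE/RMSE for severity. A minimal sketch of those two metrics follows, assuming each list attribute is compared as a set of labels per frame; the function names and the example labels are hypothetical and are not taken from the CAR-Scenes evaluation scripts.

```python
import math


def micro_f1(predicted_lists, gold_lists):
    """Micro-averaged F1: pool true/false positives over all frames first."""
    tp = fp = fn = 0
    for pred, gold in zip(predicted_lists, gold_lists):
        pred_set, gold_set = set(pred), set(gold)
        tp += len(pred_set & gold_set)   # labels predicted and present
        fp += len(pred_set - gold_set)   # labels predicted but absent
        fn += len(gold_set - pred_set)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mae_rmse(predicted, gold):
    """Mean absolute error and root-mean-square error for scalar severity."""
    errors = [p - g for p, g in zip(predicted, gold)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Hypothetical two-frame example for a vulnerable-road-user list attribute.
f1 = micro_f1([["pedestrian", "cyclist"], ["pedestrian"]],
              [["pedestrian"], ["pedestrian", "scooter"]])
mae, rmse = mae_rmse([3, 5, 8], [3, 7, 4])
```

Pooling counts before computing F1 weights every label occurrence equally, so frames with long attribute lists influence the score more than frames with short ones.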

[76] Fast Data Attribution for Text-to-Image Models

Sheng-Yu Wang,Aaron Hertzmann,Alexei A Efros,Richard Zhang,Jun-Yan Zhu

Main category: cs.CV

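TL;DR: Proposes a scalable, efficient data attribution method that distills a slow unlearning-based attribution method into a feature embedding space, enabling fast retrieval of the training images that most influenced a text-to-image model's output.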

Details

Motivation: Existing data attribution methods need substantial compute for every query, which makes them hard to use in practice. Method: An unlearning-based attribution method is distilled into a feature embedding space and combined with efficient indexing and search to retrieve highly influential training images quickly. Result: The method performs well on medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, completing retrieval in a few seconds, 2,500x to 400,000x faster than existing methods. Conclusion: The method is a meaningful step toward applying data attribution at scale to real-world models such as Stable Diffusion. Abstract: Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.
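
Once attribution has been distilled into an embedding space, answering a query reduces to nearest-neighbour search. The sketch below shows that retrieval step only, using brute-force cosine similarity as an assumed similarity measure; the abstract does not state which similarity or index the authors use, and the function name and toy vectors are hypothetical.

```python
import math


def top_k_by_cosine(query, train_embeddings, k):
    """Return indices of the k training embeddings most similar to the query.

    Assumes all vectors are non-zero and share the same dimension.
    """
    def norm(vector):
        return math.sqrt(sum(x * x for x in vector))

    query_norm = norm(query)
    scored = []
    for index, embedding in enumerate(train_embeddings):
        dot = sum(q * e for q, e in zip(query, embedding))
        scored.append((dot / (query_norm * norm(embedding)), index))
    # Highest similarity first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]


# Toy 2-D embeddings standing in for learned attribution features.
train = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.5, 0.5]]
nearest = top_k_by_cosine([1.0, 0.0], train, k=2)
```

A brute-force scan is linear in the training set size; at LAION scale an approximate nearest-neighbour index would replace the loop, which is presumably what the abstract means by efficient indexing and search.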

[77] Expert Consensus-based Video-Based Assessment Tool for Workflow Analysis in Minimally Invasive Colorectal Surgery: Development and Validation of ColoWorkflow

Pooja P Jain,Pietro Mascagni,Giuseppe Massimiani,Nabani Banik,Marta Goglia,Lorenzo Arboit,Britty Baby,Andrea Balla,Ludovica Baldari,Gianfranco Silecchia,Claudio Fiorillo,CompSurg Colorectal Experts Group,Sergio Alfieri,Salvador Morales-Conde,Deborah S Keller,Luigi Boni,Nicolas Padoy

Main category: cs.CV

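TL;DR: This study develops and validates ColoWorkflow, a video-based assessment tool for analyzing the workflow of minimally invasive colorectal surgery. Built on Delphi consensus, the tool shows broad applicability and moderate inter-rater reliability, and is the first standardized VBA tool for these procedures.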

Details

Motivation: Minimally invasive colorectal surgery is marked by procedural variability and a steep learning curve, and existing workflow analysis tools are hard to standardize. A generalizable, reliable, and easily implemented video assessment tool is therefore needed to optimize training and improve surgical quality. Method: A Delphi process established generalizable surgical phases and steps, from which ColoWorkflow was developed; independent raters then applied it to a multicentre video dataset of laparoscopic and robotic colorectal surgery, and applicability and inter-rater agreement (Cohen's kappa) were evaluated. Result: ColoWorkflow comprises the 10 procedure-agnostic phases and 34 procedure-specific steps on which consensus was reached; applied to 54 operative videos, it showed broad applicability (only one label went unused), with mean Cohen's kappa of 0.71 for phases and 0.66 for steps, and most discrepancies at phase transitions and step boundary definitions. Conclusion: ColoWorkflow is the first consensus-based, validated video workflow analysis tool for minimally invasive colorectal surgery. It provides a reproducible framework for assessing surgical performance, supports cross-institution benchmarking and AI-driven workflow recognition, and may help standardize training and advance data-driven surgical quality improvement. Abstract: Minimally invasive colorectal surgery is characterized by procedural variability, a difficult learning curve, and complications that impact quality and outcomes. Video-based assessment (VBA) offers an opportunity to generate data-driven insights to reduce variability, optimize training, and improve surgical performance. However, existing tools for workflow analysis remain difficult to standardize and implement. This study aims to develop and validate a VBA tool for workflow analysis across minimally invasive colorectal procedures. A Delphi process was conducted to achieve consensus on generalizable workflow descriptors. The resulting framework informed the development of a new VBA tool, ColoWorkflow. Independent raters then applied ColoWorkflow to a multicentre video dataset of laparoscopic and robotic colorectal surgery (CRS). Applicability and inter-rater reliability were evaluated. Consensus was achieved for 10 procedure-agnostic phases and 34 procedure-specific steps describing CRS workflows. ColoWorkflow was developed and applied to 54 colorectal operative videos (left and right hemicolectomies, sigmoid and rectosigmoid resections, and total proctocolectomies) from five centres. The tool demonstrated broad applicability, with all but one label utilized. Inter-rater reliability was moderate, with mean Cohen's kappa of 0.71 for phases and 0.66 for steps. Most discrepancies arose at phase transitions and step boundary definitions.
ColoWorkflow is the first consensus-based, validated VBA tool for comprehensive workflow analysis in minimally invasive CRS. It establishes a reproducible framework for video-based performance assessment, enabling benchmarking across institutions and supporting the development of artificial intelligence-driven workflow recognition. Its adoption may standardize training, accelerate competency acquisition, and advance data-informed surgical quality improvement.
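
Inter-rater reliability here is reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard two-rater formula follows; the phase names in the example are hypothetical placeholders, not the ColoWorkflow labels, and the study's exact segment-level protocol is not given in the abstract.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same sequence of items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical per-segment phase labels from two raters.
a = ["access", "mobilisation", "mobilisation", "resection", "anastomosis", "closure"]
b = ["access", "mobilisation", "resection", "resection", "anastomosis", "closure"]
kappa = cohens_kappa(a, b)
```

In the example the raters disagree on one boundary segment out of six, which mirrors the abstract's observation that most discrepancies arise at phase transitions.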

[78] Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification

Junjie Zhang,Feng Zhao,Hanqiang Liu,Jun Yu

Main category: cs.CV

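TL;DR: Proposes a frequency-aware vision-language network (FVMGN) for remote sensing multimodality generalization, using diffusion-based augmentation, wavelet disentanglement, and multiscale alignment modules to improve cross-scene, cross-modality remote sensing image classification.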

Details

Motivation: Remote sensing technology is advancing rapidly, but existing vision-language models handle multimodal data heterogeneity and cross-scene generalization poorly, and lack proprietary linguistic prior knowledge for different remote sensing modalities. Method: The FVMGN framework comprises a diffusion-based training-test-time augmentation strategy (DTAug), a multimodal wavelet disentanglement module (MWDis), shared and proprietary text inputs, a spatial-frequency-aware image encoder (SFIE), and a multiscale spatial-frequency feature alignment module (MSFFA), which together build a unified cross-modal semantic space. Result: Experiments show that FVMGN generalizes across modalities better than state-of-the-art methods and copes with data heterogeneity and cross-scene classification in remote sensing imagery. Conclusion: By combining frequency-domain analysis with vision-language alignment, FVMGN offers an effective solution for remote sensing multimodality generalization, with strong cross-modality and cross-scene adaptability. Abstract: The booming remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models (VLMs) usually describe surface materials in RS images using universal texts, lacking proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low and high frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is suggested to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in spatial and frequency domains.
Extensive experiments show that FVMGN has excellent multimodality generalization ability compared with state-of-the-art (SOTA) methods.

[79] GFT: Graph Feature Tuning for Efficient Point Cloud Analysis

Manish Dhakal,Venkat R. Dasari,Raj Sunderraman,Yi Ding

Main category: cs.CV

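TL;DR: Proposes Graph Features Tuning (GFT), a parameter-efficient fine-tuning method for point cloud data that learns a dynamic graph with a lightweight graph convolution network and passes graph features through skip connections and cross-attention modules, substantially reducing trainable parameters while performing competitively on classification and segmentation.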

Details

Motivation: General-purpose parameter-efficient fine-tuning methods perform poorly on point cloud data and still carry many trainable parameters, so a more efficient method designed for point clouds is needed. Method: GFT uses a lightweight graph convolution network to learn a dynamic graph from the transformer's initial tokenized inputs, and passes the graph features to deeper layers through skip connections and efficient cross-attention modules. Result: On object classification and segmentation, GFT rivals existing methods while significantly reducing the number of trainable parameters. Conclusion: GFT is an efficient point-cloud-specific PEFT method that preserves performance while sharply cutting parameter count, which suits fine-tuning point cloud models in resource-constrained settings. Abstract: Parameter-efficient fine-tuning (PEFT) significantly reduces computational and memory costs by updating only a small subset of the model's parameters, enabling faster adaptation to new tasks with minimal loss in performance. Previous studies have introduced PEFTs tailored for point cloud data, as general approaches are suboptimal. To further reduce the number of trainable parameters, we propose a point-cloud-specific PEFT, termed Graph Features Tuning (GFT), which learns a dynamic graph from initial tokenized inputs of the transformer using a lightweight graph convolution network and passes these graph features to deeper layers via skip connections and efficient cross-attention modules. Extensive experiments on object classification and segmentation tasks show that GFT operates in the same domain, rivalling existing methods, while reducing the trainable parameters. Code is at https://github.com/manishdhakal/GFT.

[80] Accuracy-Preserving CNN Pruning Method under Limited Data Availability

Daisuke Yasui,Toshitaka Matsuki,Hiroshi Sato

Main category: cs.CV

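TL;DR: Proposes a new LRP-based pruning method that achieves a high pruning rate while better preserving model accuracy when little data is available.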

Details

Motivation: Existing LRP-based pruning methods show promise without fine-tuning but still suffer significant accuracy degradation, which limits their practical use. Method: A new pruning strategy is designed around Layer-wise Relevance Propagation (LRP), using finer-grained relevance analysis to prune effectively without fine-tuning. Result: With only a small amount of data, the method reaches a higher pruning rate and preserves model accuracy better than existing methods. Conclusion: The method is more practical and performs better in resource-constrained, data-limited settings, advancing model compression without fine-tuning. Abstract: Convolutional Neural Networks (CNNs) are widely used in image recognition and have succeeded in various domains. CNN models have become larger-scale to improve accuracy and generalization performance. Research has been conducted on compressing pre-trained models for specific target applications in environments with limited computing resources. Among model compression techniques, methods using Layer-wise Relevance Propagation (LRP), an explainable AI technique, have shown promise by achieving high pruning rates while preserving accuracy, even without fine-tuning. Because these methods do not require fine-tuning, they are suited to scenarios with limited data. However, existing LRP-based pruning approaches still suffer from significant accuracy degradation, limiting their practical usability. This study proposes a pruning method that achieves a higher pruning rate while preserving better model accuracy. Our approach to pruning with a small amount of data has achieved pruning that preserves accuracy better than existing methods.

[81] Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling

Seoik Jung,Taekyung Song,Yangro Lee,Sungjun Lee

Main category: cs.CV

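TL;DR: Proposes a short-window sliding learning framework for real-time violence detection that splits videos into 1-2 second clips and labels them automatically with a large language model, achieving high-accuracy recognition of violent events.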

Details

Motivation: Conventional methods trained on long videos struggle to capture violent events that unfold quickly, and annotation is costly, so a more efficient and precise real-time detection method is needed. Method: Videos are divided into short 1-2 second clips, a large language model produces automatic caption labels to build a fine-grained dataset, and every frame of each clip is used to preserve temporal continuity and improve detection precision. Result: The method reaches 95.25% accuracy on RWF-2000 and 83.25% on the long-video UCF-Crime dataset, a significant improvement on long videos. Conclusion: The method generalizes well and has real-time potential for violence detection in intelligent surveillance systems. Abstract: This paper proposes a Short-Window Sliding Learning framework for real-time violence detection in CCTV footage. Unlike conventional long-video training approaches, the proposed method divides videos into 1-2 second clips and applies Large Language Model (LLM)-based auto-caption labeling to construct fine-grained datasets. Each short clip fully utilizes all frames to preserve temporal continuity, enabling precise recognition of rapid violent events. Experiments demonstrate that the proposed method achieves 95.25% accuracy on RWF-2000 and significantly improves performance on long videos (UCF-Crime: 83.25%), confirming its strong generalization and real-time applicability in intelligent surveillance systems.
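
Dividing a video into short sliding windows can be sketched as simple index arithmetic over frames. The stride, the default window length, and the function name below are assumptions for illustration; the abstract only states that clips are 1-2 seconds long and that every frame within a clip is used.

```python
def sliding_windows(num_frames, fps, window_seconds=1.5, stride_seconds=0.5):
    """Return (start, end) frame index pairs, end exclusive, for each clip.

    Every frame inside a window is kept, so each clip is a contiguous range.
    Trailing frames that do not fill a whole window are dropped.
    """
    window = int(round(window_seconds * fps))
    stride = int(round(stride_seconds * fps))
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one frame")
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, stride)]


# A 3-second clip at 30 fps, cut into 1-second windows every 0.5 seconds.
clips = sliding_windows(num_frames=90, fps=30,
                        window_seconds=1.0, stride_seconds=0.5)
```

Each resulting pair can then be passed to a captioning or labelling step, so that a label attaches to a short span rather than to the whole video.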

[82] MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition

Feng Li,Ke Wu,Yongwei Li

Main category: cs.CV

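TL;DR: Proposes MCN-CL, a multimodal cross-attention network with contrastive learning for multimodal emotion recognition. A triple query mechanism and a hard negative mining strategy address modal heterogeneity and class imbalance, and the method outperforms existing approaches on IEMOCAP and MELD.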

Details

Motivation: Existing multimodal emotion recognition methods face three main challenges: unbalanced class distribution, the complexity of temporal modeling of facial action units, and difficult feature fusion caused by modal heterogeneity. With multimodal data on social media growing rapidly, an efficient cross-modal fusion framework is urgently needed. Method: MCN-CL combines a multimodal cross-attention network with contrastive learning, using a triple query mechanism and a hard negative mining strategy to remove feature redundancy while preserving key emotional cues and improving cross-modal fusion. Result: On IEMOCAP and MELD, the method improves Weighted F1 over the previous best approaches by 3.42% and 5.73%, respectively. Conclusion: MCN-CL handles modal heterogeneity and class imbalance in multimodal emotion recognition and significantly improves recognition performance, with strong practical and generalization potential. Abstract: Multimodal emotion recognition plays a key role in many domains, including mental health monitoring, educational interaction, and human-computer interaction. However, existing methods often face three major challenges: unbalanced category distribution, the complexity of dynamic facial action unit time modeling, and the difficulty of feature fusion due to modal heterogeneity. With the explosive growth of multimodal data in social media scenarios, the need for building an efficient cross-modal fusion framework for emotion recognition is becoming increasingly urgent. To this end, this paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition. It uses a triple query mechanism and hard negative mining strategy to remove feature redundancy while preserving important emotional cues, effectively addressing the issues of modal heterogeneity and category imbalance. Experiment results on the IEMOCAP and MELD datasets show that our proposed method outperforms state-of-the-art approaches, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.

[83] DINOv3 as a Frozen Encoder for CRPS-Oriented Probabilistic Rainfall Nowcasting

Luciano Araujo Dourado Filho,Almir Moreira da Silva Neto,Anthony Miyaguchi,Rodrigo Pereira David,Rodrigo Tripodi Calumby,Lukáš Picek

Main category: cs.CV

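TL;DR: Proposes an efficient probabilistic rainfall nowcasting method built on a pre-trained satellite vision encoder and a video projector, which clearly outperforms 3D-UNET baselines on the Weather4Cast 2025 benchmark.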

Details

Motivation: Improving the accuracy and computational efficiency of rainfall nowcasting calls for a model that makes effective use of satellite imagery and produces probabilistic forecasts. Method: A pre-trained DINOv3-SAT493M satellite vision encoder is paired with a V-JEPA Vision Transformer as video projector and a lightweight probabilistic head that maps encoder outputs to an empirical cumulative distribution function (eCDF) over 4-hour accumulated rainfall; the projector and head are optimized end-to-end with the Continuous Ranked Probability Score (CRPS). Result: On the Weather4Cast 2025 benchmark the method achieves a CRPS of 3.5102, about a 26% improvement over the best 3D-UNET. Conclusion: The method performs well in probabilistic rainfall forecasting, is both competitive and computationally efficient, and supports combining a pre-trained encoder with lightweight probabilistic modeling. Abstract: This paper proposes a competitive and computationally efficient approach to probabilistic rainfall nowcasting. A video projector (V-JEPA Vision Transformer) coupled with a lightweight probabilistic head is attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF (eCDF) over 4-hour accumulated rainfall. The projector-head is optimized end-to-end over the Continuous Ranked Probability Score (CRPS). As an alternative, 3D-UNET baselines trained with an aggregate Rank Probability Score and a per-pixel Gamma-Hurdle objective are used. On the Weather4Cast 2025 benchmark, the proposed method achieved promising performance, with a CRPS of 3.5102, which represents an effectiveness gain of approximately 26% against the best 3D-UNET.
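
CRPS compares a forecast CDF with the step function at the observed value, integrating the squared difference over the variable's range. The sketch below approximates that integral for a CDF sampled at discrete thresholds; the left Riemann sum, the function name, and the toy thresholds are assumptions, since the abstract does not specify how the benchmark discretizes rainfall.

```python
def crps_from_ecdf(cdf_values, thresholds, observed):
    """Approximate CRPS for a forecast CDF sampled at sorted thresholds.

    cdf_values[k] is the forecast probability that rainfall is at most
    thresholds[k]. The integral of (F(x) - 1[x >= observed])**2 is
    approximated by a left Riemann sum over consecutive thresholds.
    """
    if len(cdf_values) != len(thresholds) or len(thresholds) < 2:
        raise ValueError("need matching sequences with at least two thresholds")
    total = 0.0
    for k in range(len(thresholds) - 1):
        width = thresholds[k + 1] - thresholds[k]
        step = 1.0 if thresholds[k] >= observed else 0.0
        total += (cdf_values[k] - step) ** 2 * width
    return total


# Toy rainfall thresholds in mm; lower CRPS is better.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
perfect = crps_from_ecdf([0.0, 0.0, 1.0, 1.0, 1.0], grid, observed=2.0)
vague = crps_from_ecdf([0.5, 0.5, 0.5, 0.5, 0.5], grid, observed=2.0)
wrong = crps_from_ecdf([1.0, 1.0, 1.0, 1.0, 1.0], grid, observed=2.0)
```

A forecast that places all its mass on a single value reduces CRPS to the absolute error, which is why the confidently wrong forecast above scores 2.0 for a 2 mm miss.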

[84] YOLO-Drone: An Efficient Object Detection Approach Using the GhostHead Network for Drone Images

Hyun-Ki Jung

Main category: cs.CV

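TL;DR: Proposes YOLO-Drone, an improved drone object detection model based on YOLOv11 that enhances the Head with a GhostHead network, improving detection accuracy and speed on high-altitude imagery.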

Details

Motivation: Drone images are usually captured from high altitude, so objects are small and hard to identify, and existing models leave room for improvement in accuracy and speed. Method: Starting from YOLOv11n, a GhostHead structure is introduced into the Head network to strengthen feature representation, and the model is validated on the VisDrone dataset. Result: YOLO-Drone outperforms the original YOLOv11 on Precision, Recall, F1-Score, and mAP (0.5), with gains of 0.4%, 0.6%, 0.5%, and 0.5% respectively, and inference speed also improves; it beats YOLOv8, YOLOv9, and YOLOv10 on mAP (0.5) by 0.1%, 0.3%, and 0.6% respectively. Conclusion: YOLO-Drone improves detection performance while keeping inference efficient, making it a high-performance detector suited to drone scenarios. Abstract: Object detection using images or videos captured by drones is a promising technology with significant potential across various industries. However, a major challenge is that drone images are typically taken from high altitudes, making object identification difficult. This paper proposes an effective solution to address this issue. The base model used in the experiments is YOLOv11, the latest object detection model, with a specific implementation based on YOLOv11n. The experimental data were sourced from the widely used and reliable VisDrone dataset, a standard benchmark in drone-based object detection. This paper introduces an enhancement to the Head network of the YOLOv11 algorithm, called the GhostHead Network. The model incorporating this improvement is named YOLO-Drone. Experimental results demonstrate that YOLO-Drone achieves significant improvements in key detection accuracy metrics, including Precision, Recall, F1-Score, and mAP (0.5), compared to the original YOLOv11. Specifically, the proposed model recorded a 0.4% increase in Precision, a 0.6% increase in Recall, a 0.5% increase in F1-Score, and a 0.5% increase in mAP (0.5). Additionally, the Inference Speed metric, which measures image processing speed, also showed a notable improvement. These results indicate that YOLO-Drone is a high-performance model with enhanced accuracy and speed compared to YOLOv11. To further validate its reliability, comparative experiments were conducted against other high-performance object detection models, including YOLOv8, YOLOv9, and YOLOv10.
The results confirmed that the proposed model outperformed YOLOv8 by 0.1% in mAP (0.5) and surpassed YOLOv9 and YOLOv10 by 0.3% and 0.6%, respectively.

[85] PhaseWin Search Framework Enable Efficient Object-Level Interpretation

Zihan Gu,Ruoyu Chen,Junchi Zhang,Yue Hu,Hua Zhang,Xiaochun Cao

Main category: cs.CV

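TL;DR: Proposes PhaseWin, an algorithm for efficient, faithful region attribution with near-linear complexity that sharply reduces computational cost while staying close to greedy-selection performance.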

Details

Motivation: Existing attribution methods based on submodular subset selection are highly faithful but computationally inefficient, which makes them hard to deploy in practice. Method: PhaseWin replaces conventional quadratic-cost greedy selection with a phased coarse-to-fine search that combines adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms. Result: PhaseWin reaches over 95% of greedy attribution faithfulness with only 20% of the computational budget and outperforms existing methods on object detection and visual grounding. Conclusion: PhaseWin greatly improves efficiency while keeping attribution faithful, offering a scalable new approach to attribution for multimodal models. Abstract: Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.
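
For context, the quadratic-cost greedy selection that PhaseWin is said to replace can be sketched as below. This is the baseline only, not PhaseWin itself, whose phases, pruning rule, and windowing are not described in enough detail here to reproduce. The coverage function stands in for a model score and the region names are hypothetical.

```python
def greedy_select(candidates, score, budget):
    """Baseline greedy subset selection.

    Each round re-scores every remaining candidate, so ranking all n regions
    costs on the order of n**2 score evaluations. Returns the chosen order
    and the number of candidate evaluations performed.
    """
    selected = []
    remaining = list(candidates)
    evaluations = 0
    while remaining and len(selected) < budget:
        base = score(selected)
        best, best_gain = None, float("-inf")
        for candidate in remaining:
            gain = score(selected + [candidate]) - base
            evaluations += 1  # one candidate evaluation (a model call in practice)
            if gain > best_gain:  # strict '>' keeps the earliest candidate on ties
                best, best_gain = candidate, gain
        selected.append(best)
        remaining.remove(best)
    return selected, evaluations


# Toy stand-in for a model score: number of distinct pixels covered.
# Set coverage is monotone submodular, matching the abstract's assumption.
REGIONS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}


def coverage(selection):
    covered = set()
    for name in selection:
        covered |= REGIONS[name]
    return len(covered)


chosen, cost = greedy_select(list(REGIONS), coverage, budget=2)
```

With four regions and a budget of two, the loop already spends seven candidate evaluations; the evaluation count is the quantity a phased search would aim to cut.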

[86] Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models

Zhixia He,Chen Zhao,Minglai Shao,Xintao Wu,Xujiang Zhao,Dong Li,Qin Tian,Linlin Yu

Main category: cs.CV

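TL;DR: Proposes a vision-language method based on positive and negative prompt supervision that optimizes class-specific prompts and uses a graph structure to strengthen OOD detection, outperforming existing methods on multiple benchmarks.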

Details

Motivation: Existing OOD detection methods based on negative prompts can give suboptimal results because the prompts include too many non-ID features, making it hard to separate ID from OOD samples accurately. Method: Large language models initialize class-specific positive and negative prompts, which capture within-class features and features near class boundaries respectively; a graph structure aggregates the semantic information and propagates it to the visual branch, sharpening the discrimination of an energy-based OOD detector. Result: Experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs show that the method outperforms state-of-the-art baselines. Conclusion: Positive and negative prompt supervision combined with a graph structure improves OOD detection in vision-language models and strengthens alignment and discrimination between the semantic and visual modalities. Abstract: Out-of-distribution (OOD) detection is committed to delineating the classification boundaries between in-distribution (ID) and OOD images. Recent advances in vision-language models (VLMs) have demonstrated remarkable OOD detection performance by integrating both visual and textual modalities. In this context, negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. However, these prompts often include a broad range of non-ID features, which may result in suboptimal outcomes due to the capture of overlapping or misleading information. To address this issue, we propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features and transfers this semantic knowledge to the visual modality to enhance OOD detection performance. Our method begins with class-specific positive and negative prompts initialized by large language models (LLMs). These prompts are subsequently optimized, with positive prompts focusing on features within each class, while negative prompts highlight features around category boundaries. Additionally, a graph-based architecture is employed to aggregate semantic-aware supervision from the optimized prompt representations and propagate it to the visual branch, thereby enhancing the performance of the energy-based OOD detector. Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs, demonstrate that our method outperforms state-of-the-art baselines.
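
An energy-based OOD detector typically scores an input by the negative log-sum-exp of its class logits, with lower energy indicating a more in-distribution input. The sketch below shows that commonly used score; whether the paper uses exactly this form, and what temperature it uses, is not stated in the abstract, so both are assumptions.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy E(x) = -T * logsumexp(logits / T).

    Lower energy suggests in-distribution; higher energy suggests OOD.
    The maximum is subtracted before exponentiating for numerical stability.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp


# A confident prediction has much lower energy than a flat one.
confident = energy_score([10.0, 0.0, 0.0])
uncertain = energy_score([0.0, 0.0, 0.0])
```

A detector then thresholds this value, with the threshold usually chosen on held-out in-distribution data so that a fixed fraction of ID samples is accepted.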

[87] Facial Expression Recognition with YOLOv11 and YOLOv12: A Comparative Study

Umma Aymon,Nur Shazwani Kamarudin,Ahmad Fakhri Ab. Nasir

Main category: cs.CV

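TL;DR: This study evaluates YOLOv11n and YOLOv12n for facial expression recognition (FER) on the FER2013 and KDEF datasets. YOLOv12n achieves higher mAP and greater sensitivity, while YOLOv11n has higher precision and suits noisy conditions. The study highlights the trade-off between sensitivity and precision in lightweight YOLO models for real-time, resource-constrained emotion-aware AI applications.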

Details

Motivation: Facial expression recognition remains challenging in unconstrained, real-world environments. The study explores how the latest lightweight YOLO models perform on FER within a unified detection and classification framework, aiming at efficient and accurate real-time use. Method: The FER2013 and KDEF classification datasets are converted into object detection format, and YOLOv11n and YOLOv12n are trained and evaluated using mAP 0.5, precision, recall, and confusion matrices. Result: YOLOv12n reaches a mAP 0.5 of 95.6 on KDEF and a mAP of 63.8 on FER2013, showing stronger sensitivity to varied expressions; YOLOv11n reaches a precision of 65.2 on FER2013, with fewer false positives and better reliability in noisy conditions. Both models confuse visually similar expressions more on FER2013, while class separation is clearer on KDEF. Conclusion: Lightweight YOLO models balance performance and efficiency well: YOLOv12n suits scenarios that demand high sensitivity, YOLOv11n suits complex real-world settings that demand high precision, and both are suitable for resource-constrained real-time emotion recognition systems. Abstract: Facial Expression Recognition remains a challenging task, especially in unconstrained, real-world environments. This study investigates the performance of two lightweight models, YOLOv11n and YOLOv12n, which are the nano variants of the latest official YOLO series, within a unified detection and classification framework for FER. Two benchmark classification datasets, FER2013 and KDEF, are converted into object detection format and model performance is evaluated using mAP 0.5, precision, recall, and confusion matrices. Results show that YOLOv12n achieves the highest overall performance on the clean KDEF dataset with a mAP 0.5 of 95.6, and also outperforms YOLOv11n on the FER2013 dataset in terms of mAP 63.8, reflecting stronger sensitivity to varied expressions. In contrast, YOLOv11n demonstrates higher precision 65.2 on FER2013, indicating fewer false positives and better reliability in noisy, real-world conditions. On FER2013, both models show more confusion between visually similar expressions, while clearer class separation is observed on the cleaner KDEF dataset. These findings underscore the trade-off between sensitivity and precision, illustrating how lightweight YOLO models can effectively balance performance and efficiency. The results demonstrate adaptability across both controlled and real-world conditions, establishing these models as strong candidates for real-time, resource-constrained emotion-aware AI applications.

[88] Heterogeneous Complementary Distillation

Liuchi Xu,Hao Zheng,Lu Wang,Lisheng Xu,Jun Cheng

Main category: cs.CV

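TL;DR: Proposes HCD, a simple yet effective heterogeneous complementary distillation framework that integrates complementary teacher and student features to align representations in shared logits, and introduces sub-logit decoupled distillation and an orthogonality loss to improve the diversity and efficiency of knowledge transfer, outperforming existing methods on multiple datasets.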

Details

Motivation: When handling heterogeneous architectures (e.g., Vision Transformer to ResNet), existing knowledge distillation methods suffer from large differences in feature representations, high computational cost, and over-reliance on logit alignment, which makes it hard to transfer complementary features effectively. Method: Proposes Heterogeneous Complementary Distillation (HCD), which processes the student's intermediate features with a convolutional projector and adaptive pooling, concatenates them with the teacher's penultimate-layer features, and passes the result through the Complementary Feature Mapper (CFM) to produce shared logits. It further designs Sub-logit Decoupled Distillation (SDD), which splits the shared logits into n sub-logits and fuses them with the teacher's logits to rectify classification, and introduces an Orthogonality Loss (OL) to promote diversity and reduce redundancy. Result: Experiments on CIFAR-100, fine-grained datasets (e.g., CUB200), and ImageNet-1K show that HCD significantly outperforms state-of-the-art knowledge distillation methods, especially with heterogeneous architectures. Conclusion: HCD is an efficient and general heterogeneous knowledge distillation framework that effectively fuses complementary teacher and student features, improving the student's generalization and robustness while reducing design complexity. Abstract: Knowledge distillation (KD) transfers the dark knowledge from a complex teacher to a compact student. However, heterogeneous architecture distillation, such as Vision Transformer (ViT) to ResNet18, faces challenges due to differences in spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to effectively address the disparity. Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. These logits are decomposed and constrained to facilitate diverse knowledge transfer to the student.
Specifically, HCD processes the student's intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher's feature from the penultimate layer, and then maps them via the Complementary Feature Mapper (CFM) module, comprising a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD), which partitions the shared logits into n sub-logits that are fused with the teacher's logits to rectify classification. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances robustness and generalization in students. Extensive experiments on the CIFAR-100, fine-grained (e.g., CUB200), and ImageNet-1K datasets demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD.
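The partitioning of shared logits into n sub-logits and the orthogonality constraint can be illustrated with a small sketch. The abstract does not give the exact form of the Orthogonality Loss, so the version below is an assumption: it penalizes the squared cosine similarity between every pair of sub-logit vectors, which is one common way to encourage orthogonality. The function names are hypothetical.

```python
import math


def split_sub_logits(shared, n):
    """Partition a flat shared-logit vector into n equal-length sub-logits."""
    if len(shared) % n != 0:
        raise ValueError("shared logits must divide evenly into n parts")
    size = len(shared) // n
    return [shared[i * size:(i + 1) * size] for i in range(n)]


def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def orthogonality_penalty(sub_logits):
    """Assumed loss form: sum of squared pairwise cosine similarities."""
    total = 0.0
    for i in range(len(sub_logits)):
        for j in range(i + 1, len(sub_logits)):
            total += cosine(sub_logits[i], sub_logits[j]) ** 2
    return total
```

Under this assumed form the penalty is zero when all sub-logits are mutually orthogonal and grows as they become more aligned, which matches the stated goal of reducing redundant knowledge transfer.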

[89] Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Image Segmentation

Xingyue Zhao,Wenke Huang,Xingguang Wang,Haoyu Zhao,Linghao Zhuang,Anwen Jiang,Guancheng Wan,Mang Ye

Main category: cs.CV

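TL;DR: This paper proposes FedBCS, a new method that addresses feature heterogeneity caused by device and protocol differences in federated medical image segmentation. It uses frequency-domain adaptive style recalibration and context-aware dual-level prototype alignment to obtain more robust domain-invariant feature representations.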

Details

Motivation: Because of the feature heterogeneity introduced by different medical devices and scanning protocols, existing federated learning methods for medical image segmentation suffer from incomplete contextual representations and accumulated style bias in intermediate layers, which limits generalization. Method: Proposes FedBCS, which introduces frequency-domain adaptive style recalibration to decouple content and style representations and learn optimal style parameters, and designs a context-aware dual-level prototype alignment mechanism that fuses multi-layer encoder and decoder features with contextual information for finer-grained feature alignment. Result: Extensive experiments on two public datasets show that the method significantly outperforms existing methods, with stronger robustness and segmentation accuracy. Conclusion: By building domain-invariant contextual prototypes and aligning them at multiple levels, FedBCS effectively mitigates feature heterogeneity and offers a new solution for federated medical image segmentation. Abstract: Federated learning enables multiple medical institutions to train a global model without sharing data, yet feature heterogeneity from diverse scanners or protocols remains a major challenge. Many existing works attempt to address this issue by leveraging model representations (e.g., mean feature vectors) to correct local training; however, they often face two key limitations: 1) Incomplete Contextual Representation Learning: Current approaches primarily focus on final-layer features, overlooking critical multi-level cues and thus diluting essential context for accurate segmentation. 2) Layerwise Style Bias Accumulation: Although utilizing representations can partially align global features, these methods neglect domain-specific biases within intermediate layers, allowing style discrepancies to build up and reduce model robustness. To address these challenges, we propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototype alignment. Specifically, we introduce a frequency-domain adaptive style recalibration into prototype construction that not only decouples content-style representations but also learns optimal style parameters, enabling more robust domain-invariant prototypes. Furthermore, we design a context-aware dual-level prototype alignment method that extracts domain-invariant prototypes from different layers of both encoder and decoder and fuses them with contextual information for finer-grained representation alignment. Extensive experiments on two public datasets demonstrate that our method exhibits remarkable performance.

[90] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu,Fangneng Zhan,Kaichen Zhou,Yilun Du,Paul Pu Liang,Hanspeter Pfister

Main category: cs.CV

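TL;DR: SandboxVLM is a framework that improves the 3D reasoning ability of vision-language models (VLMs) without additional training. By introducing abstract bounding boxes and a multi-view perception pipeline, it substantially enhances spatial intelligence in zero-shot settings.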

Details

Motivation: Existing VLMs perform poorly on 3D tasks such as spatial cognition and physical understanding because they are trained on 2D data, which makes extracting 3D information from 2D input inefficient and falls short of the needs of applications such as robotics and embodied agents. Method: Proposes the SandboxVLM framework with a 3D Sandbox reconstruction and perception pipeline of four stages: multi-view prior generation, proxy elevation estimation, multi-view voting and clustering, and 3D-aware reasoning. Abstract bounding boxes encode geometric structure and physical kinematics to bridge the modality gap between 3D tasks and 2D training. Result: In zero-shot evaluation across multiple benchmarks and VLM backbones, SandboxVLM consistently improves spatial intelligence, for example with an 8.3% gain over baseline methods on SAT Real. Conclusion: Equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training and opens a new direction for general-purpose embodied intelligence. Abstract: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real compared with baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

[91] DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition

Ren Zhang,Huilai Li,Chao qi,Guoliang Xu,Tianyu Zhou,Wei wei,Jianqin Yin

Main category: cs.CV

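TL;DR: This paper proposes DEFT-LLM, which uses a multi-expert disentangled architecture and the motion-semantics-aligned Uni-MER dataset to address the entanglement of static appearance and dynamic motion in micro-expression recognition, as well as the semantic mismatch between text labels and facial muscle movements, yielding more accurate and interpretable recognition.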

Details

Motivation: Existing micro-expression recognition methods struggle to separate static appearance from dynamic motion, and text labels do not accurately reflect facial muscle movements, which scatters model attention and produces semantic inconsistency. Method: Builds Uni-MER, a motion-driven instruction dataset constrained by both optical flow and Action Unit (AU) labels, and designs a multi-expert disentangled architecture with three experts (structure, dynamic texture, and motion semantics) to represent facial dynamics independently and align them semantically. Result: Achieves state-of-the-art performance on multiple challenging micro-expression recognition benchmarks, with a particular advantage in interpretable modeling of local facial motion. Conclusion: Through disentangled representation and the injection of physical priors, DEFT-LLM improves the accuracy and interpretability of micro-expression recognition and offers a new approach to visual emotion analysis based on large language models. Abstract: Micro expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, there are still two core challenges: (1) The entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) Textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion semantic alignment by multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion-semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro expressions while also leveraging the cross modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.

[92] Language-Guided Graph Representation Learning for Video Summarization

Wenrui Li,Wei Han,Hengyu Man,Wangmeng Zuo,Xiaopeng Fan,Yonghong Tian

Main category: cs.CV

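TL;DR: Proposes a Language-guided Graph Representation Learning Network (LGRLN) for video summarization, which improves summary quality by building video graphs and a cross-modal embedding module while substantially reducing model parameters and inference time.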

Details

Motivation: Existing methods struggle to capture global dependencies in video content and to accommodate multimodal user customization, and temporal proximity between frames does not always imply semantic proximity. Method: Designs a video graph generator that builds forward, backward, and undirected graphs to preserve temporal and contextual relationships; introduces an intra-graph relational reasoning module based on dual-threshold graph convolution; proposes a language-guided cross-modal embedding module that generates summaries conditioned on text descriptions; and models summary generation as a mixture of Bernoulli distributions solved with the EM algorithm. Result: Outperforms existing methods on multiple benchmarks while reducing inference time and model parameters by 87.8% and 91.7%, respectively. Conclusion: LGRLN improves the semantic accuracy and efficiency of video summarization and supports text-based personalized summary generation. Abstract: With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones between nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries with specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, the proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively. Our codes and pre-trained models are available at https://github.com/liwrui/LGRLN.
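To make the forward, backward, and undirected graph construction concrete, here is a minimal sketch. The abstract does not specify the edge rule, so the one used here is an assumption: two frames are connected when their similarity meets a threshold, with forward edges pointing to later frames and backward edges to earlier ones. The function name and the single `threshold` parameter are hypothetical, and the dual-threshold graph convolution itself is not modeled.

```python
def build_frame_graphs(similarity, threshold):
    """Build forward, backward, and undirected adjacency matrices over frames.

    similarity: square matrix of pairwise frame similarities (list of lists).
    Assumed edge rule: connect frames i and j when similarity >= threshold.
    """
    n = len(similarity)
    forward = [[0] * n for _ in range(n)]
    backward = [[0] * n for _ in range(n)]
    undirected = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or similarity[i][j] < threshold:
                continue
            undirected[i][j] = 1
            if i < j:
                forward[i][j] = 1   # edge from an earlier frame to a later one
            else:
                backward[i][j] = 1  # edge from a later frame to an earlier one
    return forward, backward, undirected
```

With this rule the forward graph is strictly upper triangular and the backward graph is its transpose whenever the similarity matrix is symmetric, so the two directed graphs keep temporal order while the undirected graph keeps only the contextual link.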

[93] Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition

Gunho Jung,Heejo Kong,Seong-Whan Lee

Main category: cs.CV

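TL;DR: Proposes TG-DFER, a text-guided weakly supervised framework for dynamic facial expression recognition that improves performance through semantic guidance and multi-grained temporal modeling.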

Details

Motivation: Addresses the many-to-one labeling problem in dynamic facial expression recognition, where a video of many frames carries a single label, and the shortcomings of existing MIL methods in handling the visual diversity of emotional expressions and the complexity of temporal dynamics. Method: Uses a vision-language pre-trained model to provide semantic guidance, introduces visual prompts that align textual emotion labels with visual features, and designs a multi-grained temporal network that captures short-term facial dynamics and long-range emotional flow. Result: Experiments show that TG-DFER achieves better generalization, interpretability, and temporal sensitivity under weak supervision. Conclusion: Through text guidance and temporal modeling, TG-DFER improves dynamic facial expression recognition and offers a new approach for weakly supervised learning. Abstract: Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.

[94] ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

Anzhe Cheng,Shukai Duan,Shixuan Li,Chenzhong Yin,Mingxi Cheng,Heng Ping,Tamoghna Chattopadhyay,Sophia I Thomopoulos,Shahin Nazarian,Paul Thompson,Paul Bogdan

Main category: cs.CV

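TL;DR: ERMoE is a new sparse Mixture-of-Experts architecture that reparameterizes experts in a learned orthonormal eigenbasis and routes using an eigenbasis score based on the similarity between the input and each expert's representation space. It addresses the unstable routing and uneven expert utilization of conventional MoE without an additional load-balancing loss, yielding more stable, interpretable, and higher-performing expert specialization.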

Details

Motivation: Conventional MoE models suffer from unstable routing and expert underutilization caused by a mismatch between router logits and the experts' internal structure, as well as performance bottlenecks caused by load imbalance. Existing remedies such as auxiliary load-balancing losses weaken expert specialization and hurt downstream performance. Method: Proposes ERMoE, which reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an "Eigenbasis Score" (the cosine similarity between input features and the expert's basis). This gives content-aware routing in which token assignment is tied directly to the experts' representation spaces, stabilizing utilization and improving interpretability while removing explicit load-balancing losses. Result: ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval (e.g., COCO, Flickr30K) and naturally produces a more balanced expert load distribution; its 3D MRI variant, ERMoE-ba, improves brain age prediction accuracy by more than 7% and shows anatomically interpretable expert specialization. Conclusion: ERMoE introduces a new architectural design principle for sparse expert models. By aligning the routing mechanism with the experts' internal structure, it addresses routing instability and load imbalance and achieves higher performance, better scalability, and interpretable expert specialization without an explicit balancing loss. Abstract: Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts but face two core challenges: misalignment between router logits and each expert's internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an "Eigenbasis Score", defined as the cosine similarity between input features and an expert's basis. This content-aware routing ties token assignments directly to experts' representation spaces, stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE removes the need for explicit balancing losses and avoids the interfering gradients they introduce. We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields anatomically interpretable expert specializations.
ERMoE thus introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and enables improved performance with scalable, interpretable specialization.
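The idea of routing by similarity to an expert's basis can be sketched in a few lines. The abstract defines the Eigenbasis Score only as a cosine similarity between input features and an expert's basis, so the concrete formula below is an assumption: the cosine between a token and its projection onto the span of an orthonormal basis, which equals the norm of the projection divided by the norm of the token. The function names and the top-k selection are hypothetical.

```python
import math


def eigenbasis_score(x, basis):
    """Cosine between x and its projection onto span(basis).

    basis: list of orthonormal vectors (assumed, not checked here).
    """
    coords = [sum(a * b for a, b in zip(x, vec)) for vec in basis]
    proj_norm = math.sqrt(sum(c * c for c in coords))
    x_norm = math.sqrt(sum(a * a for a in x))
    return proj_norm / x_norm if x_norm > 0 else 0.0


def route_top_k(x, expert_bases, k=1):
    """Pick the k experts whose subspaces best align with the token x."""
    scores = [eigenbasis_score(x, basis) for basis in expert_bases]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores
```

Because the score depends only on the token and the expert's own basis, there are no separate gating weights to drift out of alignment with the expert, which is the property the abstract credits for stable routing.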

Haoran Chen,Houze Xu,Micah Goldblum,Daoguo Dong,Zuxuan Wu

Main category: cs.CV

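TL;DR: This paper proposes DMC and DMC-OT to address classifier bias and distribution drift in CLIP-based class-incremental learning. By decoupling the learning of the vision encoder from that of the textual soft prompts and adding an optimal-transport calibration strategy, the methods achieve state-of-the-art performance on multiple datasets.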

Details

Motivation: In class-incremental learning, CLIP-based models are prone to classifier bias because, without data from old classes, the text prototypes overfit to new classes. Existing methods also overlook the drift in stored memory statistics caused by updates to the vision encoder. Method: Proposes DMC, a two-stage framework that trains the vision encoder and the textual soft prompts separately to preserve cross-modal alignment, and DMC-OT, which adds an optimal-transport guided calibration strategy to align memory statistics across stages and uses task-specific prompts to improve inter-task separability. Result: On CIFAR-100, ImageNet-R, CUB-200, and UCF-101, both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT improving accuracy by an average of 1.80%. Conclusion: By decoupling modality training and calibrating the memory distribution, the proposed methods mitigate classifier bias and distribution drift and significantly improve CLIP-based class-incremental learning. Abstract: Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.

[96] PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

Bowen Sun,Yujun Cai,Ming-Hsuan Yang,Hang Wu,Yiwei Wang

Main category: cs.CV

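TL;DR: This paper proposes Phase Aggregated Smoothing (PAS), a training-free method that mitigates the temporal inconsistency introduced by multimodal RoPE in video large language models. It applies phase offsets across attention heads and aggregates their outputs, which smooths the temporal kernel and improves temporal stability.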

Details

Motivation: Video LLMs suffer from temporal inconsistency: small changes in frame timing cause large swings in the attention distribution and hurt stability. The root cause is that the inverse Fourier time kernel induced by multimodal RoPE applies unstable scaling factors to adjacent frames. Method: Proposes Phase Aggregated Smoothing (PAS), which introduces small, opposed phase offsets in different attention heads and then aggregates the head outputs, smoothing the temporal kernel and reducing phase sensitivity without changing the positional encoding structure. Result: Theoretical analysis shows that PAS approximately preserves the content dot product while smoothing the temporal kernel, giving Lipschitz stability of attention under small temporal shifts; experiments show consistent improvements on multiple video understanding benchmarks with negligible computational overhead. Conclusion: PAS is a plug-and-play, training-free improvement that strengthens the robustness of temporal encoding in Video LLMs and gives existing models a simple, efficient upgrade path. Abstract: Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query-key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug-and-play upgrade for robust temporal encoding in Video LLMs.
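The claim that averaging over opposed offsets attenuates high-frequency ripples can be checked on a toy kernel. This is only an illustration of the underlying trigonometric identity, not the paper's implementation: it assumes the time kernel is a mean of cosines over a set of frequencies and models each head's phase offset as a small time shift, so the average of the two shifted kernels scales each frequency component by cos(w * offset). The function names and the kernel form are hypothetical.

```python
import math


def time_kernel(delta_t, freqs):
    """Toy time kernel: mean of cosines over the given angular frequencies."""
    return sum(math.cos(w * delta_t) for w in freqs) / len(freqs)


def averaged_kernel(delta_t, freqs, offset):
    """Average two copies of the kernel with opposed offsets (+offset, -offset)."""
    plus = time_kernel(delta_t + offset, freqs)
    minus = time_kernel(delta_t - offset, freqs)
    return 0.5 * (plus + minus)


def attenuation(w, offset):
    """Per-frequency scaling implied by the identity
    0.5 * (cos(w(t + d)) + cos(w(t - d))) = cos(w d) * cos(w t)."""
    return math.cos(w * offset)
```

For a small offset, `attenuation` stays close to 1 for low frequencies and shrinks for high ones, so in this toy setting the slow structure of the kernel is kept while fast ripples are damped.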

[97] Binary Verification for Zero-Shot Vision

Jeffrey Liu,Rongbin Hu

Main category: cs.CV

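TL;DR: Proposes a training-free binary verification workflow that uses quantization and binarization to improve the performance of existing vision-language models on zero-shot vision tasks.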

Details

Motivation: Existing zero-shot vision-language models are limited on open-ended queries and need more effective inference strategies to improve accuracy and generality. Method: Converts the open-ended question into a multiple-choice question (MCQ), then verifies each candidate answer with a True/False question and makes the final choice with a deterministic rule. Result: On referring expression grounding, spatial reasoning, and BLINK-Jigsaw, the method significantly outperforms direct generative answering and generalizes well. Conclusion: By relying on inference-time design instead of model training, this training-free binary verification workflow offers a practical and unified path to better zero-shot vision performance with current VLMs. Abstract: We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today's VLMs.
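The deterministic resolution rule described in the abstract is simple enough to sketch directly. The two callables `ask_true_false` and `ask_mcq` are hypothetical stand-ins for VLM queries, and one detail is an assumption: when no candidate is judged True, the sketch falls back to an MCQ over the full candidate list, since the abstract does not say what "remaining plausible candidates" means in that case.

```python
from typing import Callable, List


def binary_verify(
    question: str,
    candidates: List[str],
    ask_true_false: Callable[[str, str], bool],
    ask_mcq: Callable[[str, List[str]], str],
) -> str:
    """Resolve an MCQ by asking one True/False question per candidate."""
    accepted = [c for c in candidates if ask_true_false(question, c)]
    if len(accepted) == 1:
        # Exactly one candidate verified as True: select it.
        return accepted[0]
    # Otherwise revert to an MCQ over the plausible candidates.
    # Assumption: if none were accepted, all candidates remain plausible.
    remaining = accepted if accepted else list(candidates)
    return ask_mcq(question, remaining)
```

A caller would supply the candidate list produced by the quantization step, so the function covers only the binarization and resolution stage of the workflow.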

[98] Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation

Daxin Li,Yuanchao Bai,Kai Wang,Wenbo Zhao,Junjun Jiang,Xianming Liu

Main category: cs.CV

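TL;DR: This paper proposes HPAC, an efficient autoregressive model based on hierarchical parallelism and progressive adaptation for learned lossless image compression, which sets a new state of the art on a variety of datasets.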

Details

Motivation: Autoregressive models offer the best theoretical performance but are regarded as impractical because of their computational cost. This work rethinks that paradigm to make them both high-performing and practical. Method: Proposes the HPAC model, which uses a hierarchical factorized structure and content-aware convolutional gating; introduces two optimizations, CSI and AFC; and applies progressive fine-tuning through the SARP-FT strategy. Result: Achieves state-of-the-art compression on natural, satellite, and medical image datasets while keeping a small parameter count and competitive coding speeds. Conclusion: A carefully designed pure autoregressive framework can deliver significant gains at low computational overhead, re-establishing autoregression as a leading approach to lossless compression. Abstract: Autoregressive (AR) models, the theoretical performance benchmark for learned lossless image compression, are often dismissed as impractical due to prohibitive computational cost. This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation that re-establishes pure autoregression as a top-performing and practical solution. Our approach is embodied in the Hierarchical Parallel Autoregressive ConvNet (HPAC), an ultra-lightweight pre-trained model using a hierarchical factorized structure and content-aware convolutional gating to efficiently capture spatial dependencies. We introduce two key optimizations for practicality: Cache-then-Select Inference (CSI), which accelerates coding by eliminating redundant computations, and Adaptive Focus Coding (AFC), which efficiently extends the framework to high bit-depth images. Building on this efficient foundation, our progressive adaptation strategy is realized by Spatially-Aware Rate-Guided Progressive Fine-tuning (SARP-FT). This instance-level strategy fine-tunes the model for each test image by optimizing low-rank adapters on progressively larger, spatially-continuous regions selected via estimated information density. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression. Notably, our approach sets a new benchmark in learned lossless compression, showing a carefully designed AR framework can offer significant gains over existing methods with a small parameter count and competitive coding speeds.

[99] CLUE: Controllable Latent space of Unprompted Embeddings for Diversity Management in Text-to-Image Synthesis

Keunwoo Park,Jihye Chae,Joong Ho Ahn,Jihoon Kweon

Main category: cs.CV

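TL;DR: CLUE is a text-to-image generation framework based on Stable Diffusion that uses fixed-format prompts and a style encoder to produce diverse yet stable images without additional data, and it yields significant gains when used for data augmentation on ear medical images.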

Details

Motivation: In specialized fields such as medicine, where data types are limited and data are scarce, existing text-to-image methods struggle to achieve both diversity and stability, so a method is needed that generates high-quality images without relying on additional data. Method: Proposes the CLUE framework, which builds on the Stable Diffusion architecture, adds a Style Encoder that produces style embeddings, and injects them into a new second attention layer of the U-Net. KL divergence is used so that the latent space represents image features continuously within Gaussian regions, giving stable generation independent of prompts. Result: On the otitis media dataset, CLUE reduces FID from 46.81 to 9.30 and raises recall from 49.60% to 70.29%. A classifier trained on synthetic-only data at 1000% scale reaches an F1 score of 83.21% (vs. 73.83%); combining synthetic and real data reaches 94.76%, better than real data alone. On the external dataset, synthetic-only training reaches an F1 score of 76.77% (vs. 60.61%), and the combined approach reaches 85.78%. Conclusion: CLUE enables diverse yet stable image generation in data-limited specialized domains, supports data augmentation effectively, improves downstream classifier performance, and shows potential for domain-specific applications. Abstract: Text-to-image synthesis models require the ability to generate diverse images while maintaining stability. To overcome this challenge, a number of methods have been proposed, including the collection of prompt-image datasets and the integration of additional data modalities during training. Although these methods have shown promising results in general domains, they face limitations when applied to specialized fields such as medicine, where only limited types and insufficient amounts of data are available. We present CLUE (Controllable Latent space of Unprompted Embeddings), a generative model framework that achieves diverse generation while maintaining stability through fixed-format prompts without requiring any additional data. Based on the Stable Diffusion architecture, CLUE employs a Style Encoder that processes images and prompts to generate style embeddings, which are subsequently fed into a new second attention layer of the U-Net architecture. Through Kullback-Leibler divergence, the latent space achieves continuous representation of image features within Gaussian regions, independent of prompts. Performance was assessed on an otitis media dataset. CLUE reduced FID to 9.30 (vs. 46.81) and improved recall to 70.29% (vs. 49.60%). A classifier trained on synthetic-only data at 1000% scale achieved an F1 score of 83.21% (vs. 73.83%). Combining synthetic data with equal amounts of real data achieved an F1 score of 94.76%, higher than when using only real data.
On an external dataset, synthetic-only training achieved an F1 score of 76.77% (vs. 60.61%) at 1000% scale. The combined approach achieved an F1 score of 85.78%, higher than when using only the internal dataset. These results demonstrate that CLUE enables diverse yet stable image generation from limited datasets and serves as an effective data augmentation method for domain-specific applications.

Jiajun Chen,Sai Cheng,Yutao Yuan,Yirui Zhang,Haitao Yuan,Peng Peng,Yi Zhong

Main category: cs.CV

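TL;DR: Proposes PROMISE, a new multimodal framework that uses prompt-attentive hierarchical contrastive learning to obtain robust cross-modal representations when modalities are missing.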

Details

Motivation: Existing methods fail to preserve cross-modal consistency when modalities are missing, which degrades performance. Method: Incorporates multimodal prompt learning into a hierarchical contrastive learning framework and designs a prompt-attention mechanism that dynamically generates robust representations. Result: Experiments on multiple benchmark datasets show that PROMISE outperforms current state-of-the-art multimodal methods. Conclusion: PROMISE narrows the representational gap between complete and incomplete data and improves model robustness under missing modalities. Abstract: Multimodal models integrating natural language and visual information have substantially improved generalization of representation models. However, their effectiveness significantly declines in real-world situations where certain modalities are missing or unavailable. This degradation primarily stems from inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches typically address missing modalities through relatively simplistic generation methods, yet these approaches fail to adequately preserve cross-modal consistency, leading to suboptimal performance. To overcome this limitation, we propose a novel multimodal framework named PROMISE, a PROMpting-Attentive HIerarchical ContraStive LEarning approach designed explicitly for robust cross-modal representation under conditions of missing modalities. Specifically, PROMISE innovatively incorporates multimodal prompt learning into a hierarchical contrastive learning framework, equipped with a specially designed prompt-attention mechanism. This mechanism dynamically generates robust and consistent representations for scenarios where particular modalities are absent, thereby effectively bridging the representational gap between complete and incomplete data. Extensive experiments conducted on benchmark datasets, along with comprehensive ablation studies, clearly demonstrate the superior performance of PROMISE compared to current state-of-the-art multimodal methods.

[101] EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

Zongyang Qiu,Bingyuan Wang,Xingbei Chen,Yingqing He,Zeyu Wang

Main category: cs.CV

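TL;DR: This paper introduces EmoVid, the first emotion-annotated video dataset for creative media, and builds on it an emotion-conditioned video generation method that significantly improves the emotional expression and quality of generated videos.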

Details

Motivation: Existing video generation systems focus mostly on low-level visual metrics and neglect the affective dimension, and there is a lack of resources that connect emotion understanding with generative tasks, especially for non-realistic video. Method: Builds EmoVid, a multimodal emotion-annotated dataset covering cartoons, movie clips, and animated stickers, annotated with emotion labels, visual attributes, and text captions; analyzes the spatial and temporal patterns linking visual features to emotional perception; and fine-tunes the Wan2.1 model for emotion-conditioned video generation. Result: The systematic analysis reveals how visual features relate to emotion across different video types, and the proposed method significantly improves quantitative metrics and generation quality on both text-to-video and image-to-video tasks. Conclusion: EmoVid establishes a new benchmark for affective video computing, deepens the understanding of emotion analysis in artistically styled video, and provides practical methods for enhancing emotional expression in video generation. Abstract: Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.

[102] MeCaMIL: Causality-Aware Multiple Instance Learning for Fair and Interpretable Whole Slide Image Diagnosis

Yiran Song,Yikai Zhang,Shuang Zhou,Guojun Xiong,Xiaofeng Yang,Nian Wang,Fenglong Ma,Rui Zhang,Mingquan Lin

Main category: cs.CV

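TL;DR: Proposes MeCaMIL, a causality-aware multiple instance learning framework that explicitly models demographic confounders with structured causal graphs to improve fairness and interpretability in whole slide image analysis.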

Details

Motivation: Existing MIL methods in computational pathology lack causal interpretability and do not integrate patient demographic information, which leads to algorithmic bias and fairness problems across populations and hinders clinical adoption. Method: Builds MeCaMIL, an MIL framework based on causal inference (do-calculus and collider structures) that models demographic variables as confounders and separates disease-relevant signals from spurious demographic correlations. Result: Achieves state-of-the-art performance on three benchmarks (CAMELYON16, TCGA-Lung, and TCGA-Multi); fairness improves markedly, with demographic disparity variance reduced by more than 65% on average; the C-index on survival prediction improves by 0.017; and ablation studies confirm the key role of the causal structure. Conclusion: MeCaMIL provides a principled, fair, interpretable, and clinically practical AI framework for digital pathology. Abstract: Multiple instance learning (MIL) has emerged as the dominant paradigm for whole slide image (WSI) analysis in computational pathology, achieving strong diagnostic performance through patch-level feature aggregation. However, existing MIL methods face critical limitations: (1) they rely on attention mechanisms that lack causal interpretability, and (2) they fail to integrate patient demographics (age, gender, race), leading to fairness concerns across diverse populations. These shortcomings hinder clinical translation, where algorithmic bias can exacerbate health disparities. We introduce MeCaMIL, a causality-aware MIL framework that explicitly models demographic confounders through structured causal graphs. Unlike prior approaches treating demographics as auxiliary features, MeCaMIL employs principled causal inference -- leveraging do-calculus and collider structures -- to disentangle disease-relevant signals from spurious demographic correlations. Extensive evaluation on three benchmarks demonstrates state-of-the-art performance across CAMELYON16 (ACC/AUC/F1: 0.939/0.983/0.946), TCGA-Lung (0.935/0.979/0.931), and TCGA-Multi (0.977/0.993/0.970, five cancer types). Critically, MeCaMIL achieves superior fairness -- demographic disparity variance drops by over 65% (relative) on average across attributes, with notable improvements for underserved populations. The framework generalizes to survival prediction (mean C-index: 0.653, +0.017 over best baseline across five cancer types). Ablation studies confirm causal graph structure is essential -- alternative designs yield 0.048 lower accuracy and 4.2x worse fairness.
These results establish MeCaMIL as a principled framework for fair, interpretable, and clinically actionable AI in digital pathology. Code will be released upon acceptance.

[103] Draft and Refine with Visual Experts

Sungheon Jeong,Ryozo Masukawa,Jihong Park,Sanggeon Yun,Wenjun Huang,Hanning Chen,Mahdi Imani,Mohsen Imani

Main category: cs.CV

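TL;DR: Proposes the Draft and Refine (DnR) framework, which strengthens the visual grounding of large vision-language models and reduces hallucination by quantifying how much visual information they use.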

Details

Motivation: Existing large vision-language models rely too heavily on language priors and neglect visual evidence, producing ungrounded or hallucinated responses, and there is no quantitative measure of how much visual information they use. Method: Designs a question-conditioned utilization metric that builds a relevance map to localize key visual cues and measures visual dependence through relevance-guided probabilistic masking; guided by this metric, the DnR framework gathers feedback from external visual experts (e.g., detection boxes, masks), renders it as cues on the image, re-queries the model, and selects the response with the largest utilization gain as the refinement. Result: Experiments on VQA and image captioning tasks show consistent accuracy gains and reduced hallucination, confirming that the method strengthens visual grounding. Conclusion: Quantifying and optimizing visual information utilization improves the interpretability and evidence-based reasoning of multimodal models without fine-tuning or architectural changes. Abstract: While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.

[104] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

Xinlei Yu,Chengming Xu,Guibin Zhang,Zhangquan Chen,Yudong Zhang,Yongbo He,Peng-Tao Jiang,Jiangning Zhang,Xiaobin Hu,Shuicheng Yan

Main category: cs.CV

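TL;DR: Proposes VisMem, a framework that mimics human cognitive memory by adding dynamic latent vision memories to vision-language models, substantially improving performance on visual understanding, reasoning, and generation tasks.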

Details

Motivation: Addresses the "visual processing bottleneck" that vision-language models face on complex visual tasks, caused by loss of visual evidence and a lack of contextualized visual experience. Method: Inspired by human cognitive memory theory, designs a dynamic latent vision memory system with a short-term module (fine-grained perceptual retention) and a long-term module (abstract semantic consolidation), invoked dynamically during inference to maintain perceptual fidelity and semantic consistency. Result: Experiments on multiple visual benchmarks show an average performance gain of 11.8% over the base model, outperforming all compared methods. Conclusion: VisMem offers a new paradigm for strengthening memory in vision-language models, easing the visual processing bottleneck and improving performance on complex tasks. Abstract: Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a "visual processing bottleneck": a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available at https://github.com/YU-deep/VisMem.git.

[105] SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation

Sumin Yu,Taesup Moon

Main category: cs.CV

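TL;DR: SP-Guard is a safety method for diffusion-based image generation that estimates the harmfulness of a prompt and selectively guides only unsafe regions, reducing harmful content while avoiding unnecessary content changes.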

Details

Motivation: Existing inference-time guidance methods lack adaptivity and selectivity, so they cannot effectively address the social risks of harmful content generated by diffusion models. Method: Proposes SP-Guard, which estimates prompt harmfulness and generates a selective guidance mask so that safety guidance is applied only to detected unsafe regions during generation. Result: Experiments show that SP-Guard generates safer images than existing methods while substantially reducing interference with non-target regions and content distortion. Conclusion: SP-Guard improves the safety of text-to-image generation and underscores the importance of transparency and controllability in the generation process. Abstract: While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity--adjusting guidance strength based on the prompt--and selectivity--targeting only unsafe regions of the image. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.

[106] SUPER Decoder Block for Reconstruction-Aware U-Net Variants

Siheon Joo,Hongjo Kim

Main category: cs.CV

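TL;DR: Proposes SUPER, a plug-and-play decoder block that exploits the perfect reconstruction property of wavelets and selectively suppresses redundant features, improving the reconstruction quality and representational capacity of U-Net-style models in both high- and low-frequency settings.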

Details

Motivation: Existing U-Net variants lose information in inverse problems and struggle to recover high-frequency details, which limits performance. Method: Designs the SUPER block with a perfect reconstruction (PR) property and a selective suppression (SS) mechanism, integrated as a plug-and-play component into various U-Net architectures to prevent information degradation and increase representational diversity. Result: SUPER is validated on crack segmentation and smartphone image denoising: it markedly improves thin-crack segmentation on CrackVision12K and yields a PSNR gain on SIDD at comparable computational cost. Conclusion: Through its reconstruction-aware design, SUPER generalizes across U-Net architectures without added computational burden, performs well on both high- and low-frequency dominated tasks, and addresses the information bottleneck. Abstract: Skip-connected encoder-decoder architectures (U-Net variants) are widely adopted for inverse problems but still suffer from information loss, limiting recovery of fine high-frequency details. We present Selectively Suppressed Perfect Reconstruction (SUPER), which exploits the perfect reconstruction (PR) property of wavelets to prevent information degradation while selectively suppressing (SS) redundant features. Free from rigid framelet constraints, SUPER serves as a plug-and-play decoder block for diverse U-Net variants, eliminating their intrinsic reconstruction bottlenecks and enhancing representational richness. Experiments across diverse crack benchmarks, including state-of-the-art (SOTA) models, demonstrate the structural potential of the proposed SUPER Decoder Block. Maintaining comparable computational cost, SUPER enriches representational diversity through increased parameterization. In small-scale in-domain experiments on the CrackVision12K dataset, SUPER markedly improves thin-crack segmentation performance, particularly for cracks narrower than 4 px, underscoring its advantage in high-frequency dominant settings. In smartphone image denoising on SIDD, where low-frequency components prevail, SUPER still achieves a moderate gain in PSNR, confirming its robustness across low- and high-frequency regimes. These results validate its plug-and-play generality across U-Net variants, achieving high-frequency fidelity and global coherence within a unified, reconstruction-aware framework.

[107] AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Jirong Zha,Yuxuan Fan,Tianyu Zhang,Geng Chen,Yingfeng Chen,Chen Gao,Xinlei Chen

Main category: cs.CV

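TL;DR: Introduces AirCopBench, the first comprehensive benchmark for evaluating multimodal large language models (MLLMs) on embodied aerial collaborative perception, covering multiple tasks under challenging perceptual conditions; an evaluation of 40 MLLMs reveals performance gaps on collaborative perception tasks.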

Details

Motivation: Existing benchmarks mostly target simple perception tasks with single-sensor, high-quality images and do not evaluate MLLMs on multi-agent collaborative perception, especially under real-world degraded perception conditions, so a benchmark closer to complex real scenarios is needed. Method: Builds AirCopBench from simulator and real-world data, with 14.6k+ questions spanning four key task dimensions and 14 task types; questions are produced by combining model generation, rule generation, and human annotation under strict quality control; 40 MLLMs are evaluated and sim-to-real fine-tuning experiments are conducted. Result: Across 40 MLLMs, the best model trails humans by 24.38% on average on collaborative perception tasks and performs inconsistently across tasks; fine-tuning experiments confirm the feasibility of sim-to-real transfer. Conclusion: AirCopBench fills the gap in evaluating MLLMs on multi-agent collaborative perception, exposes the limitations of current models under complex, degraded perception conditions, and provides a benchmark and direction for future research. Abstract: Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.

[108] EmbryoDiff: A Conditional Diffusion Framework with Multi-Focal Feature Fusion for Fine-Grained Embryo Developmental Stage Recognition

Yong Sun,Zhengjie Zhang,Junyu Shi,Zhiyuan Zhang,Lijiang Liu,Qiang Nie

Main category: cs.CV

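TL;DR: Proposes EmbryoDiff, a two-stage diffusion-based framework for fine-grained recognition of embryo developmental stages in in vitro fertilization, improving classification accuracy through multi-focal feature fusion and a semantic-boundary condition block.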

Details

Motivation: Existing deep learning methods do not exploit the distributional prior of embryonic development, and their reliance on single-focal information gives incomplete representations that are vulnerable to cell occlusion. Method: Designs a two-stage diffusion model: the first stage trains and freezes a frame-level encoder to extract multi-focal features; the second stage uses a multi-focal feature fusion strategy to build a 3D-aware representation and injects semantic and boundary cues into the diffusion process through a hybrid condition block. Result: Achieves state-of-the-art performance on two benchmark datasets, reaching 82.8% and 81.3% average accuracy, respectively, with a single denoising step. Conclusion: EmbryoDiff effectively integrates the distributional prior with multi-focal information, markedly improving the robustness and accuracy of embryo developmental stage recognition. Abstract: Identification of fine-grained embryo developmental stages during In Vitro Fertilization (IVF) is crucial for assessing embryo viability. Although recent deep learning methods have achieved promising accuracy, existing discriminative models fail to utilize the distributional prior of embryonic development to improve accuracy. Moreover, their reliance on single-focal information leads to incomplete embryonic representations, making them susceptible to feature ambiguity under cell occlusions. To address these limitations, we propose EmbryoDiff, a two-stage diffusion-based framework that formulates the task as a conditional sequence denoising process. Specifically, we first train and freeze a frame-level encoder to extract robust multi-focal features. In the second stage, we introduce a Multi-Focal Feature Fusion Strategy that aggregates information across focal planes to construct a 3D-aware morphological representation, effectively alleviating ambiguities arising from cell occlusions. Building on this fused representation, we derive complementary semantic and boundary cues and design a Hybrid Semantic-Boundary Condition Block to inject them into the diffusion-based denoising process, enabling accurate embryonic stage classification. Extensive experiments on two benchmark datasets show that our method achieves state-of-the-art results. Notably, with only a single denoising step, our model obtains the best average test performance, reaching 82.8% and 81.3% accuracy on the two datasets, respectively.

[109] Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types

Chi-Yu Chen,Rawan Abulibdeh,Arash Asgari,Leo Anthony Celi,Deirdre Goode,Hassan Hamidi,Laleh Seyyed-Kalantari,Po-Chih Kuo,Ned McCague,Thomas Sounack

Main category: cs.CV

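TL;DR: Deep learning models can predict a patient's health insurance type (a proxy for socioeconomic status) from normal chest X-rays, revealing traces of social inequality hidden in medical images and challenging the traditional assumption that medical images are purely biological data.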

Details

Motivation: To explore whether deep learning models can extract latent information related to socioeconomic status from medical images, and to test whether that information exists independently of demographic variables. Method: Trains state-of-the-art deep vision models (DenseNet121, SwinV2-B, MedMamba) on the MIMIC-CXR-JPG and CheXpert datasets to predict patients' insurance type; localizes the source of the signal by controlling for age, race, and sex and by patch-based occlusion analysis. Result: Models predict insurance type significantly on both datasets (AUC about 0.67-0.68); the signal persists after controlling for demographic variables and remains detectable when training within a single racial group; the signal is diffuse, located mainly in the upper and mid-thoracic regions. Conclusion: Medical images may encode latent features of clinical environments and structural social differences, and AI models can capture these "social fingerprints"; this calls for rethinking fairness in medical AI, where social bias must be analyzed and disentangled at the data source rather than addressed only by adjusting models or balancing datasets. Abstract: Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient's health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.

[110] Accelerating Controllable Generation via Hybrid-grained Cache

Lin Liu,Huixia Ben,Shuo Wang,Jinda Lu,Junxiang Qiu,Shengeng Tang,Yanbin Hao

Main category: cs.CV

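TL;DR: Proposes Hybrid-Grained Cache (HGC), which applies cache strategies of different granularities at different computational stages, substantially reducing the computational overhead of controllable generative models while preserving visual quality.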

Details

Motivation: Controllable generative models are widely used to improve the realism of synthetic visual content, but the computation required for both control conditions and content generation is high, which makes generation inefficient. Method: Proposes Hybrid-Grained Cache (HGC): (1) a coarse-grained (block-level) cache based on feature reuse across encoder-decoder blocks that dynamically skips redundant computation; (2) a fine-grained (prompt-level) cache within modules that reuses cross-attention maps and extends them to the module computations of adjacent steps. Both caches integrate seamlessly into each computational stage of the generation pipeline. Result: HGC is validated on four benchmark datasets and balances generation efficiency against visual quality. For example, on the COCO-Stuff segmentation task, computation (MACs) drops by 63% (from 18.22T to 6.70T) with the loss in semantic fidelity kept within 1.5%. Conclusion: HGC's multi-granularity caching lowers the computational overhead of controllable generative models and improves generation efficiency while keeping generation quality high. Abstract: Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must handle control conditions and content generation computational requirements, resulting in generally low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained cache (block-level) based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between each step of model reasoning. (2) We design a fine-grained cache (prompt-level) that acts within a module, where the fine-grained cache reuses cross-attention maps within consecutive reasoning steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into each computational link of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, our HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantized performance degradation) within 1.5%.
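
As a quick check of the reported saving, the 63% figure follows directly from the two MAC counts quoted in the abstract. The variable names below are illustrative only.

```python
# MAC counts quoted in the abstract for the COCO-Stuff segmentation benchmark.
macs_before = 18.22e12  # without caching
macs_after = 6.70e12    # with HGC

reduction = 1.0 - macs_after / macs_before   # fraction of MACs removed
compute_ratio = macs_before / macs_after     # how many times less compute

print(f"reduction: {reduction:.1%}, compute ratio: {compute_ratio:.2f}x")
# prints "reduction: 63.2%, compute ratio: 2.72x"
```

A 2.72x reduction in MACs is an upper bound on wall-clock speedup; the abstract does not report latency.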

[111] MPCGNet: A Multiscale Feature Extraction and Progressive Feature Aggregation Network Using Coupling Gates for Polyp Segmentation

Wei Wang,Feng Jiang,Xin Wang

Main category: cs.CV

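TL;DR: Proposes MPCGNet, a new network for colorectal polyp segmentation whose three modules use a coupling gate mechanism to address missed small polyps, ambiguous boundaries, and image noise, outperforming existing methods on several datasets.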

Details

Motivation: Existing polyp segmentation methods handle small polyps, ambiguous boundaries, and noise poorly, so more robust models are needed to improve the accuracy of computer-aided clinical diagnosis. Method: Proposes MPCGNet, built from three modules: coupling gates multiscale feature extraction (CGMFE), a windows cross attention decoder (WCAD), and decoder feature aggregation (DFA); the coupling gates suppress noise and select important features, improving small-polyp detection and boundary detail recovery. Result: On the ETIS-LaribPolypDB and CVC-ColonDB datasets, MPCGNet's mDice is 2.20% and 0.68% higher, respectively, than that of the second-best model. Conclusion: The coupling gate mechanism improves polyp segmentation accuracy, particularly for small-polyp recognition and noise suppression, and the network shows good potential for clinical use. Abstract: Automatic segmentation of polyps is crucial for assisting doctors in colorectal polyp screening and cancer diagnosis. Despite the progress made by existing methods, polyp segmentation faces several challenges: (1) small-sized polyps are prone to being missed during identification, (2) the boundaries between polyps and the surrounding environment are often ambiguous, (3) noise in colonoscopy images, caused by uneven lighting and other factors, affects segmentation results. To address these challenges, this paper introduces coupling gates as components in specific modules to filter noise and perform feature importance selection. Three modules are proposed: the coupling gates multiscale feature extraction (CGMFE) module, which effectively extracts local features and suppresses noise; the windows cross attention (WCAD) decoder module, which restores details after capturing the precise location of polyps; and the decoder feature aggregation (DFA) module, which progressively aggregates features, further extracts them, and performs feature importance selection to reduce the loss of small-sized polyps. Experimental results demonstrate that MPCGNet outperforms recent networks, with mDice scores 2.20% and 0.68% higher than the second-best network on the ETIS-LaribPolypDB and CVC-ColonDB datasets, respectively.

[112] CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

Pooja Singh,Siddhant Ujjain,Tapan Kumar Gandhi,Sandeep Kumar

Main category: cs.CV

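TL;DR: Introduces CrossMed, a benchmark for evaluating the compositional generalization of medical multimodal large language models on unseen modality-anatomy-task combinations. Four public datasets are unified into a visual question answering format, yielding 20,200 multiple-choice QA instances. Experiments show that existing models perform well in the Related setting but degrade significantly in the Unrelated and zero-overlap settings, which highlights the difficulty of the task. The work also demonstrates effective cross-task transfer and finds that multimodal LLMs outperform traditional models at compositional generalization.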

Details

Motivation: Medical multimodal large models have advanced in unified processing of vision and text, but their compositional generalization across imaging modalities, anatomical regions, and task types remains unclear, and no systematic benchmark exists. Method: Proposes the CrossMed benchmark with a structured Modality-Anatomy-Task (MAT) schema, unifying four public datasets (CheXpert, SIIM-ACR, BraTS 2020, and MosMedData) into a multiple-choice visual question answering format. Related, Unrelated, and zero-overlap splits are built to evaluate models such as LLaVA-Vicuna-7B and Qwen2-VL-7B, with comparison against traditional models such as ResNet-50 and U-Net. Result: On the Related split, models reach 83.2% classification accuracy and 0.75 segmentation cIoU; performance drops significantly under the Unrelated and zero-overlap settings. Cross-task transfer improves segmentation cIoU by 7%. Traditional models show limited gains, while multimodal LLMs generalize compositionally better. Conclusion: CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and cross-modality compositional generalization in medical multimodal models, supporting the development of more general medical AI. Abstract: Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification) into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
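
The zero-overlap condition described in the abstract (test triplets share no Modality, Anatomy, or Task with the training data) can be stated as a small check. This is a minimal sketch, not the benchmark's own tooling: the function name and the anatomy labels are illustrative assumptions.

```python
def is_zero_overlap(train_triplets, test_triplets):
    """True if no test (modality, anatomy, task) triplet shares any single
    component with any training triplet."""
    train_modalities = {m for m, _, _ in train_triplets}
    train_anatomies = {a for _, a, _ in train_triplets}
    train_tasks = {t for _, _, t in train_triplets}
    return all(
        m not in train_modalities
        and a not in train_anatomies
        and t not in train_tasks
        for m, a, t in test_triplets
    )


# Hypothetical MAT triplets; the anatomy labels are assumed for illustration.
train = [("X-ray", "chest", "classification")]
zero_overlap_test = [("MRI", "brain", "segmentation")]
partial_overlap_test = [("MRI", "brain", "classification")]  # shares the task

print(is_zero_overlap(train, zero_overlap_test))     # True
print(is_zero_overlap(train, partial_overlap_test))  # False
```

The check is stricter than requiring unseen triplets: a test triplet whose combination is new but whose task was seen in training still fails it.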

[113] SemanticNN: Compressive and Error-Resilient Semantic Offloading for Extremely Weak Devices

Jiaming Huang,Yi Gao,Fuchang Pan,Renjie Li,Wei Dong

Main category: cs.CV

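TL;DR: Proposes SemanticNN, a semantic codec for extremely weak embedded devices that tolerates bit-level errors while preserving semantic-level correctness, enabling efficient and robust device-edge collaborative inference.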

Details

Motivation: The rapid growth of the IoT is pushing AI onto extremely resource-constrained embedded devices, but traditional approaches focus on bit-level transmission correctness, which is inefficient under dynamic channel conditions; a more efficient error-tolerance mechanism is needed to cope with resource and network limits. Method: Proposes SemanticNN, with a BER-aware decoder that adapts to dynamic channel conditions and a soft quantization (SQ) encoder that learns compact representations; introduces Feature-augmentation Learning to improve offloading efficiency and an XAI-based asymmetry compensation mechanism to mitigate the capability mismatch between encoder and decoder. Result: Experiments on STM32 with three models and six datasets show that, across different bit error rates, SemanticNN reduces feature transmission volume by 56.82-344.83x while maintaining high inference accuracy. Conclusion: Through semantic-level error tolerance, SemanticNN substantially improves the communication efficiency and robustness of device-edge collaborative inference and suits resource-constrained applications with unstable networks. Abstract: With the rapid growth of the Internet of Things (IoT), integrating artificial intelligence (AI) on extremely weak embedded devices has garnered significant attention, enabling improved real-time performance and enhanced data privacy. However, the resource limitations of such devices and unreliable network conditions necessitate error-resilient device-edge collaboration systems. Traditional approaches focus on bit-level transmission correctness, which can be inefficient under dynamic channel conditions. In contrast, we propose SemanticNN, a semantic codec that tolerates bit-level errors in pursuit of semantic-level correctness, enabling compressive and resilient collaborative inference offloading under strict computational and communication constraints. It incorporates a Bit Error Rate (BER)-aware decoder that adapts to dynamic channel conditions and a Soft Quantization (SQ)-based encoder to learn compact representations. Building on this architecture, we introduce Feature-augmentation Learning, a novel training strategy that enhances offloading efficiency. To address encoder-decoder capability mismatches from asymmetric resources, we propose XAI-based Asymmetry Compensation to enhance decoding semantic fidelity. We conduct extensive experiments on STM32 using three models and six datasets across image classification and object detection tasks. Experimental results demonstrate that, under varying transmission error rates, SemanticNN significantly reduces feature transmission volume by 56.82-344.83x while maintaining superior inference accuracy.

[114] Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval

Wenrui Li,Yidan Lu,Yeyu Chai,Rui Zhao,Hengyu Man,Xiaopeng Fan

Main category: cs.CV

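TL;DR: Proposes H²ARN, a hyperbolic-space model for text-3D retrieval that addresses hierarchy representation collapse and redundancy-induced saliency dilution, and releases the larger T3DR-HIT v2 benchmark.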

Details

Motivation: Existing text-3D retrieval methods face two major challenges, Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD), which weaken their ability to distinguish fine-grained semantics and hard negatives. Method: Embeds text and 3D data in a Lorentz-model hyperbolic space, whose exponential volume growth preserves hierarchical structure; designs a hierarchical ordering loss that builds entailment cones, combined with an instance-level contrastive loss for discrimination; proposes a contribution-aware hyperbolic aggregation module that weights local features by Lorentzian distance to suppress redundancy. Result: The method is validated on the newly released T3DR-HIT v2 benchmark (8,935 pairs, 2.6 times the original size) and significantly outperforms existing methods, particularly on fine-grained cultural artefacts and complex indoor scenes. Conclusion: By modeling in hyperbolic space, H²ARN mitigates HRC and RISD, improves the accuracy and robustness of text-3D retrieval, and provides a larger evaluation benchmark for future research. Abstract: With the daily influx of 3D data on the internet, text-3D retrieval has gained increasing attention. However, current methods face two major challenges: Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD). HRC compresses abstract-to-specific and whole-to-part hierarchies in Euclidean embeddings, while RISD averages noisy fragments, obscuring critical semantic cues and diminishing the model's ability to distinguish hard negatives. To address these challenges, we introduce the Hyperbolic Hierarchical Alignment Reasoning Network (H$^{2}$ARN) for text-3D retrieval. H$^{2}$ARN embeds both text and 3D data in a Lorentz-model hyperbolic space, where exponential volume growth inherently preserves hierarchical distances. A hierarchical ordering loss constructs a shrinking entailment cone around each text vector, ensuring that the matched 3D instance falls within the cone, while an instance-level contrastive loss jointly enforces separation from non-matching samples. To tackle RISD, we propose a contribution-aware hyperbolic aggregation module that leverages Lorentzian distance to assess the relevance of each local feature and applies contribution-weighted aggregation guided by hyperbolic geometry, enhancing discriminative regions while suppressing redundancy without additional supervision. We also release the expanded T3DR-HIT v2 benchmark, which contains 8,935 text-to-3D pairs, 2.6 times the original size, covering both fine-grained cultural artefacts and complex indoor scenes. Our codes are available at https://github.com/liwrui/H2ARN.
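
For readers unfamiliar with the Lorentz model, the sketch below shows the standard construction the abstract refers to: points live on the hyperboloid $-x_0^2 + \lVert x \rVert^2 = -1$ and distance is $\operatorname{arccosh}(-\langle x, y \rangle_L)$. It assumes curvature -1 and uses the geodesic distance; the abstract does not say which curvature or which Lorentzian distance variant H$^{2}$ARN uses, and the helper names are illustrative.

```python
import math


def lift_to_hyperboloid(v):
    """Map a Euclidean vector v onto the hyperboloid -x0^2 + |x|^2 = -1."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); the clamp guards against
    rounding that would push the argument slightly below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid([0.0, 0.0])
near = lift_to_hyperboloid([0.5, 0.0])
far = lift_to_hyperboloid([5.0, 0.0])

print(lorentz_distance(origin, near) < lorentz_distance(origin, far))
```

Distance from the origin grows only logarithmically in the Euclidean norm of the lifted vector, which is one way to see why hyperbolic space has room for tree-like hierarchies.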

[115] PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI

Sun Jo,Seok Young Hong,JinHyun Kim,Seungmin Kang,Ahjin Choi,Don-Gwan An,Simon Song,Je Hyeong Hong

Main category: cs.CV

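TL;DR: Proposes the PINGS-X framework, which uses axes-aligned spatiotemporal Gaussian representations for efficient super-resolution of 4D flow MRI data, substantially shortening training time and improving accuracy.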

Details

Motivation: Existing MRI super-resolution methods based on physics-informed neural networks (PINNs) must be trained separately for each patient, which takes too long and limits clinical use. Method: Inspired by 3D Gaussian splatting (3DGS), proposes PINGS-X, which adopts axes-aligned spatiotemporal Gaussian representations and introduces normalized Gaussian splatting and a Gaussian merging procedure, among other innovations, to improve training efficiency and stability. Result: On computational fluid dynamics (CFD) and real 4D flow MRI datasets, PINGS-X substantially reduces training time while achieving better super-resolution accuracy. Conclusion: PINGS-X offers an efficient and accurate super-resolution solution for 4D flow MRI with good prospects for clinical application. Abstract: 4D flow magnetic resonance imaging (MRI) is a reliable, non-invasive approach for estimating blood flow velocities, vital for cardiovascular diagnostics. Unlike conventional MRI focused on anatomical structures, 4D flow MRI requires high spatiotemporal resolution for early detection of critical conditions such as stenosis or aneurysms. However, achieving such resolution typically results in prolonged scan times, creating a trade-off between acquisition speed and prediction accuracy. Recent studies have leveraged physics-informed neural networks (PINNs) for super-resolution of MRI data, but their practical applicability is limited as the prohibitively slow training process must be performed for each patient. To overcome this limitation, we propose PINGS-X, a novel framework modeling high-resolution flow velocities using axes-aligned spatiotemporal Gaussian representations. Inspired by the effectiveness of 3D Gaussian splatting (3DGS) in novel view synthesis, PINGS-X extends this concept through several non-trivial novel innovations: (i) normalized Gaussian splatting with a formal convergence guarantee, (ii) axes-aligned Gaussians that simplify training for high-dimensional data while preserving accuracy and the convergence guarantee, and (iii) a Gaussian merging procedure to prevent degenerate solutions and boost computational efficiency. Experimental results on computational fluid dynamics (CFD) and real 4D flow MRI datasets demonstrate that PINGS-X substantially reduces training time while achieving superior super-resolution accuracy. Our code and datasets are available at https://github.com/SpatialAILab/PINGS-X.

[116] NP-LoRA: Null Space Projection Unifies Subject and Style in LoRA Fusion

Chuheng Chen,Xiaofei Zhou,Geyuan Zhang,Yong Huang

Main category: cs.CV

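TL;DR: Proposes NP-LoRA, a projection-based LoRA fusion method that maps different LoRAs into orthogonal subspaces to avoid structural interference, improving subject fidelity and style consistency in generated content.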

Details

Motivation: Existing LoRA fusion methods merge weights, so one LoRA tends to dominate the other, causing structural interference that degrades generation quality. Method: Uses SVD to extract the principal style directions and projects the subject LoRA into their orthogonal null space; introduces a soft projection mechanism to balance subject fidelity against style consistency. Result: Experiments show that NP-LoRA outperforms strong baselines across a range of backbones and LoRA pairs, with better DINO and CLIP metrics and better human and LLM preference scores. Conclusion: Through subspace separation, NP-LoRA mitigates structural interference in LoRA fusion and achieves high-quality controllable generation without retraining. Abstract: Low-Rank Adaptation (LoRA) fusion has emerged as a key technique for reusing and composing learned subject and style representations for controllable generation without costly retraining. However, existing methods rely on weight-based merging, where one LoRA often dominates the other, leading to interference and degraded fidelity. This interference is structural: separately trained LoRAs occupy low-rank high-dimensional subspaces, leading to non-orthogonal and overlapping representations. In this work, we analyze the internal structure of LoRAs and find their generative behavior is dominated by a few principal directions in the low-rank subspace, which should remain free from interference during fusion. To achieve this, we propose Null Space Projection LoRA (NP-LoRA), a projection-based framework for LoRA fusion that enforces subspace separation to prevent structural interference among principal directions. Specifically, we first extract principal style directions via singular value decomposition (SVD) and then project the subject LoRA into its orthogonal null space. Furthermore, we introduce a soft projection mechanism that enables smooth control over the trade-off between subject fidelity and style consistency. Experiments show NP-LoRA consistently improves fusion quality over strong baselines (e.g., DINO and CLIP-based metrics, with human and LLM preference scores), and applies broadly across backbones and LoRA pairs without retraining.
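
The SVD-then-project step can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say whether the projection acts on the input side, the output side, or on flattened weights, so the example uses right singular vectors of the style update; the `strength` interpolation stands in for the paper's soft projection, whose actual formula is not given; and all names are hypothetical.

```python
import numpy as np


def null_space_project(delta_subject, delta_style, k, strength=1.0):
    """Suppress the subject update along the top-k principal style directions.

    strength=1.0 is a hard null-space projection; values in (0, 1) give a
    softened variant. This interpolation is an illustrative assumption, not
    the paper's soft-projection formula.
    """
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(delta_style, full_matrices=False)
    v_k = vt[:k].T  # shape (d_in, k), orthonormal columns
    projector = np.eye(delta_subject.shape[1]) - strength * (v_k @ v_k.T)
    return delta_subject @ projector


rng = np.random.default_rng(0)
# Hypothetical rank-2 LoRA updates (B @ A) for an 8x6 weight matrix.
style = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
subject = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))

fused = style + null_space_project(subject, style, k=2)
```

With the hard projection, the fused update responds to inputs along the principal style directions exactly as the style update alone would, which is the sense in which those directions stay free from interference.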

[117] CareCom: Generative Image Composition with Calibrated Reference Features

Jiaxuan Chen,Bo Zhang,Qingdong He,Jinlong Peng,Li Niu

Main category: cs.CV

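TL;DR: Proposes a multi-reference generative image composition method that calibrates global and local features to balance foreground detail preservation against pose and view adjustment.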

Details

Motivation: Existing image composition methods struggle to preserve foreground details while also adjusting the foreground's pose or view. Method: Extends a generative composition model to a multi-reference version and calibrates the global and local features of the foreground reference images so that they are compatible with the background information. Result: Experiments on the MVImgNet and MureCom datasets show that the calibrated reference features markedly improve generation. Conclusion: The proposed feature calibration mechanism strengthens both detail preservation and pose adjustment in multi-reference image composition. Abstract: Image composition aims to seamlessly insert a foreground object into a background. Despite the huge progress in generative image composition, the existing methods still struggle with simultaneous detail preservation and foreground pose/view adjustment. To address this issue, we extend the existing generative composition model to a multi-reference version, which allows using an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of foreground reference images to make them compatible with the background information. The calibrated reference features can supplement the original reference features with useful global and local information of proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model can greatly benefit from the calibrated reference features.

[118] LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Dor Shmilovich,Tony Wu,Aviad Dahan,Yuval Domb

Main category: cs.CV

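TL;DR: Proposes LiteAttention, which exploits the temporal coherence of diffusion attention across denoising steps by marking non-essential regions early and propagating skip decisions forward, substantially accelerating Diffusion Transformers for video generation without loss of quality.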

Details

Motivation: Existing acceleration methods trade computational overhead against efficiency when choosing between dynamic sparse attention and static sparsity patterns, and struggle to combine adaptivity with low latency. Method: Observing that the sparsity patterns of diffusion attention are strongly coherent across denoising steps, proposes LiteAttention, which identifies non-essential tiles early and propagates skip decisions across the denoising sequence, combining the adaptivity of dynamic methods with the efficiency of static ones. Result: Achieves substantial speedups on production video diffusion models with an optimized implementation built on FlashAttention and no degradation in quality. Conclusion: LiteAttention overcomes the limitations of conventional sparse attention methods, greatly reducing computational latency while preserving generation quality, and suits high-quality video generation. Abstract: Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+\delta$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
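
The skip-propagation idea can be illustrated with a toy bookkeeping loop. This is only a sketch of the control flow, not the LiteAttention kernel: the per-tile importance scores, the threshold rule, and the function name are all assumptions made for illustration, and the real criterion for marking a tile non-essential is not described in the abstract.

```python
def propagate_skips(tile_scores_per_step, threshold):
    """Return, for each denoising step, the tile indices actually computed.

    tile_scores_per_step[t][i] is a hypothetical importance score for tile i
    at step t. A tile whose score falls below `threshold` is marked
    non-essential and is skipped at every later step.
    """
    skipped = set()
    computed_per_step = []
    for scores in tile_scores_per_step:
        active = [i for i in range(len(scores)) if i not in skipped]
        computed_per_step.append(active)
        # Decisions made at this step carry forward to all later steps.
        skipped.update(i for i in active if scores[i] < threshold)
    return computed_per_step


scores = [
    [0.9, 0.05, 0.4, 0.01],
    [0.8, 0.50, 0.02, 0.30],
    [0.7, 0.60, 0.50, 0.40],
]
computed = propagate_skips(scores, threshold=0.1)
print(computed)  # [[0, 1, 2, 3], [0, 2], [0]]
```

Scores of already-skipped tiles are never consulted, which mirrors the point of the method: once a tile is skipped, no per-step profiling is spent on it again. The toy example also shows the cost of that choice, since tiles 1 and 3 would have scored highly at later steps; the approach relies on the temporal coherence the abstract reports.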

[119] From Retinal Pixels to Patients: Evolution of Deep Learning Research in Diabetic Retinopathy Screening

Muskaan Chopra,Lorenz Sparrenberg,Armin Berger,Sarthak Khanna,Jan H. Terheyden,Rafet Sifa

Main category: cs.CV

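TL;DR: This survey systematically reviews progress in deep learning research on diabetic retinopathy (DR) from 2016 to 2025, covering self-supervised and semi-supervised learning, domain generalization, federated learning, and neuro-symbolic models. It assesses more than 50 studies and more than 20 datasets, identifies open problems such as multi-center validation and clinical trust, and outlines a roadmap toward reproducible, privacy-preserving, and clinically deployable DR AI.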

Details

Motivation: Early detection of diabetic retinopathy is critical for preventing vision loss. Deep learning in this area has advanced rapidly in recent years, but problems of data imbalance, label scarcity, domain shift, and interpretability remain, and a systematic synthesis is needed. Method: The paper systematically reviews more than 50 studies and more than 20 datasets from 2016 to 2025, analyzes advances in self- and semi-supervised learning, domain generalization, federated training, and hybrid neuro-symbolic models, and compares the performance of different methods across datasets. Result: It compiles benchmark tables spanning multiple techniques and datasets, exposes current shortcomings in multi-center validation, reproducibility, and clinical trust, and clarifies the strengths and limitations of existing methods. Conclusion: By linking technical progress with translational barriers, the paper outlines a path toward reproducible, privacy-preserving, and clinically deployable DR AI; the surveyed innovations also extend to large-scale medical image analysis. Abstract: Diabetic Retinopathy (DR) remains a leading cause of preventable blindness, with early detection critical for reducing vision loss worldwide. Over the past decade, deep learning has transformed DR screening, progressing from early convolutional neural networks trained on private datasets to advanced pipelines addressing class imbalance, label scarcity, domain shift, and interpretability. This survey provides the first systematic synthesis of DR research spanning 2016-2025, consolidating results from 50+ studies and over 20 datasets. We critically examine methodological advances, including self- and semi-supervised learning, domain generalization, federated training, and hybrid neuro-symbolic models, alongside evaluation protocols, reporting standards, and reproducibility challenges. Benchmark tables contextualize performance across datasets, while discussion highlights open gaps in multi-center validation and clinical trust. By linking technical progress with translational barriers, this work outlines a practical agenda for reproducible, privacy-preserving, and clinically deployable DR AI. Beyond DR, many of the surveyed innovations extend broadly to medical imaging at scale.

[120] S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

Jiechao Gao,Chang Liu,Yuangang Li

Main category: cs.CV

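TL;DR: Proposes S2D-Align, a new supervised fine-tuning paradigm that uses a shallow-to-deep, multi-stage alignment strategy with auxiliary signals of varying granularity to achieve anatomically grounded alignment in radiology report generation, significantly improving generation quality.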

Details

Motivation: Existing methods perform only instance-level alignment over image-text pairs and lack fine-grained alignment with anatomical structures, which limits report generation quality. Method: S2D-Align follows a shallow-to-deep strategy: it starts with coarse radiograph-report alignment, then introduces reference reports for instance-level guidance, and finally uses key phrases for fine-grained alignment with anatomical details; a memory-based adapter shares features across the stages. Result: The method achieves state-of-the-art performance on the MIMIC-CXR and IU X-Ray datasets, and ablation studies confirm the effectiveness of the auxiliary signals at each stage. Conclusion: Through multi-granularity auxiliary signals and a progressive alignment strategy, S2D-Align strengthens anatomical grounding in cross-modal generation and offers a new direction for medical report generation. Abstract: Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.

[121] Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image

Matthias Humt,Ulrich Hillenbrand,Rudolph Triebel

Main category: cs.CV

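TL;DR: Compares denoising diffusion probabilistic models and autoregressive causal transformers on generative shape modeling and completion. A diffusion model with continuous latents performs best on multi-modal shape completion, while on the same discrete latent space the autoregressive model can match or exceed the diffusion model.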

Details

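Motivation: There is no consensus on which generative model best suits which 3D generation task, and some forms of conditioning, such as partial 3D data, have not been thoroughly evaluated, so a systematic comparison of leading models on generation and completion is needed. Method: Denoising diffusion probabilistic models and autoregressive causal transformers are adapted to generative shape modeling and completion and analyzed through quantitative evaluation, comparison with a baseline, and ablation studies. Result: (1) The diffusion model with continuous latents outperforms both the discriminative model and the autoregressive model and reaches state-of-the-art performance on multi-modal shape completion from a single noisy depth image; (2) on the same discrete latent space, the autoregressive model can match or exceed the diffusion model. Conclusion: Diffusion models perform best in continuous latent spaces, while autoregressive models are competitive in discrete ones; model choice should weigh the properties of the latent space against task requirements. Abstract: While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditional information such as text and images to steer the generation process are frequently employed, whereas others, like partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models--Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers--which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison of both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.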

[122] Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids

Ke Ma,Yizhou Fang,Jean-Baptiste Weibel,Shuai Tan,Xinggang Wang,Yang Xiao,Yi Fang,Tian Xia

Main category: cs.CV

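TL;DR: Introduces Phys-Liquid, a physics-informed dataset of 97,200 simulated images with corresponding 3D meshes for improving geometry and volume estimation of transparent liquids in dynamic scenes, and validates its realism and effectiveness with a four-stage reconstruction pipeline.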

Details

Motivation: Transparent liquids produce complex optical effects and dynamic surface deformations when their containers move, and existing datasets lack physically realistic simulation data covering diverse dynamic scenarios, which makes it hard for robots to perceive liquid state accurately. Method: The authors build a physics-simulated dataset spanning multiple laboratory scenes, lighting conditions, liquid colors, and container rotations, and propose a four-stage reconstruction and estimation pipeline: liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling. Result: Experiments show more accurate and consistent reconstruction of liquid geometry and volume than existing benchmarks. Conclusion: The Phys-Liquid dataset and its validation methods support transparent liquid perception tasks and advance liquid state estimation for robotic manipulation. Abstract: Estimating the geometric and volumetric properties of transparent deformable liquids is challenging due to optical complexities and dynamic surface deformations induced by container movements. Autonomous robots performing precise liquid manipulation tasks, such as dispensing, aspiration, and mixing, must handle containers in ways that inevitably induce these deformations, complicating accurate liquid state assessment. Current datasets lack comprehensive physics-informed simulation data representing realistic liquid behaviors under diverse dynamic scenarios. To bridge this gap, we introduce Phys-Liquid, a physics-informed dataset comprising 97,200 simulation images and corresponding 3D meshes, capturing liquid dynamics across multiple laboratory scenes, lighting conditions, liquid colors, and container rotations. To validate the realism and effectiveness of Phys-Liquid, we propose a four-stage reconstruction and estimation pipeline involving liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling. Experimental results demonstrate improved accuracy and consistency in reconstructing liquid geometry and volume, outperforming existing benchmarks. The dataset and associated validation methods facilitate future advancements in transparent liquid perception tasks. The dataset and code are available at https://dualtransparency.github.io/Phys-Liquid/.

[123] SplineSplat: 3D Ray Tracing for Higher-Quality Tomography

Youssef Haouchat,Sepand Kashani,Aleix Boquet-Pujadas,Philippe Thévenaz,Michael Unser

Main category: cs.CV

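TL;DR: Proposes a method based on B-splines and a neural network for efficiently computing projections of 3D volumes; when the data are sufficient, it reconstructs with higher quality than traditional voxel-based methods.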

Details

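Motivation: To compute tomographic projections of 3D volumes more efficiently and accurately, and to overcome the reconstruction-quality limits of traditional voxel-based methods. Method: The 3D volume is represented as a linear combination of shifted B-splines, a ray-tracing algorithm computes 3D line integrals for arbitrary projection geometries, and a neural network efficiently computes the contributions of the basis functions. Result: In well-posed problems with sufficient data, the method reconstructs accurately without regularization, with higher quality than traditional voxel-based methods. Conclusion: Under these conditions the method noticeably improves reconstruction quality and offers a new route to efficient, high-quality image reconstruction. Abstract: We propose a method to efficiently compute tomographic projections of a 3D volume represented by a linear combination of shifted B-splines. To do so, we propose a ray-tracing algorithm that computes 3D line integrals with arbitrary projection geometries. One of the components of our algorithm is a neural network that computes the contribution of the basis functions efficiently. In our experiments, we consider well-posed cases where the data are sufficient for accurate reconstruction without the need for regularization. We achieve higher reconstruction quality than traditional voxel-based methods.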

[124] A Space-Time Transformer for Precipitation Forecasting

Levi Harris,Tianlong Chen

Main category: cs.CV

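TL;DR: Proposes SaTformer, a video-transformer weather model that forecasts extreme precipitation from satellite radiances. It uses full space-time attention and handles the long-tailed precipitation distribution by recasting regression as classification with a weighted loss; it took first place in the NeurIPS Weather4Cast 2025 challenge.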

Details

Motivation: Traditional numerical weather prediction models are limited by computational cost and by weak performance at nowcasting timescales (0-4 hours), and video-understanding architectures remain underexplored in existing AI weather prediction, so more efficient and accurate data-driven models are needed. Method: SaTformer is a video transformer built on full space-time attention; precipitation regression is recast as a classification task and trained with a class-weighted loss to mitigate label imbalance and better handle long-tailed extreme precipitation events. Result: SaTformer ranked first in the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge, showing strong skill at forecasting extreme precipitation. Conclusion: By combining a video transformer with training strategies for long-tailed data, SaTformer improves short-term precipitation forecasting accuracy and demonstrates the potential of data-driven methods in weather forecasting. Abstract: Meteorological agencies around the world rely on real-time flood guidance to issue life-saving advisories and warnings. For decades traditional numerical weather prediction (NWP) models have been state-of-the-art for precipitation forecasting. However, physically-parameterized models suffer from a few core limitations: first, solving PDEs to resolve atmospheric dynamics is computationally demanding, and second, these methods degrade in performance at nowcasting timescales (i.e., 0-4 hour lead-times). Motivated by these shortcomings, recent work proposes AI-weather prediction (AI-WP) alternatives that learn to emulate analysis data with neural networks. While these data-driven approaches have enjoyed enormous success across diverse spatial and temporal resolutions, applications of video-understanding architectures for weather forecasting remain underexplored. To address these gaps, we propose SaTformer: a video transformer built on full space-time attention that skillfully forecasts extreme precipitation from satellite radiances. Along with our novel architecture, we introduce techniques to tame long-tailed precipitation datasets. Namely, we reformulate precipitation regression into a classification problem, and employ a class-weighted loss to address label imbalances. Our model scored first place on the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge. Code and model weights are available: https://github.com/leharris3/satformer
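
The regression-to-classification reformulation with a class-weighted loss, as described in the SaTformer abstract, might look roughly like the following. This is a minimal sketch under stated assumptions, not the released code: the bin edges in `BIN_EDGES`, the inverse-frequency weighting scheme, and all function names are hypothetical, since the abstract does not specify them.

```python
import bisect
import math
from collections import Counter

# Hypothetical rainfall thresholds (mm) defining 5 ordered classes.
BIN_EDGES = [0.1, 1.0, 5.0, 20.0]


def to_class(rain_mm):
    """Map a continuous rainfall amount to the index of its bin."""
    return bisect.bisect_right(BIN_EDGES, rain_mm)


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by total / (num_classes * count); unseen classes get 0.0."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, label, weights):
    """Class-weighted negative log-likelihood of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return -weights[label] * (logits[label] - log_sum)


if __name__ == "__main__":
    labels = [to_class(r) for r in [0.0, 0.0, 0.05, 50.0]]
    weights = inverse_frequency_weights(labels, num_classes=5)
    print(labels, weights)
    print(weighted_cross_entropy([0.0] * 5, label=4, weights=weights))
```

With this weighting, a mistake on the rare heavy-rain class costs more than a mistake on the common dry class, which is the purpose of a class-weighted loss on long-tailed labels.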

[125] Machine-Learning Based Detection of Coronary Artery Calcification Using Synthetic Chest X-Rays

Dylan Saeed,Ramtin Gharleghi,Susann Bier,Sonit Singh

Main category: cs.CV

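TL;DR: Presents the first systematic evaluation of digitally reconstructed radiographs (DRRs) as a surrogate training domain for coronary artery calcification (CAC) detection. Synthetic DRRs are generated from CT scans and several training strategies are explored, reaching competitive detection performance without relying on labeled real chest X-rays.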

Details

Motivation: CT is too costly for large-scale screening, while chest X-rays lack reliable labels, which limits deep learning; a scalable alternative with precise labels is needed to advance CAC detection. Method: Precisely labeled DRRs are generated from 667 CT scans, and models are trained and tuned with lightweight CNNs, super-resolution, contrast enhancement, and curriculum learning. Result: The best configuration reaches a mean AUC of 0.754 on CAC detection, comparable to or exceeding prior chest X-ray-based studies. Conclusion: DRRs are a scalable, label-rich source of training data for CAC detection and lay the groundwork for future transfer learning and domain adaptation to real chest X-rays. Abstract: Coronary artery calcification (CAC) is a strong predictor of cardiovascular events, with CT-based Agatston scoring widely regarded as the clinical gold standard. However, CT is costly and impractical for large-scale screening, while chest X-rays (CXRs) are inexpensive but lack reliable ground truth labels, constraining deep learning development. Digitally reconstructed radiographs (DRRs) offer a scalable alternative by projecting CT volumes into CXR-like images while inheriting precise labels. In this work, we provide the first systematic evaluation of DRRs as a surrogate training domain for CAC detection. Using 667 CT scans from the COCA dataset, we generate synthetic DRRs and assess model capacity, super-resolution fidelity enhancement, preprocessing, and training strategies. Lightweight CNNs trained from scratch outperform large pretrained networks; pairing super-resolution with contrast enhancement yields significant gains; and curriculum learning stabilises training under weak supervision. Our best configuration achieves a mean AUC of 0.754, comparable to or exceeding prior CXR-based studies. These results establish DRRs as a scalable, label-rich foundation for CAC detection, while laying the groundwork for future transfer learning and domain adaptation to real CXRs.

[126] Detection of Bark Beetle Attacks using Hyperspectral PRISMA Data and Few-Shot Learning

Mattia Ferrari,Giancarlo Papitto,Giorgio Deligios,Lorenzo Bruzzone

Main category: cs.CV

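TL;DR: Proposes a few-shot learning approach based on contrastive learning that detects bark beetle infestations from PRISMA hyperspectral satellite data.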

Details

Motivation: Bark beetle infestations threaten the health of coniferous forests, and conventional monitoring methods depend on large amounts of labeled data, making them hard to apply where data are scarce. Method: A one-dimensional CNN encoder is pre-trained with contrastive learning to extract robust features from hyperspectral data; support vector regression models, trained on a few labeled samples per class, then estimate the per-pixel proportions of tree health states. Result: Experiments in the Dolomites show that the method outperforms approaches using the original PRISMA bands or Sentinel-2 data and estimates the proportions of healthy, attacked, and dead trees more accurately. Conclusion: PRISMA hyperspectral data combined with a few-shot learning framework offers clear advantages for forest health monitoring and suits practical settings with limited labeled samples. Abstract: Bark beetle infestations represent a serious challenge for maintaining the health of coniferous forests. This paper proposes a few-shot learning approach leveraging contrastive learning to detect bark beetle infestations using satellite PRISMA hyperspectral data. The methodology is based on a contrastive learning framework to pre-train a one-dimensional CNN encoder, enabling the extraction of robust feature representations from hyperspectral data. These extracted features are subsequently utilized as input to support vector regression estimators, one for each class, trained on a few labeled samples to estimate the proportions of healthy, bark-beetle-attacked, and dead trees for each pixel. Experiments on the area of study in the Dolomites show that our method outperforms the use of original PRISMA spectral bands and of Sentinel-2 data. The results indicate that PRISMA hyperspectral data combined with few-shot learning offers significant advantages for forest health monitoring.

[127] VIDEOP2R: Video Understanding from Perception to Reasoning

Yifan Jiang,Yueying Wang,Rui Zhao,Toufiq Parag,Zhimin Chen,Zhenyu Liao,Jayakrishnan Unnikrishnan

Main category: cs.CV

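TL;DR: Proposes VideoP2R, a two-stage reinforcement fine-tuning framework for video that models perception and reasoning separately. With a high-quality process-aware dataset and the new PA-GRPO algorithm, it reaches state-of-the-art performance on multiple video understanding benchmarks.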

Details

Motivation: Extending existing reinforcement fine-tuning (RFT) methods to large video language models (LVLMs) is challenging, in particular because effective mechanisms for modeling perception and reasoning in video are lacking. Method: In the supervised fine-tuning stage, VideoP2R builds a three-step process-aware pipeline covering perception and reasoning that generates the VideoP2R-CoT-162K dataset; in the reinforcement learning stage, the PA-GRPO algorithm provides separate reward signals for perception and reasoning. Result: The framework reaches state-of-the-art (SotA) performance on six of seven video reasoning and understanding benchmarks; ablations confirm the effectiveness of process-aware modeling and PA-GRPO and show that the model's perception output is sufficient to support downstream reasoning. Conclusion: By explicitly separating perception from reasoning, VideoP2R markedly improves LVLM video understanding and offers an effective process-aware reinforcement fine-tuning paradigm for training future video language models. Abstract: Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.

[128] Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions

Redwan Hussain,Mizanur Rahman,Prithwiraj Bhattacharjee

Main category: cs.CV

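TL;DR: Reviews 24 recent studies on detecting AI-generated media, identifies the limits of existing methods in cross-model generalization, multimodal data, and heavily modified content, and proposes multimodal deep learning as a research direction for more robust and generalizable detection.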

Details

Motivation: With advances in generative adversarial networks and diffusion models, AI-generated content has become increasingly realistic and hard to distinguish from real content; misuse of deepfakes leads to misinformation, privacy violations, and other harms, so effective detection methods are urgently needed. Method: A systematic review of 24 recent studies on AI-generated media detection analyzes each study's contributions and weaknesses and summarizes the challenges shared by current methods, especially in cross-model generalization, multimodal processing, and adversarial modification. Result: The review identifies key limitations of existing detectors in generalization, multimodal data handling, and heavily modified content, and finds that CNN- and ViT-based methods, although widely used, lose performance on new generative models. Conclusion: Multimodal deep learning models promise more robust and general detection, and future research should focus on this direction to build stronger defenses against harmful synthetic media. Abstract: Artificial intelligence (AI) in media has advanced rapidly over the last decade. The introduction of Generative Adversarial Networks (GANs) improved the quality of photorealistic image generation. Diffusion models later brought a new era of generative media. These advances made it difficult to separate real and synthetic content. The rise of deepfakes demonstrated how these tools could be misused to spread misinformation, political conspiracies, privacy violations, and fraud. For this reason, many detection models have been developed. They often use deep learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models search for visual, spatial, or temporal anomalies. However, such approaches often fail to generalize across unseen data and struggle with content from different models. In addition, existing approaches are ineffective on multimodal data and highly modified content. This study reviews twenty-four recent works on AI-generated media detection. Each study was examined individually to identify its contributions and weaknesses. The review then summarizes the common limitations and key challenges faced by current approaches. Based on this analysis, a research direction is suggested with a focus on multimodal deep learning models. Such models have the potential to provide more robust and generalized detection. This offers future researchers a clear starting point for building stronger defenses against harmful synthetic media.

[129] Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model

Xinyue Zhang,Haolong Li,Jiawei Ma,Chen Ye

Main category: cs.CV

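TL;DR: Proposes a Large Vectorized Glyph Model (LVGM) that generates vectorized Chinese characters by predicting the next stroke, fine-tuned from the DeepSeek large language model, and releases a large-scale Chinese SVG dataset with 907,267 samples.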

Details

Motivation: Vectorized glyphs are widely used in poster design, web animation, and art display because they are scalable and flexible, and the authors aim to use the sequence prediction ability of large language models to generate high-quality vectorized Chinese glyphs. Method: Strokes are encoded as discrete latent variables (stroke embeddings), and LVGM is trained by fine-tuning the DeepSeek LLM to predict the next stroke embedding. Result: Given only a few strokes, the model generates complete characters, semantically elegant words, and even unseen verses in vectorized form, and it scales with data size; the generated glyphs were validated by experts and relevant individuals. Conclusion: LVGM combines large language models with vectorized glyph generation and shows the potential of stroke prediction for producing high-quality vectorized Chinese glyphs; the released dataset is an important resource for follow-up research. Abstract: Vectorized glyphs are widely used in poster design, network animation, art display, and various other fields due to their scalability and flexibility. In typography, they are often seen as special sequences composed of ordered strokes. This concept extends to the token sequence prediction abilities of large language models (LLMs), enabling vectorized character generation through stroke modeling. In this paper, we propose a novel Large Vectorized Glyph Model (LVGM) designed to generate vectorized Chinese glyphs by predicting the next stroke. Initially, we encode strokes into discrete latent variables called stroke embeddings. Subsequently, we train our LVGM via fine-tuning DeepSeek LLM by predicting the next stroke embedding. With limited strokes given, it can generate complete characters, semantically elegant words, and even unseen verses in vectorized form. Moreover, we release a new large-scale Chinese SVG dataset containing 907,267 samples based on strokes for dynamically vectorized glyph generation. Experimental results show that our model exhibits scaling behavior with data size. Our generated vectorized glyphs have been validated by experts and relevant individuals.

[130] Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering

Yu Zhao,Ying Zhang,Xuhui Sui,Baohang Zhou,Li Shen,Dacheng Tao

Main category: cs.CV

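TL;DR: Proposes the Hindsight Distilled Reasoning (HinD) framework, combined with Knowledge Encouragement Preference Optimization (KEPO), to elicit and exploit the internal knowledge reasoning ability of multimodal large language models, significantly improving knowledge-based visual question answering (KBVQA).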

Details

Motivation: Existing KBVQA methods rely on implicit or explicit knowledge but lack an explicit multi-step reasoning process, and knowledge correctness is misaligned with model confidence. Method: (1) A frozen 7B MLLM generates hindsight reasoning paths to build the Hindsight-Zero training data; (2) self-distillation produces a Chain-of-Thought (CoT) Generator and a Knowledge Generator; (3) KEPO optimizes the Knowledge Generator to prefer under-confident but helpful knowledge. Result: Experiments on OK-VQA and A-OKVQA validate HinD, which achieves superior performance with only a 7B MLLM and without commercial APIs or external knowledge. Conclusion: HinD elicits the internal knowledge reasoning ability of MLLMs, improves KBVQA performance, and addresses both reasoning transparency and the knowledge-confidence misalignment. Abstract: Knowledge-based Visual Question Answering (KBVQA) necessitates external knowledge incorporation beyond cross-modal understanding. Existing KBVQA methods either utilize implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness internal knowledge reasoning ability in MLLMs. First, to tackle the reasoning supervision problem, we propose to emphasize the hindsight wisdom of the MLLM by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground truth answer, constructing Hindsight-Zero training data. Then we self-distill Hindsight-Zero into a Chain-of-Thought (CoT) Generator and a Knowledge Generator, enabling the generation of sequential steps and discrete facts. Second, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over the over-confident but unhelpful one. The generated CoT and sampled knowledge are then exploited for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with elicited reasoning from a 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.

[131] OT-ALD: Aligning Latent Distributions with Optimal Transport for Accelerated Image-to-Image Translation

Zhanpeng Wang,Shuting Cao,Yuhang Lu,Yuhan Li,Na Lei,Zhongxuan Luo

Main category: cs.CV

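TL;DR: Proposes OT-ALD, a framework based on optimal transport theory that addresses the inefficiency and latent distribution mismatch of DDIB in image-to-image translation, improving both translation efficiency and quality.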

Details

Motivation: DDIB suffers from low translation efficiency and from trajectory deviations caused by mismatched latent distributions, which limit its performance. Method: Using optimal transport (OT) theory, the method computes an OT map from the source-domain latent distribution to the target-domain latent distribution and uses the mapped distribution as the starting point for denoising in the target domain. Result: On four tasks across three high-resolution datasets, sampling efficiency improves by 20.29% and the FID score drops by 2.6 on average relative to the best baseline models. Conclusion: OT-ALD resolves the latent distribution mismatch in DDIB and improves translation efficiency and generation quality while keeping DDIB's flexibility. Abstract: The Dual Diffusion Implicit Bridge (DDIB) is an emerging image-to-image (I2I) translation method that preserves cycle consistency while achieving strong flexibility. It links two independently trained diffusion models (DMs) in the source and target domains by first adding noise to a source image to obtain a latent code, then denoising it in the target domain to generate the translated image. However, this method faces two key challenges: (1) low translation efficiency, and (2) translation trajectory deviations caused by mismatched latent distributions. To address these issues, we propose a novel I2I translation framework, OT-ALD, grounded in optimal transport (OT) theory, which retains the strengths of the DDIB-based approach. Specifically, we compute an OT map from the latent distribution of the source domain to that of the target domain, and use the mapped distribution as the starting point for the reverse diffusion process in the target domain. Our error analysis confirms that OT-ALD eliminates latent distribution mismatches. Moreover, OT-ALD effectively balances faster image translation with improved image quality. Experiments on four translation tasks across three high-resolution datasets show that OT-ALD improves sampling efficiency by 20.29% and reduces the FID score by 2.6 on average compared to the top-performing baseline models.
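
To make the notion of an OT map between two latent distributions concrete, here is a one-dimensional toy sketch. It is an illustration only, not the OT-ALD procedure: the abstract does not say how the map is computed, and real latents are high-dimensional. In 1D with equally many samples on each side, the optimal map under a convex cost such as squared distance is the monotone (sorted) matching, which is what the hypothetical function `ot_map_1d` implements.

```python
def ot_map_1d(source, target):
    """Map each source sample to a target sample by monotone (sorted) matching.

    For equal-size 1D empirical distributions and a convex transport cost,
    pairing the k-th smallest source value with the k-th smallest target
    value is an optimal transport plan.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes equally many samples on both sides")
    # Indices of source samples ordered from smallest to largest value.
    order = sorted(range(len(source)), key=lambda i: source[i])
    sorted_target = sorted(target)
    mapped = [0.0] * len(source)
    for rank, i in enumerate(order):
        mapped[i] = sorted_target[rank]
    return mapped


if __name__ == "__main__":
    source_latents = [0.3, -1.0, 2.0]   # toy source-domain latent codes
    target_latents = [10.0, 12.0, 11.0]  # toy target-domain latent codes
    print(ot_map_1d(source_latents, target_latents))
```

The mapped values follow the target distribution exactly while preserving the ordering of the source samples, which is the sense in which the mapped codes give a better-matched starting point for denoising in the target domain.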

[132] Reverberation: Learning the Latencies Before Forecasting Trajectories

Conghao Wong,Ziqian Zou,Beihao Xia,Xinge You

Main category: cs.CV

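TL;DR: Proposes a reverberation transform inspired by acoustic reverberation curves, together with the Reverberation (Rev) trajectory prediction model, which uses learnable reverberation kernels to explicitly model agents' latency preferences, and their stochasticity, when responding to trajectory-changing events, enabling controllable and interpretable trajectory prediction.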

Details

Motivation: Existing trajectory prediction methods struggle to explicitly model latency, the temporal delay with which agents respond to trajectory-changing events. Because agents differ in how long they take to notice, process, and react, ignoring latency can undermine the causal continuity and plausibility of predictions, so a method that captures these latency preferences is needed. Method: Inspired by acoustic reverberation curves, the authors propose a reverberation transform and the Rev model, which uses two explicit, learnable reverberation kernels to simulate each agent's latency preferences and their stochasticity, enabling controllable trajectory generation based on predicted latencies. Result: Experiments on multiple pedestrian and vehicle datasets show that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios; qualitative analyses confirm the effectiveness of the reverberation transform. Conclusion: Rev offers a novel, interpretable framework for modeling latency in trajectory prediction and has potential as a general latency modeling approach. Abstract: Bridging the past to the future, connecting agents both spatially and temporally, lies at the core of the trajectory prediction task. Despite great efforts, it remains challenging to explicitly learn and predict latencies, the temporal delays with which agents respond to different trajectory-changing events and adjust their future paths, whether on their own or interactively. Different agents may exhibit distinct latency preferences for noticing, processing, and reacting to any specific trajectory-changing event. The lack of consideration of such latencies may undermine the causal continuity of the forecasting system and also lead to implausible or unintended trajectories. Inspired by the reverberation curves in acoustics, we propose a new reverberation transform and the corresponding Reverberation (Rev for short) trajectory prediction model, which simulates and predicts different latency preferences of each agent as well as their stochasticity by using two explicit and learnable reverberation kernels, allowing for controllable trajectory prediction based on these forecasted latencies. Experiments on multiple datasets, whether pedestrians or vehicles, demonstrate that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses further verify the properties of the proposed reverberation transform, highlighting its potential as a general latency modeling approach.

[133] Explainable Deep Convolutional Multi-Type Anomaly Detection

Alex George,Lyudmila Mihaylova,Sean Anderson

Main category: cs.CV

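TL;DR: Proposes MultiTypeFCDD, a lightweight convolutional framework for explainable multi-type anomaly detection. Using only image-level labels, it produces multi-channel heatmaps with one channel per anomaly type and distinguishes anomaly types across multiple object categories with a single model rather than one model per category. On the Real-IAD dataset it performs close to state-of-the-art models with fewer parameters and faster inference.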

Details

Motivation: Existing explainable anomaly detection methods struggle to distinguish anomaly types and require a separate, costly model per object category; recent large vision-language models partly address this but are too compute- and memory-intensive for real-time or embedded deployment. Method: MultiTypeFCDD is a lightweight convolutional framework trained with image-level labels to produce multi-channel heatmaps in which each channel corresponds to one anomaly type, so a single model identifies and localizes multiple anomaly types across categories. Result: On the Real-IAD dataset the method performs on par with current complex models while substantially reducing parameter count and inference time. Conclusion: MultiTypeFCDD is an efficient, practical approach to multi-type anomaly detection for resource-constrained applications, distinguishing anomaly types in an explainable way while maintaining high accuracy. Abstract: Most explainable anomaly detection methods identify anomalies but lack the capability to differentiate the type of anomaly. Furthermore, they often require the costly training and maintenance of separate models for each object category. The lack of specificity is a significant research gap, as identifying the type of anomaly (e.g., "Crack" vs. "Scratch") is crucial for accurate diagnosis that facilitates cost-saving operational decisions across diverse application domains. While some recent large-scale Vision-Language Models (VLMs) have begun to address this, they are computationally intensive and memory-heavy, restricting their use in real-time or embedded systems. We propose MultiTypeFCDD, a simple and lightweight convolutional framework designed as a practical alternative for explainable multi-type anomaly detection. MultiTypeFCDD uses only image-level labels to learn and produce multi-channel heatmaps, where each channel is trained to correspond to a specific anomaly type. The model functions as a single, unified framework capable of differentiating anomaly types across multiple object categories, eliminating the need to train and manage separate models for each object category. We evaluated our proposed method on the Real-IAD dataset, where it delivers results competitive with state-of-the-art complex models at significantly reduced parametric load and inference times. This makes it a highly practical and viable solution for real-world applications where computational resources are tightly constrained.

[134] CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios

Hangyu Li,Bofeng Cao,Zhaohui Liang,Wuzhen Li,Juyoung Oh,Yuxuan Chen,Shixiao Liang,Hang Zhou,Chengyuan Ma,Jiaxi Liu,Zheng Li,Peng Zhang,KeKe Long,Maolin Liu,Jackson Jiang,Chunlei Yu,Shengxiang Liu,Hongkai Yu,Xiaopeng Li

Main category: cs.CV

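TL;DR: Introduces CATS-V2V, the first real-world dataset for vehicle-to-vehicle cooperative perception in complex adverse traffic scenarios, with multi-modal sensor data and high-precision annotations, intended to advance autonomous driving in challenging environments.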

Details

Motivation: Existing autonomous driving datasets focus on ordinary traffic scenarios and lack data from complex adverse traffic scenarios (CATS), which holds back cooperative perception; a high-quality, real-world V2V cooperative perception dataset is needed. Method: Data were collected by two hardware time-synchronized vehicles across 10 weather and lighting conditions and 10 locations. The dataset provides LiDAR point clouds, multi-view images, and high-precision GNSS/IMU data, and a target-based temporal alignment method yields time-consistent 3D annotations across sensor modalities and supports construction of a 4D BEV representation. Result: The released CATS-V2V dataset contains 100 clips, 60K LiDAR frames, 1.26M images, and 750K high-precision positioning records; the authors describe it as the largest-scale, most supportive, and highest-quality dataset of its kind to date. Conclusion: CATS-V2V fills the gap in V2V cooperative perception data for complex adverse traffic scenarios and is expected to improve the perception of autonomous driving systems in challenging environments. Abstract: Vehicle-to-Vehicle (V2V) cooperative perception has great potential to enhance autonomous driving performance by overcoming perception limitations in complex adverse traffic scenarios (CATS). Meanwhile, data serves as the fundamental infrastructure for modern autonomous driving AI. However, due to stringent data collection requirements, existing datasets focus primarily on ordinary traffic scenarios, constraining the benefits of cooperative perception. To address this challenge, we introduce CATS-V2V, the first-of-its-kind real-world dataset for V2V cooperative perception under complex adverse traffic scenarios. The dataset was collected by two hardware time-synchronized vehicles, covering 10 weather and lighting conditions across 10 diverse locations. The 100-clip dataset includes 60K frames of 10 Hz LiDAR point clouds and 1.26M multi-view 30 Hz camera images, along with 750K anonymized yet high-precision RTK-fixed GNSS and IMU records. Correspondingly, we provide time-consistent 3D bounding box annotations for objects, as well as static scenes to construct a 4D BEV representation. On this basis, we propose a target-based temporal alignment method, ensuring that all objects are precisely aligned across all sensor modalities. We hope that CATS-V2V, the largest-scale, most supportive, and highest-quality dataset of its kind to date, will benefit the autonomous driving community in related tasks.

[135] Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA

Ayush Pandey,Jai Bardhan,Ishita Jain,Ramya S Hebbalaguppe,Rohan Raju Dhanakshirur,Lovekesh Vig

Main category: cs.CV

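TL;DR: Proposes AlignVQA, a multi-agent debate framework for improving the calibration of confidence estimates in visual question answering (VQA). Specialized vision-language models with different prompting strategies generate answers, which generalist agents critique, refine, and aggregate; combined with the proposed differentiable calibration-aware loss, aligncal, the approach significantly reduces calibration error.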

Details

Motivation: Modern VQA systems are often overconfident: their confidence does not match their actual accuracy. This undermines decision reliability in high-stakes domains such as medical diagnostics and autonomous driving, so their calibration needs to improve. Method: In AlignVQA, multiple specialized VLMs, each using a different prompting strategy, generate candidate answers, and generalist agents then carry out a two-stage interaction (critique and aggregation). A differentiable calibration-aware loss, aligncal, fine-tunes the specialized agents by minimizing an upper bound on the calibration error. Result: Experiments on multiple VQA benchmark datasets show substantially reduced calibration error and more accurate confidence estimates; better-calibrated specialized agents produce better-aligned confidence outputs. Conclusion: Through multi-agent debate and calibration-aware training, AlignVQA improves confidence calibration in VQA systems and makes autonomous decisions under visual uncertainty more reliable. Abstract: In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system's confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. In particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs, each following a distinct prompting strategy, generate candidate answers and then engage in a two-stage interaction: generalist agents critique, refine, and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model's true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration-aware loss function called aligncal, designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent's confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
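As a rough illustration of what "calibration error" measures in this entry, the sketch below computes the standard binned expected calibration error (ECE). This is not the paper's aligncal loss, which the abstract describes as a differentiable upper bound on calibration error; the function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: booleans (or 0/1) of equal length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # bins[i] collects (confidence, correctness) pairs falling in bin i.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

An overconfident system (say, 90% stated confidence but only half of the answers correct) gets a large ECE, while a system whose stated confidence matches its hit rate gets an ECE near zero.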

[136] Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos

Zhixin Xu,Hengyu Zhou,Yuan Liu,Wenhan Xue,Hao Pan,Wenping Wang,Bin Wang

Main category: cs.CV

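TL;DR: Proposes a coarse-to-fine temporal alignment method for 4D Gaussian Splatting reconstruction from unsynchronized multi-view videos, improving dynamic scene reconstruction quality.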

Details

Motivation: Existing 4D Gaussian Splatting methods assume temporally synchronized multi-view videos, but in practice camera delays and similar factors cause temporal misalignment that degrades reconstruction. Method: A coarse-to-fine temporal alignment module first estimates a coarse frame-level offset and then refines it to sub-frame accuracy, compensating for each camera's time shift; it can be integrated into existing 4DGS frameworks. Result: Experiments show that the method handles temporally misaligned videos effectively and significantly improves the reconstruction quality of baseline methods. Conclusion: The proposed temporal alignment strategy makes 4DGS more robust to unsynchronized input and suits multi-view dynamic reconstruction in real-world scenes. Abstract: Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive capabilities in dynamic scene reconstruction, they typically rely on the assumption that input video streams are temporally synchronized. However, in real-world scenarios, this assumption often fails due to factors like camera trigger delays or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, a novel temporal alignment strategy is proposed for high-quality 4DGS reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera's time shift. The method first determines a coarse, frame-level offset and then refines it to achieve sub-frame accuracy. This strategy can be integrated as a module into existing 4DGS frameworks, enhancing their robustness when handling asynchronous data. Experiments show that our approach effectively processes temporally misaligned videos and significantly enhances baseline methods.
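The coarse-then-fine idea can be sketched on a toy problem. The abstract does not say how the offsets are estimated inside 4DGS, so everything below is an assumption for illustration: each camera is reduced to a hypothetical 1D per-frame signal, the coarse step is a brute-force search over integer frame shifts, and the fine step is parabolic interpolation of the matching cost around the best integer shift.

```python
def coarse_to_fine_offset(ref, other, max_shift):
    """Estimate d (in frames, possibly fractional) such that other[t] ~ ref[t + d]."""
    if max_shift >= min(len(ref), len(other)):
        raise ValueError("max_shift must be smaller than both signal lengths")

    def cost(shift):
        # Mean squared difference over the frames where both signals overlap.
        pairs = [(ref[t + shift], other[t])
                 for t in range(len(other)) if 0 <= t + shift < len(ref)]
        return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

    # Coarse stage: best integer (frame-level) shift.
    shifts = list(range(-max_shift, max_shift + 1))
    costs = [cost(s) for s in shifts]
    best = min(range(len(shifts)), key=costs.__getitem__)
    coarse = shifts[best]

    # Fine stage: fit a parabola through the three costs around the minimum.
    if 0 < best < len(shifts) - 1:
        c_left, c_mid, c_right = costs[best - 1], costs[best], costs[best + 1]
        curvature = c_left - 2.0 * c_mid + c_right
        if curvature > 0:
            return coarse + 0.5 * (c_left - c_right) / curvature
    return float(coarse)
```

In a real pipeline the refinement would more plausibly be optimized jointly with the scene representation; the parabola fit here only shows why a frame-level estimate can be sharpened to sub-frame precision.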

Quoc-Huy Trinh,Mustapha Abdullahi,Do Duy Hung Trinh,Bo Zhao,Debesh Jha

Main category: cs.CV

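TL;DR: Proposes Viper-F1, a hybrid state-space vision-language model that achieves efficient, fine-grained visual understanding through Liquid State-Space Dynamics and a lightweight Token-Grid Correlation Module.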

Details

Motivation: Existing vision-language models rely on Transformer cross-attention, which is computationally expensive, and they fall short on fine-grained visual grounding, which limits their use in resource-constrained settings. Method: Liquid State-Space Dynamics replace attention to reduce computational complexity, and a Token-Grid Correlation Module modulates the state-space dynamics via FiLM conditioning to strengthen the alignment between text and image regions. Result: Across multiple benchmarks, Viper-F1 delivers more efficient fine-grained vision-language understanding while keeping inference linear-time. Conclusion: Viper-F1 strikes a good balance between efficiency and accuracy and suits resource-constrained real-world applications. Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks, which limits their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.

[138] A Comparison of Lightweight Deep Learning Models for Particulate-Matter Nowcasting in the Indian Subcontinent & Surrounding Regions

Ansh Kushwaha,Kaushik Gopalan

Main category: cs.CV

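TL;DR: Presents an efficient, lightweight deep learning framework for 6-hour nowcasting of PM1, PM2.5, and PM10 over the Indian subcontinent and surrounding regions, trained and validated on CAMS data, with substantially better accuracy than the Aurora foundation model.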

Details

Motivation: Short-range air-quality forecasting over a limited spatial domain needs to be both accurate and computationally efficient, especially over the heavily polluted and densely populated Indian subcontinent, which calls for efficient, low-bias specialized models. Method: Analysis fields from the CAMS Global Atmospheric Composition Forecasts (0.4 degree resolution) serve as input to three lightweight, parameter-specific neural network architectures; the input is a 256x256 region and the output is the central 128x128 India-centric area. Models are trained on 2021-2023 data and evaluated independently on 2024 data. Result: The proposed models clearly outperform the Aurora foundation model on RMSE, MAE, Bias, and SSIM, with lower systematic bias, higher accuracy, and fast inference. Conclusion: Compact deep learning models specialized for a region and pollutant outperform general-purpose large models on short-range nowcasting and have practical potential. Abstract: This paper is a submission for the Weather4Cast 2025 complementary Pollution Task and presents an efficient framework for 6-hour lead-time nowcasting of PM$_1$, PM$_{2.5}$, and PM$_{10}$ across the Indian subcontinent and surrounding regions. The proposed approach leverages analysis fields from the Copernicus Atmosphere Monitoring Service (CAMS) Global Atmospheric Composition Forecasts at 0.4 degree resolution. A 256x256 spatial region, covering 28.4S-73.6N and 32E-134.0E, is used as the model input, while predictions are generated for the central 128x128 area spanning 2.8S-48N and 57.6E-108.4E, ensuring an India-centric forecast domain with sufficient synoptic-scale context. Models are trained on CAMS analyses from 2021-2023 using a shuffled 90/10 split and independently evaluated on 2024 data. Three lightweight parameter-specific architectures are developed to improve accuracy, minimize systematic bias, and enable rapid inference. Evaluation using RMSE, MAE, Bias, and SSIM demonstrates substantial performance gains over the Aurora foundation model, underscoring the effectiveness of compact & specialized deep learning models for short-range forecasts on limited spatial domains.
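The input and output domains quoted in the abstract are consistent with a plain center crop on the 0.4 degree grid: each edge of the 128x128 output lies 64 cells inside the 256x256 input. The sketch below checks that arithmetic. The helper names and the south-to-north, west-to-east index ordering are assumptions for illustration; the actual CAMS array layout may differ.

```python
def lat_lon_to_index(lat, lon, lat_south=-28.4, lon_west=32.0, res=0.4):
    """Map a coordinate to (row, col) on the 256x256 input grid.

    Assumes row 0 is the southern edge (28.4S) and col 0 the western edge (32E).
    """
    row = round((lat - lat_south) / res)
    col = round((lon - lon_west) / res)
    return row, col


def center_crop(grid, size):
    """Return the central size x size window of a square 2D list."""
    n = len(grid)
    start = (n - size) // 2
    return [row[start:start + size] for row in grid[start:start + size]]


# Corners of the forecast domain quoted in the abstract (2.8S-48N, 57.6E-108.4E).
south_west = lat_lon_to_index(-2.8, 57.6)
north_east = lat_lon_to_index(48.0, 108.4)
```

Under these assumptions `south_west` evaluates to `(64, 64)` and `north_east` to `(191, 191)`, i.e. exactly the window that `center_crop(grid, 128)` selects from a 256x256 grid.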

[139] Computationally-efficient deep learning models for nowcasting of precipitation: A solution for the Weather4cast 2025 challenge

Anushree Bhuskute,Kaushik Gopalan,Jeet Shah

Main category: cs.CV

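TL;DR: Proposes a ConvGRU-based transfer-learning framework for short-term rainfall prediction from single-channel SEVIRI infrared data, which performed strongly in the Weather4Cast 2025 competition.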

Details

Motivation: Improve short-term rainfall prediction accuracy, in particular by capturing spatiotemporal evolution when only a single infrared band is available. Method: A two-stage training strategy: in the first stage a ConvGRU forecasts SEVIRI brightness temperatures; in the second, an empirical nonlinear transformation converts the predictions into OPERA-compatible rainfall rates. Precipitation events are then identified with 3D event detection and spatiotemporal feature extraction. Result: 2nd place in the cumulative rainfall task; in the event-prediction task, performance comparable to the competition baseline. Conclusion: The method makes effective use of a single infrared channel for rainfall nowcasting and generalizes well across tasks. Abstract: This study presents a transfer-learning framework based on Convolutional Gated Recurrent Units (ConvGRU) for short-term rainfall prediction in the Weather4Cast 2025 competition. A single SEVIRI infrared channel (10.8 μm wavelength) is used as input, which consists of four observations over a one-hour period. A two-stage training strategy is applied to generate rainfall estimates up to four hours ahead. In the first stage, ConvGRU is trained to forecast the brightness temperatures from SEVIRI, enabling the model to capture relevant spatiotemporal patterns. In the second stage, an empirically derived nonlinear transformation maps the predicted fields to OPERA-compatible rainfall rates. For the event-prediction task, the transformed rainfall forecasts are processed using 3D event detection followed by spatiotemporal feature extraction to identify and characterize precipitation events. Our submission achieved 2nd place in the cumulative rainfall task. Further, the same model was used out-of-the-box for the event prediction task and achieved scores similar to those of the competition's baseline model.

[140] Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery

Shambhavi Shanker,Manikandan Padmanaban,Jagabondhu Hazra

Main category: cs.CV

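TL;DR: Proposes a visual question answering (VQA) framework that combines chain-of-thought (CoT) reasoning with Direct Preference Optimization (DPO) to improve geospatial reasoning on satellite imagery, particularly for climate-related applications.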

Details

Motivation: Existing VQA models lack the structured reasoning needed for complex geospatial queries and struggle to meet the needs of high-stakes climate applications such as disaster monitoring and urban resilience planning. Method: Chain-of-thought (CoT) reasoning is combined with Direct Preference Optimization (DPO); generating intermediate reasoning steps (rationales) strengthens the model on detection, classification, spatial relations, and comparative analysis. Result: Experiments show that CoT supervision improves accuracy by 34.9% over direct-prediction baselines, and DPO further improves accuracy and reasoning quality. Conclusion: The framework improves the interpretability, robustness, and accuracy of VQA on multispectral Earth observation imagery and advances geospatial decision support for complex climate applications. Abstract: Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.

[141] Questioning the Stability of Visual Question Answering

Amir Rosenfeld,Neta Glazer,Ethan Fetaya

Main category: cs.CV

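TL;DR: Presents the first systematic study of the robustness of vision-language models (VLMs) to small, meaning-preserving visual and textual perturbations. Modern VLMs prove unstable under pixel-level shifts, geometric transformations, paraphrasing, and similar changes, and even state-of-the-art models fail frequently. The paper also shows that sample-level stability is a strong indicator of correctness and can be used to predict whether larger models answer correctly.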

Details

Motivation: Despite the remarkable progress of VLMs, their reliability under small, semantics-preserving input changes remains unclear, and their robustness in realistic settings needs to be assessed. Method: Large-scale experiments across many models and datasets test VLMs under meaning-preserving transformations (pixel-level perturbations, geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites) and analyze the stability of their predictions. Result: Modern VLMs are highly sensitive to minor perturbations: many samples change their answer under at least one perturbation. Stability correlates strongly with correctness, and the stability patterns of small models predict the correctness of large models with high precision. Conclusion: Current VLMs are fundamentally fragile; robustness evaluation should extend beyond adversarial perturbations to benign, semantics-preserving ones in order to drive more reliable models. Abstract: Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
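The stability signal described in this entry can be sketched as the fraction of perturbed inputs for which a model keeps its original answer, followed by a comparison of accuracy between stable and unstable samples. The function names, the answer normalization, and the "stable means unchanged under every perturbation" threshold are assumptions for illustration; the paper's exact definitions are not given in the abstract.

```python
def stability_score(original_answer, perturbed_answers):
    """Fraction of perturbed answers that match the original answer."""
    if not perturbed_answers:
        raise ValueError("need at least one perturbed answer")

    def normalize(text):
        # Ignore case and whitespace differences when comparing answers.
        return " ".join(text.lower().split())

    base = normalize(original_answer)
    unchanged = sum(1 for ans in perturbed_answers if normalize(ans) == base)
    return unchanged / len(perturbed_answers)


def accuracy_by_stability(records, threshold=1.0):
    """Split (stability, is_correct) records at threshold; return both accuracies.

    Returns (accuracy_of_stable, accuracy_of_unstable); a group with no
    samples yields None.
    """
    records = list(records)
    stable = [ok for score, ok in records if score >= threshold]
    unstable = [ok for score, ok in records if score < threshold]

    def accuracy(group):
        return sum(group) / len(group) if group else None

    return accuracy(stable), accuracy(unstable)
```

If stable samples are answered correctly far more often than unstable ones, as the abstract reports, the first value returned by `accuracy_by_stability` should clearly exceed the second.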

[142] One-to-N Backdoor Attack in 3D Point Cloud via Spherical Trigger

Dongmei Shan,Wei Lian,Chongxia Wang

Main category: cs.CV

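TL;DR: Proposes the first one-to-N backdoor attack framework for 3D vision, based on a configurable spherical trigger. By exploiting the spatial properties of spheres, a single trigger design maps to multiple target classes; theory and experiments across datasets and models show high attack success rates (up to 100%) without harming accuracy on clean samples.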

Details

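Motivation: Existing backdoor attacks on 3D point clouds are limited to a one-to-one paradigm, cannot represent complex multi-target threats, and lack a flexible, scalable attack model, which limits comprehensive security evaluation of 3D vision systems. Method: A novel configurable spherical trigger uses the spatial parameters of a sphere (such as position, radius, and density) to build a multi-dimensional trigger space; theoretical modeling shows that different configurations can map to different target classes, giving precise multi-label control over 3D point cloud models. Result: The method is validated on multiple 3D datasets and network architectures, with attack success rates of up to 100% while accuracy on clean samples stays high, systematically confirming that one-to-N backdoor attacks are feasible. Conclusion: The work establishes the first theoretical and practical benchmark for multi-target backdoor attacks in 3D vision, exposes the vulnerability of 3D models to this new threat, and informs the design of defenses for more secure 3D intelligent systems. Abstract: Backdoor attacks represent a critical threat to deep learning systems, particularly in safety-sensitive 3D domains such as autonomous driving and robotics. However, existing backdoor attacks for 3D point clouds have been limited to a rigid one-to-one paradigm. To address this, we present the first one-to-N backdoor framework for 3D vision, based on a novel, configurable spherical trigger. Our key insight is to leverage the spatial properties of spheres as a parameter space, allowing a single trigger design to encode multiple target classes. We establish a theoretical foundation for one-to-N backdoor attacks in 3D, demonstrating that poisoned models can map distinct trigger configurations to different target labels. Experimental results systematically validate this conclusion across multiple datasets and model architectures, achieving high attack success rates (up to 100%) while maintaining accuracy on clean data. This work establishes a crucial benchmark for multi-target threats in 3D vision and provides the foundational understanding needed to secure future 3D-driven intelligent systems.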

Mohammad Areeb Qazi,Munachiso S Nwadike,Ibrahim Almakky,Mohammad Yaqub,Numan Saeed

Main category: cs.CV

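TL;DR: Proposes MAFM^3, a framework in which lightweight modular components let a single foundation model adapt to multiple modalities and tasks in medical imaging, with notable performance gains.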

Details

Motivation: Data scarcity in medical imaging makes it hard to pre-train a separate model for every domain, modality, or task, so an efficient, unified adaptation framework is needed. Method: Modular adaptation components let the foundation model activate the appropriate capability at inference time depending on the input type or clinical objective, supporting multi-task and multi-modality extension. Result: A chest CT foundation model is extended to prognosis and segmentation tasks, and adding PET scans improves the Dice score by 5%. Conclusion: Foundation models equipped with modular components can move beyond their initial training scope and evolve into multi-task, multi-modality medical imaging systems. Abstract: Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM^3 (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM^3 provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification into prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM^3 improved the Dice score by 5% compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work can be found at https://github.com/Areeb2735/CTscan_prognosis_VLM

[144] RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting

Ruocheng Wu,Haolan He,Yufei Wang,Zhihao Li,Bihan Wen

Main category: cs.CV

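TL;DR: Proposes Guidance Score Distillation (GSD), a framework built on pretrained video diffusion models that improves 3D Gaussian Splatting reconstruction from sparse views.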

Details

Motivation: With sparse input views, 3D Gaussian Splatting tends to overfit because intermediate views are unsupervised, so a multi-view consistency prior is needed to improve reconstruction. Method: Using a pretrained video diffusion model (VDM), the GSD framework follows the idea of score distillation sampling and supervises images rendered from multiple neighboring views. A unified guidance form, combining depth warp guidance and semantic feature guidance, corrects the VDM's noise prediction so that it aligns with the true camera poses and geometry. Result: Experiments show that the method outperforms existing approaches on multiple datasets and significantly improves 3DGS reconstruction from sparse input. Conclusion: GSD makes effective use of the multi-view consistency prior in VDMs, mitigates 3DGS overfitting under sparse views, and yields more accurate, geometrically consistent 3D scene representations. Abstract: 3D Gaussian Splatting (3DGS) has recently gained great attention in 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making it challenging for direct supervision in the optimization process. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.

[145] Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End?

Kebin Wu,Fatima Albreiki

Main category: cs.CV

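TL;DR: Studies positional bias in multimodal representation models, specifically for image-text retrieval. Text encoders tend to favor the beginning of the input, while image encoders show bias at both the beginning and the end. The bias is influenced by the positional encoding scheme, the training loss, context importance, and the use of image-text pairs in multimodal training.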

Details

Motivation: Positional bias has been widely studied in text generation models, but its presence and impact in representation models, especially multimodal ones, remain unclear; this work investigates positional bias in multimodal representation models. Method: The study separates context importance from positional bias, measures the presence and extent of positional bias across models and datasets, and analyzes its causes. Result: Positional bias is prevalent in multimodal models but differs by modality: text encoders favor the beginning of the input, while image encoders show bias at both the beginning and the end. The bias arises from, or is amplified by, the positional encoding scheme, the training loss, context importance, and the nature of image-text pairing in multimodal training. Conclusion: Multimodal representation models exhibit significant positional bias with multiple sources, which future work should account for to improve model performance. Abstract: Positional bias - where models overemphasize certain positions regardless of content - has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.

[146] 3D Gaussian and Diffusion-Based Gaze Redirection

Abiram Panchalingam,Indu Bodala,Stuart Middleton

Main category: cs.CV

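TL;DR: Proposes DiT-Gaze, a high-fidelity gaze redirection framework that combines a Diffusion Transformer (DiT), a weak supervision strategy, and an orthogonality constraint loss, achieving state-of-the-art perceptual quality and redirection accuracy.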

Details

Motivation: Existing 3D Gaussian Splatting models have difficulty rendering continuous, subtle gaze changes, and higher-quality synthetic data is needed to improve the generalization of gaze estimators. Method: The DiT-Gaze framework uses a Diffusion Transformer to improve image generation quality, applies weak supervision across gaze angles using synthetically generated intermediate gaze angles, and adds an orthogonality constraint loss to disentangle the representations of gaze, head pose, and expression. Result: Experiments show that DiT-Gaze outperforms existing methods in both perceptual quality and redirection accuracy, lowering the state-of-the-art gaze error from 6.63 degrees to 6.353 degrees, a relative reduction of 4.1%. Conclusion: DiT-Gaze is an effective solution for high-fidelity gaze redirection; it improves the quality of synthetic data and helps the training and performance of gaze estimators. Abstract: High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state-of-the-art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy using synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state-of-the-art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees, providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.

[147] DoReMi: A Domain-Representation Mixture Framework for Generalizable 3D Understanding

Mingwei Xing,Xinliang Wang,Yifeng Shi

Main category: cs.CV

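TL;DR: Proposes DoReMi, a framework that combines a domain-aware expert branch with a unified representation branch and uses dynamic routing and entropy-controlled allocation for cooperative learning over multi-domain point clouds, achieving strong segmentation performance on ScanNet and S3DIS.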

Details

Motivation: Multi-domain generalization in 3D deep learning is constrained by small datasets and highly heterogeneous multi-source point clouds. Point clouds from different sensors differ widely in density and noise distribution, which causes negative transfer during multi-domain fusion, and existing methods focus on either domain-specific or domain-general features while ignoring the synergy between them. Method: DoReMi is a Mixture-of-Experts framework with a domain-aware expert branch and a pretrained unified representation branch. Domain-Guided Spatial Routing (DSR) dynamically activates the relevant experts, and Entropy-Controlled Dynamic Allocation (EDA) stabilizes expert utilization. The unified branch is pretrained with multi-attribute self-supervised learning to preserve cross-domain geometric and structural priors. Result: Evaluated on multiple 3D understanding benchmarks, DoReMi reaches 80.1% mIoU on ScanNet Val and 77.2% mIoU on S3DIS, matching or exceeding existing methods. Conclusion: DoReMi effectively fuses domain-specific and general representations for stable multi-domain point cloud learning and shows potential as a foundation framework for future 3D understanding. Abstract: The generalization of 3D deep learning across multiple domains remains constrained by the limited scale of existing datasets and the high heterogeneity of multi-source point clouds. Point clouds collected from different sensors (e.g., LiDAR scans and mesh-derived point clouds) exhibit substantial discrepancies in density and noise distribution, resulting in negative transfer during multi-domain fusion. Most existing approaches focus exclusively on either domain-aware or domain-general features, overlooking the potential synergy between them. To address this, we propose DoReMi (Domain-Representation Mixture), a Mixture-of-Experts (MoE) framework that jointly models a Domain-aware Experts branch and a unified Representation branch to enable cooperative learning between specialized and generalizable knowledge. DoReMi dynamically activates the domain-aware expert branch via Domain-Guided Spatial Routing (DSR) for context-aware expert selection and employs Entropy-Controlled Dynamic Allocation (EDA) for stable and efficient expert utilization, thereby adaptively modeling diverse domain distributions. Complemented by a frozen unified representation branch pretrained through robust multi-attribute self-supervised learning, DoReMi preserves cross-domain geometric and structural priors while maintaining global consistency. We evaluate DoReMi across multiple 3D understanding benchmarks. Notably, DoReMi achieves 80.1% mIoU on ScanNet Val and 77.2% mIoU on S3DIS, demonstrating competitive or superior performance compared to existing approaches, and showing strong potential as a foundation framework for future 3D understanding research. The code will be released soon.

[148] Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing

Cong Cao,Yujie Xu,Xiaodong Xu

Main category: cs.CV

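TL;DR: Proposes a novel few-shot style editing framework that combines Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing to address the data scarcity encountered when fine-tuning image editing models to new styles.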

Details

Motivation: General image editing models perform poorly on new styles and are hard to fine-tune effectively when paired data is limited, so a method is needed that adapts efficiently to multiple new styles with a small parameter overhead. Method: The paper proposes a Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing, optimizes where LoRA is inserted in the Diffusion in Transformer (DiT) model, and combines adversarial learning with flow matching to guide diffusion training; importance scores automatically determine the optimal rank for each layer. Result: On a benchmark dataset covering five distinct styles, the method outperforms existing state-of-the-art approaches while using significantly fewer LoRA parameters. Conclusion: The proposed MoE LoRA framework enables efficient, low-interference fine-tuning for few-shot multi-style image editing, combining parameter efficiency with generation quality, and is both scalable and practical. Abstract: In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.

[149] Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression

Zhongbin Guo,Jiahe Liu,Yushan Li,Wenyu Gao,Zhen Yang,Chenzhi Li,Xinyue Zhang,Ping Jian

Main category: cs.CV

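TL;DR: Proposes GEODE, a new architecture that decouples 3D reasoning from numerical generation to resolve the dual bottleneck that limits 3D spatial intelligence in existing vision-language models.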

Details

Motivation: Existing vision-language models struggle to understand real-world 3D spatial information because geometry-aware encoders are computationally expensive at the input stage and discrete tokenizers cannot produce precise continuous values at the output stage. Method: The GEODE architecture adds two specialized modules: a Decoupled Rationale Module (DRM) that fuses 3D data with 2D features and produces spatial reasoning tokens, and a Direct Regression Head (DRH) that uses a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. Result: The 1.5B-parameter model reaches state-of-the-art spatial reasoning performance, rivaling models of 7B parameters and above. Conclusion: GEODE's modular design resolves the input and output bottlenecks of VLMs in 3D spatial understanding and substantially improves the spatial intelligence of small models. Abstract: Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual-bottleneck: input-stage conflict between computationally exorbitant geometric-aware encoders and superficial 2D-only features, and output-stage misalignment where discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual-bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm which routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.

[150] Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs

Jitesh Chavan,Rohit Lal,Anand Kamat,Mengjia Xu

Main category: cs.CV

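TL;DR: Arcee introduces a cross-block recurrent state chain that reuses each block's terminal state-space representation as the initial condition for the next block, substantially improving Mamba on vision tasks; in particular, FID drops sharply on unconditional generation.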

Details

Motivation: When existing Mamba models process non-sequential signals such as images, each block's state-space dynamics are reinitialized from zero, discarding the previous block's terminal state representation and limiting long-range dependency modeling. Method: Arcee introduces a cross-block recurrent state chain that uses the previous block's terminal state-space representation (SSR) as the next block's initial condition and constructs a differentiable boundary map for end-to-end gradient propagation. The method is compatible with existing vision Mamba variants, adds no parameters, and incurs minimal computational overhead. Result: For unconditional generation on CelebA-HQ (256×256) with Flow Matching, Arcee reduces FID from 82.81 to 15.33, a 5.4× improvement. Conclusion: By preserving and passing state between blocks, Arcee strengthens context modeling in Mamba architectures for vision tasks and is an efficient, general, plug-and-play improvement. Abstract: State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-mamba" variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256$\times$256) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
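The difference between reinitializing each block from zero and chaining terminal states can be shown with a toy scalar recurrence. This is a heavily simplified sketch, not Arcee's implementation: real Mamba blocks use input-dependent (selective) parameters and vector states, and the abstract describes the handoff as a differentiable boundary map, which is taken to be the identity here. All names and parameter values are assumptions.

```python
def scan_block(xs, a, b, h0=0.0):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t; returns (outputs, terminal state)."""
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys, h


def run_stack(xs, params, chain_states):
    """Run stacked blocks; each block's outputs feed the next block as inputs.

    With chain_states=False every block starts from h0 = 0 (the conventional
    formulation). With chain_states=True each block starts from the previous
    block's terminal state (identity boundary map, assumed for illustration).
    """
    h_init = 0.0
    for a, b in params:
        xs, h_terminal = scan_block(xs, a, b, h_init)
        h_init = h_terminal if chain_states else 0.0
    return xs


params = [(0.5, 1.0), (0.5, 1.0)]  # hypothetical (a, b) for two blocks
baseline = run_stack([1.0, 0.0], params, chain_states=False)  # [1.0, 1.0]
chained = run_stack([1.0, 0.0], params, chain_states=True)    # [1.25, 1.125]
```

In the chained run the second block starts from the first block's terminal state of 0.5 instead of 0, so information from the first pass influences even the earliest positions of the second pass. No parameters are added, matching the abstract's "parameter-free" description.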

[151] Toward Gaze Target Detection of Young Autistic Children

Shijian Deng,Erin E. Kosloski,Siva Sai Nagender Vasireddy,Jia Li,Randi Sierra Sherwood,Feroz Mohamed Hatha,Siddhi Patel,Pamela R Rollins,Yapeng Tian

Main category: cs.CV

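TL;DR: Proposes SACF, a new AI framework for gaze target detection in autistic children, and releases AGT, the first autism gaze target dataset. By exploiting the social context of the scene, the method achieves state-of-the-art performance in real-world scenes and particularly improves detection of minority classes such as face-directed gaze.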

Details

Motivation: Autistic children often look at faces less, which produces class imbalance in existing datasets, and there are not enough professionals available for intervention. A technique that automatically detects gaze targets is therefore needed to assess and support the development of joint attention. Method: The Socially Aware Coarse-to-Fine (SACF) gaze detection framework uses a two-pathway architecture that handles social and non-social gaze separately and fuses them dynamically through a context-awareness gate module; the authors also build AGT, the first gaze target dataset of autistic children in activity scenes. Result: Experiments show state-of-the-art gaze target detection in this population, with clear gains over existing methods on the critical minority class of face-directed gaze. Conclusion: SACF addresses the class imbalance caused by gaze preference in autism datasets and offers a feasible route to automated assessment of joint attention, with clinical and practical potential. Abstract: The automatic detection of gaze targets in autistic children through artificial intelligence can be impactful, especially for those who lack access to a sufficient number of professionals to improve their quality of life. This paper introduces a new, real-world AI application for gaze target detection in autistic children, which predicts a child's point of gaze from an activity image. This task is foundational for building automated systems that can measure joint attention, a core challenge in Autism Spectrum Disorder (ASD). To facilitate the study of this challenging application, we collected the first-ever Autism Gaze Target (AGT) dataset. We further propose a novel Socially Aware Coarse-to-Fine (SACF) gaze detection framework that explicitly leverages the social context of a scene to overcome the class imbalance common in autism datasets, a consequence of autistic children's tendency to show reduced gaze to faces. It utilizes a two-pathway architecture with expert models specialized in social and non-social gaze, guided by a context-awareness gate module. The results of our comprehensive experiments demonstrate that our framework achieves new state-of-the-art performance for gaze target detection in this population, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.

[152] Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Melika Behjati,James Henderson

Main category: cs.CV

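TL;DR: Proposes a vision-language model that groups caption tokens to capture a fine-grained representation of language, improving fine-grained understanding of vision and language, and aligns these groups with an image encoder trained to discover objects.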

Details

Motivation: Existing methods mostly align image patches with language tokens, but patches and individual tokens are not interpretable and do not correspond to semantic units in real scenes; a fine-grained alignment based on groups of tokens, closer to human perception, is needed. Method: A token grouping mechanism in the model architecture clusters the tokens that describe different aspects of the scene, and these language representations are aligned with the output of an image encoder dedicated to object discovery, giving object-level cross-modal matching. Result: The method improves the fine-grained understanding of the vision-language model, and the discovered token groups are highly similar to groundable phrases in text, both qualitatively and quantitatively. Conclusion: By learning to group tokens, the model better captures semantic structure in language and aligns it with objects in the image, strengthening fine-grained semantic understanding. Abstract: Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.

[153] CountSteer: Steering Attention for Object Counting in Diffusion Models

Hyemin Boo,Hyoryung Kim,Myungjin Lee,Seunghyeon Lee,Jiyoung Lee,Jang-Hwan Choi,Hyunsoo Cho

Main category: cs.CV

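TL;DR: Proposes CountSteer, a training-free method for text-to-image generation that steers cross-attention hidden states so that the number of generated objects matches the prompt; experiments show a gain of about 4% in counting accuracy without sacrificing visual quality.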

Details

Motivation: Existing text-to-image diffusion models follow numerical instructions poorly, revealing a gap between language and visual representation. The authors find, however, that internal model signals shift systematically depending on whether the output meets the specified count, which suggests that the model implicitly senses numerical correctness. Method: Building on this implicit awareness of counting accuracy, CountSteer adjusts the model's cross-attention hidden states during inference to steer generation toward the specified object count, with no additional training. Result: CountSteer improves object-count accuracy by about 4% while preserving visual quality. Conclusion: CountSteer is a simple and effective step toward more controllable and semantically reliable text-to-image generation, and it shows the potential of using internal model signals to tighten generation control. Abstract: Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model's cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.

[154] From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs

Massimo Rizzoli,Simone Alghisi,Seyed Mahed Mousavi,Giuseppe Riccardi

Main category: cs.CV

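TL;DR: Proposes improving vision-language model fine-tuning with controlled generation of synthetic data that is unbiased, balanced in distribution, and accurately annotated; experiments show significant gains on real-world data and mitigation of common biases.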

Details

Motivation: Conventional fine-tuning relies on manually collected and annotated real-world scenes, which tends to introduce bias, errors, and distribution imbalance, leading to overfitting and imbalanced performance. Existing synthetic-data methods lack control over distribution bias and annotation quality. Method: The authors automatically construct a synthetic dataset by systematically sampling object attributes (color, shape, size, position) so that data and annotations are unbiased and balanced; they fine-tune state-of-the-art VLMs on it and assess how performance on the position task transfers to real-world data. Result: 1) Fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; 2) fine-tuning on synthetic data significantly improves performance on real data (COCO), outperforming models fine-tuned in the matched setting. Conclusion: Fine-tuning on high-quality synthetic data generated under control effectively addresses the bias and imbalance of real data and improves generalization to real-world tasks. Abstract: Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Second, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.

[155] DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

Dawei Zhu,Rui Meng,Jiefeng Chen,Sujian Li,Tomas Pfister,Jinsung Yoon

Main category: cs.CV

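TL;DR: This paper proposes DocLens, a tool-augmented multi-agent framework that improves how vision-language models understand long visual documents; a "zoom-in" mechanism localizes evidence precisely, and the framework achieves leading performance on multiple benchmarks.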

Details

Motivation: Existing vision-language models struggle to localize evidence that is spread widely across long visual documents, which limits performance and causes hallucination. Method: DocLens uses a multi-agent framework that first navigates from the full document to specific visual elements on relevant pages, then applies a sampling-adjudication mechanism to produce a reliable answer, giving fine-grained evidence localization and integration. Result: Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, stands out on vision-centric and unanswerable questions, and even surpasses human experts. Conclusion: Through stronger evidence localization, DocLens significantly improves long visual document understanding, confirming its effectiveness for complex document analysis. Abstract: Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.

[156] GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving

Fabian Schmidt,Markus Enzweiler,Abhinav Valada

Main category: cs.CV

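TL;DR: Proposes a model-agnostic method that supplies structured relational context through traffic scene graphs, improving topology-aware reasoning in vision-language driving models and markedly raising driving performance on the LangAuto benchmark.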

Details

Motivation: Existing vision-language models are trained without supervision that explicitly encodes spatial structure and dynamic interactions, which limits their ability to reason about how traffic entities influence one another. Method: Traffic scene graphs are serialized at several abstraction levels and in several formats and incorporated into language-based driving models through structured prompt templates, enabling a systematic analysis of relational supervision. Result: On the LangAuto benchmark, the driving score rises by up to 15.6% for LMDrive and 17.5% for BEVDriver, with no scene graph input needed at test time. Conclusion: Scene graph-conditioned training effectively strengthens the model's learning of relational priors and markedly improves autonomous driving planning performance. Abstract: Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvement in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.

[157] Φeat: Physically-Grounded Feature Representation

Giuseppe Vecchio,Adrien Kaiser,Rouffet Romain,Rosalie Martin,Elena Garces,Tamy Boubekeur

Main category: cs.CV

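TL;DR: This paper introduces Φeat, a physically-grounded visual backbone whose self-supervised pretraining strategy learns representations sensitive to material identity, extracting physically robust features that are invariant to geometry and illumination without explicit labels.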

Details

Motivation: Current self-supervised features entangle high-level semantics with low-level physical factors such as geometry and illumination, which limits their use in tasks that need explicit physical reasoning. A visual backbone is needed that disentangles these factors and focuses on intrinsic material properties such as reflectance and mesostructure. Method: Φeat uses a contrastive pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions, encouraging representations that stay constant for a given material yet are robust to physical variation, all without explicit labels. Result: Feature similarity analysis and a material selection task show that Φeat captures physical structure beyond semantic grouping and is strongly robust to changes in physical factors. Conclusion: Φeat shows the promise of unsupervised physical feature learning for physics-aware vision and graphics systems and points to a new direction for visual models with physical understanding. Abstract: Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.

[158] Coordinative Learning with Ordinal and Relational Priors for Volumetric Medical Image Segmentation

Haoyi Wang

Main category: cs.CV

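TL;DR: Proposes Coordinative Ordinal-Relational Anatomical Learning (CORAL), a method that improves volumetric medical image segmentation under limited annotation by combining contrastive ranking and ordinal objectives to capture local and global anatomical structure.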

Details

Motivation: Existing methods define positive and negative samples with hard binary thresholds, discarding continuous information about anatomical similarity, and they ignore the global directional consistency of anatomical progression, which distorts the feature space. Method: CORAL uses a contrastive ranking objective to exploit continuous anatomical similarity and adds an ordinal objective to enforce global directional consistency, learning anatomically meaningful feature representations. Result: On several benchmark datasets, CORAL achieves state-of-the-art segmentation performance under limited-annotation settings and learns representations with clear anatomical structure. Conclusion: CORAL effectively combines local relationships with global anatomical order and improves volumetric medical image segmentation, especially when annotated data are scarce. Abstract: Volumetric medical image segmentation presents unique challenges due to the inherent anatomical structure and limited availability of annotations. While recent methods have shown promise by contrasting spatial relationships between slices, they rely on hard binary thresholds to define positive and negative samples, thereby discarding valuable continuous information about anatomical similarity. Moreover, these methods overlook the global directional consistency of anatomical progression, resulting in distorted feature spaces that fail to capture the canonical anatomical manifold shared across patients. To address these limitations, we propose Coordinative Ordinal-Relational Anatomical Learning (CORAL) to capture both local and global structure in volumetric images. First, CORAL employs a contrastive ranking objective to leverage continuous anatomical similarity, ensuring relational feature distances between slices are proportional to their anatomical position differences. In addition, CORAL incorporates an ordinal objective to enforce global directional consistency, aligning the learned feature distribution with the canonical anatomical progression across patients. Learning these inter-slice relationships produces anatomically informed representations that benefit the downstream segmentation task. Through this coordinative learning framework, CORAL achieves state-of-the-art performance on benchmark datasets under limited-annotation settings while learning representations with meaningful anatomical structure. Code is available at https://github.com/haoyiwang25/CORAL.

[159] D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Amplitude and Pixel Spaces

Ruoqi Wang,Haitao Wang,Shaojie Guo,Qiong Luo

Main category: cs.CV

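TL;DR: Proposes D-GAP, which improves model robustness in out-of-domain settings through gradient-guided adaptive augmentation in both the frequency domain and pixel space.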

Details

Motivation: Existing augmentation methods have limited effect under domain shift, neural networks are prone to bias toward domain-specific frequency components, and frequency-domain perturbations overlook pixel-level detail. Method: Sensitivity maps in the frequency domain are computed from task gradients and used to adaptively interpolate amplitudes between source and target samples; a complementary blending step in pixel space preserves detail. Result: On four real-world datasets and three domain-adaptation benchmarks, D-GAP clearly outperforms both generic and dataset-specific augmentations, improving average OOD performance by 5.3% on the real-world datasets and 1.8% on the benchmarks. Conclusion: By combining gradient-guided augmentation in frequency and pixel space, D-GAP mitigates learning bias while preserving detail, improving out-of-domain robustness. Abstract: Out-of-domain (OOD) robustness is challenging to achieve in real-world computer vision applications, where shifts in image background, style, and acquisition instruments always degrade model performance. Generic augmentations show inconsistent gains under such shifts, whereas dataset-specific augmentations require expert knowledge and prior analysis. Moreover, prior studies show that neural networks adapt poorly to domain shifts because they exhibit a learning bias to domain-specific frequency components. Perturbing frequency values can mitigate such bias but overlooks pixel-level details, leading to suboptimal performance. To address these problems, we propose D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces), improving OOD robustness by introducing targeted augmentation in both the amplitude space (frequency space) and pixel space. Unlike conventional handcrafted augmentations, D-GAP computes sensitivity maps in the frequency space from task gradients, which reflect how strongly the model responds to different frequency components, and uses the maps to adaptively interpolate amplitudes between source and target samples. This way, D-GAP reduces the learning bias in frequency space, while a complementary pixel-space blending procedure restores fine spatial details. Extensive experiments on four real-world datasets and three domain-adaptation benchmarks show that D-GAP consistently outperforms both generic and dataset-specific augmentations, improving average OOD performance by +5.3% on real-world datasets and +1.8% on benchmark datasets.
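
The amplitude interpolation described in this abstract can be illustrated with a toy one-dimensional sketch. This is not the authors' implementation: the `sensitivity` weights are a hypothetical stand-in for the gradient-derived sensitivity maps, the function names are invented for illustration, a naive 1D DFT replaces the 2D image FFT, and the pixel-space blending step is omitted. The sketch only shows the general idea of mixing amplitudes per frequency while keeping the source phase.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns only the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def mix_amplitudes(source, target, sensitivity):
    """Interpolate amplitudes per frequency bin and keep the source phase.

    sensitivity[k] in [0, 1] is an assumed per-frequency weight: 0 keeps the
    source amplitude, 1 takes the target amplitude. For a real-valued result
    the weights should be symmetric, i.e. sensitivity[k] == sensitivity[n - k].
    """
    src_spec = dft(source)
    tgt_spec = dft(target)
    mixed = []
    for s, t, w in zip(src_spec, tgt_spec, sensitivity):
        amplitude = (1.0 - w) * abs(s) + w * abs(t)
        mixed.append(cmath.rect(amplitude, cmath.phase(s)))
    return idft(mixed)


source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
target = [2.0 * v for v in source]
unchanged = mix_amplitudes(source, target, [0.0] * 8)  # weights of 0 reproduce the source
swapped = mix_amplitudes(source, target, [1.0] * 8)    # weights of 1 take the target amplitudes
```

With all weights at 0 the output reproduces the source signal; with all weights at 1 the output carries the target's amplitudes on the source's phase.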

[160] RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image

Hengfei Wang,Zhongqun Zhang,Yihua Cheng,Hyung Jin Chang

Main category: cs.CV

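TL;DR: This paper proposes RTGaze, a real-time, high-quality gaze redirection method that learns a gaze-controllable facial representation and combines neural rendering with 3D priors to achieve efficient, high-fidelity, 3D-consistent gaze redirection.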

Details

Motivation: Existing gaze redirection methods fall short in 3D consistency, efficiency, or generation quality, which limits practical use. A new method is needed that balances speed, quality, and 3D awareness. Method: RTGaze learns a gaze-controllable facial representation from face images and gaze prompts and decodes it through neural rendering; it also distills face geometric priors from a pretrained 3D portrait generator to improve generation quality. Result: RTGaze achieves state-of-the-art efficiency, redirection accuracy, and image quality across multiple datasets, processing an image in about 0.06 seconds, 800x faster than the previous state-of-the-art 3D-aware methods. Conclusion: RTGaze delivers real-time, high-quality, 3D-aware gaze redirection with a large gain in processing speed and visual consistency, giving it strong practical potential. Abstract: Gaze redirection methods aim to generate realistic human face images with controllable eye movement. However, recent methods often struggle with 3D consistency, efficiency, or quality, limiting their practical applications. In this work, we propose RTGaze, a real-time and high-quality gaze redirection method. Our approach learns a gaze-controllable facial representation from face images and gaze prompts, then decodes this representation via neural rendering for gaze redirection. Additionally, we distill face geometric priors from a pretrained 3D portrait generator to enhance generation quality. We evaluate RTGaze both qualitatively and quantitatively, demonstrating state-of-the-art performance in efficiency, redirection accuracy, and image quality across multiple datasets. Our system achieves real-time, 3D-aware gaze redirection with a feedforward network (~0.06 sec/image), making it 800x faster than the previous state-of-the-art 3D-aware methods.

[161] SimuFreeMark: A Noise-Simulation-Free Robust Watermarking Against Image Editing

Yichao Tang,Mingyang Li,Di Miao,Sheng Li,Zhenxing Qian,Xinpeng Zhang

Main category: cs.CV

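TL;DR: Proposes SimuFreeMark, a watermarking framework that needs no noise simulation; it exploits the stability of low-frequency image components to embed watermarks in a deep feature space, significantly improving robustness to both conventional and semantic attacks.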

Details

Motivation: Existing deep learning-based image watermarking methods depend on hand-crafted noise simulation layers, which limits generalization to unseen distortions and makes complex semantic editing attacks hard to withstand. Method: After establishing that low-frequency image components are strongly resistant to attacks, the authors embed watermarks directly in the deep feature space of the low-frequency components using a pre-trained variational autoencoder (VAE), avoiding noise simulation during training entirely. Result: Experiments show that SimuFreeMark outperforms state-of-the-art methods under a wide range of conventional signal processing and semantic editing attacks while maintaining higher visual quality. Conclusion: By dropping noise simulation and relying on the inherent stability of low-frequency components, SimuFreeMark offers a more general, robust, and efficient paradigm for image watermarking. Abstract: The advancement of artificial intelligence generated content (AIGC) has created a pressing need for robust image watermarking that can withstand both conventional signal processing and novel semantic editing attacks. Current deep learning-based methods rely on training with hand-crafted noise simulation layers, which inherently limit their generalization to unforeseen distortions. In this work, we propose SimuFreeMark, a noise-simulation-free watermarking framework that circumvents this limitation by exploiting the inherent stability of image low-frequency components. We first systematically establish that low-frequency components exhibit significant robustness against a wide range of attacks. Building on this foundation, SimuFreeMark embeds watermarks directly into the deep feature space of the low-frequency components, leveraging a pre-trained variational autoencoder (VAE) to bind the watermark with structurally stable image representations. This design completely eliminates the need for noise simulation during training. Extensive experiments demonstrate that SimuFreeMark outperforms state-of-the-art methods across a wide range of conventional and semantic attacks, while maintaining superior visual quality.

Haokun Chen,Jianing Li,Yao Zhang,Jinhe Bi,Yan Xia,Jindong Gu,Volker Tresp

Main category: cs.CV

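TL;DR: This paper proposes AUVIC, a new framework for visual concept unlearning in multimodal large language models that achieves precise forgetting through adversarial perturbations, and builds VCUBench, the first benchmark for evaluating visual concept unlearning in group contexts.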

Details

Motivation: Multimodal large language models may be trained on sensitive or copyrighted data, which raises data privacy concerns, so effective removal of specific visual concepts is needed to meet regulatory requirements such as the 'right to be forgotten'. Method: The AUVIC framework applies adversarial perturbations to forget a target visual concept precisely, and the VCUBench benchmark evaluates visual concept unlearning in group contexts. Result: Experiments show that AUVIC achieves state-of-the-art target forgetting rates with minimal impact on non-target concepts. Conclusion: AUVIC forgets visual concepts in multimodal large language models effectively and precisely while keeping performance on related tasks stable, advancing machine unlearning for visual concepts. Abstract: Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.

[163] 6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data

Saptarshi Neil Sinha,Julius Kühn,Mika Silvan Goschke,Michael Weinmann

Main category: cs.CV

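TL;DR: This paper presents 6D pose estimation of strawberries trained on purely synthetic data, using the YOLOX-6D-Pose algorithm and photorealistic renderings generated with Blender; it achieves high accuracy and real-time operation on the RTX 3090 and Jetson Orin Nano and suits automated harvesting by agricultural robots.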

Details

Motivation: Labor shortages and high costs have made automated selective harvesting of fruit a research priority, yet real annotated data are scarce, which makes accurate 6D pose estimators hard to train. Method: The authors use the single-shot YOLOX-6D-Pose detection framework and build a procedural Blender rendering pipeline that generates highly realistic synthetic strawberry data for training, with no real annotated data required. Result: The models perform well on ADD-S metrics; the RTX 3090 is faster, while the Jetson Orin Nano suits edge deployment. Poses of ripe and partially ripe strawberries are inferred accurately, but detecting unripe strawberries remains challenging. Conclusion: The method uses synthetic data effectively for 6D strawberry pose estimation, can be deployed on resource-constrained devices, and can be extended to other fruits such as apples and peaches, advancing agricultural automation. Abstract: Automated and selective harvesting of fruits has become an important area of research, particularly due to challenges such as high costs and a shortage of seasonal labor in advanced economies. This paper focuses on 6D pose estimation of strawberries using purely synthetic data generated through a procedural pipeline for photorealistic rendering. We employ the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone, known for its balance between speed and accuracy, and its support for edge inference. To address the limited availability of training data, we introduce a robust and flexible pipeline for generating synthetic strawberry data from various 3D models via a procedural Blender pipeline, where we focus on enhancing the realism of the synthesized data in comparison to previous work to make it a valuable resource for training pose estimation algorithms. Quantitative evaluations indicate that our models achieve comparable accuracy on both the NVIDIA RTX 3090 and Jetson Orin Nano across several ADD-S metrics, with the RTX 3090 demonstrating superior processing speed. However, the Jetson Orin Nano is particularly suited for resource-constrained environments, making it an excellent choice for deployment in agricultural robotics. Qualitative assessments further confirm the model's performance, demonstrating its capability to accurately infer the poses of ripe and partially ripe strawberries, while facing challenges in detecting unripe specimens. This suggests opportunities for future improvements, especially in enhancing detection capabilities for unripe strawberries (if desired) by exploring variations in color. Furthermore, the methodology presented could be adapted easily for other fruits such as apples, peaches, and plums, thereby expanding its applicability and impact in the field of agricultural automation.
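
For readers unfamiliar with the ADD-S metric mentioned in this abstract, the following minimal sketch computes it in its commonly used form: the mean distance from each model point under the ground-truth pose to its nearest model point under the predicted pose, which makes the score tolerant of object symmetries. The function names and the tiny four-point "model" are hypothetical, and the thresholds the paper applies to this score are not given in the abstract and are not reproduced here.

```python
import math


def transform(points, rotation, translation):
    """Apply a rigid transform (3x3 rotation as nested lists, 3-vector translation)."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_s(model_points, rot_gt, t_gt, rot_pred, t_pred):
    """Average closest-point distance between the two posed copies of the model."""
    gt_points = transform(model_points, rot_gt, t_gt)
    pred_points = transform(model_points, rot_pred, t_pred)
    total = 0.0
    for g in gt_points:
        total += min(math.dist(g, q) for q in pred_points)
    return total / len(gt_points)


# Hypothetical model: four points forming a square in the xy-plane.
square = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
zero = (0.0, 0.0, 0.0)

symmetric_error = add_s(square, identity, zero, quarter_turn_z, zero)
offset_error = add_s(square, identity, zero, identity, (0.0, 0.0, 0.5))
```

A quarter turn about the z-axis maps the square onto itself, so `symmetric_error` is zero even though the rotation differs, whereas a pure 0.5 offset along z gives an error of 0.5.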

[164] DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

Tanveer Hannan,Dimitrios Mallios,Parth Pathak,Faegheh Sardari,Thomas Seidl,Gedas Bertasius,Mohsen Fayyaz,Sunando Sengupta

Main category: cs.CV

TL;DR: This paper presents DocSLM, an efficient small vision-language model for long-document understanding on resource-constrained edge devices.

Details

Motivation: Large Vision-Language Models (LVLMs) perform well on long, complex documents, but their high memory footprint limits deployment on edge devices, so a more efficient model is needed. Method: A Hierarchical Multimodal Compressor jointly encodes the visual, textual, and layout information of each page into a fixed-length sequence; a Streaming Abstention mechanism processes the document segment by segment and filters low-confidence responses with an entropy-based uncertainty calibrator. Result: Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency. Conclusion: DocSLM keeps performance high while sharply reducing resource use, delivering reliable multimodal document understanding on lightweight edge devices. Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
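
The idea of filtering low-confidence segment responses by entropy can be sketched as follows. This is only an illustration under stated assumptions, not DocSLM's mechanism: the abstract does not say how per-segment answers are scored or combined, so the `(answer, probabilities)` input format, the fixed `max_entropy` threshold, and the rule of keeping the lowest-entropy answer are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def stream_answer(segment_outputs, max_entropy):
    """Process segment outputs one at a time and abstain on uncertain ones.

    segment_outputs yields (answer, probs) pairs, where probs is an assumed
    distribution over candidate answers for that segment. Segments whose
    entropy exceeds max_entropy are skipped; among the rest, the answer with
    the lowest entropy is kept. Returns None if every segment abstains.
    """
    best_answer = None
    best_entropy = None
    for answer, probs in segment_outputs:
        h = entropy(probs)
        if h > max_entropy:
            continue  # abstain on this segment
        if best_entropy is None or h < best_entropy:
            best_answer, best_entropy = answer, h
    return best_answer


segments = [
    ("A", [0.5, 0.5]),    # maximally uncertain between two options
    ("B", [0.9, 0.1]),
    ("C", [0.99, 0.01]),  # most confident segment
]
chosen = stream_answer(segments, max_entropy=0.4)
```

Because segments are consumed sequentially and only the current best answer is stored, memory use does not grow with document length, which is the property the streaming design is after.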

[165] YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation

Pavel Rojtberg,Julius Kühn

Main category: cs.CV

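TL;DR: Introduces YCB-Ev SD, a synthetic event-camera dataset for 6DoF object pose estimation with 50,000 event sequences at standard definition resolution, and identifies the best-performing event representation through systematic evaluation.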

Details

Motivation: Event-based vision lacks high-quality, large-scale synthetic datasets; this work fills that gap to support CNN-based 6DoF object pose estimation. Method: Event data are generated from Physically Based Rendering (PBR) scenes following the BOP methodology, with simulated linear camera motion to cover the full scene; event representations are analyzed using time-surfaces and dual-channel polarity encoding. Result: Time-surfaces with linear decay combined with dual-channel polarity encoding give the best pose estimation performance, clearly ahead of exponential decay and single-channel alternatives; polarity information and linear temporal encoding matter most for the gains. Conclusion: YCB-Ev SD provides a reproducible benchmark and a practical data resource for event cameras in 6DoF pose estimation, advancing event-driven vision. Abstract: We introduce YCB-Ev SD, a synthetic dataset of event-camera data at standard definition (SD) resolution for 6DoF object pose estimation. While synthetic data has become fundamental in frame-based computer vision, event-based vision lacks comparable comprehensive resources. Addressing this gap, we present 50,000 event sequences of 34 ms duration each, synthesized from Physically Based Rendering (PBR) scenes of YCB-Video objects following the Benchmark for 6D Object Pose (BOP) methodology. Our generation framework employs simulated linear camera motion to ensure complete scene coverage, including background activity. Through systematic evaluation of event representations for CNN-based inference, we demonstrate that time-surfaces with linear decay and dual-channel polarity encoding achieve superior pose estimation performance, outperforming exponential decay and single-channel alternatives by significant margins. Our analysis reveals that polarity information contributes most substantially to performance gains, while linear temporal encoding preserves critical motion information more effectively than exponential decay. The dataset is provided in a structured format with both raw event streams and precomputed optimal representations to facilitate immediate research use and reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/paroj/ycbev_sd.
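
A time-surface with linear decay and dual-channel polarity encoding, the representation this abstract reports as performing best, might be built roughly as below. This is a sketch under assumptions rather than the dataset's own preprocessing code: the `(x, y, t, polarity)` event format, the use of the 34 ms sequence duration as the decay window, and the normalization to [0, 1] are illustrative choices.

```python
def time_surface(events, width, height, t_ref, window):
    """Build a 2 x height x width time-surface with linear decay.

    events: iterable of (x, y, t, polarity), polarity being 0 (OFF) or 1 (ON).
    Each polarity gets its own channel. A pixel's value decays linearly from 1
    (event exactly at t_ref) to 0 (event older than `window`), based on the
    most recent event of that polarity at that pixel.
    """
    latest = [[[None] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, polarity in events:
        if t > t_ref:
            continue  # ignore events after the reference time
        previous = latest[polarity][y][x]
        if previous is None or t > previous:
            latest[polarity][y][x] = t

    surface = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for channel in range(2):
        for y in range(height):
            for x in range(width):
                t = latest[channel][y][x]
                if t is not None:
                    surface[channel][y][x] = max(0.0, 1.0 - (t_ref - t) / window)
    return surface


# Three hypothetical events on a 2x1 sensor, timestamps in milliseconds.
events = [(0, 0, 0.0, 1), (1, 0, 17.0, 0), (0, 0, 34.0, 1)]
surface = time_surface(events, width=2, height=1, t_ref=34.0, window=34.0)
```

Here the ON channel at pixel (0, 0) is 1.0 because its latest event coincides with the reference time, and the OFF channel at pixel (1, 0) is 0.5 because its event lies halfway through the window. An exponential-decay variant would replace the `max(0.0, 1.0 - ...)` expression with an exponential of the negative age.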

[166] Free3D: 3D Human Motion Emerges from Single-View 2D Supervision

Sheng Liu,Yuanzhi Liang,Sidan Du

Main category: cs.CV

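TL;DR: Proposes Free3D, a framework that generates high-quality 3D human motion without 3D motion annotations; trained on 2D data, it performs on par with or better than fully supervised methods.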

Details

Motivation: Existing 3D human motion generation models depend on precise 3D supervision, which leads to poor generalization beyond the training distribution. Method: Free3D includes a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion into 3D-consistent latent spaces, together with a set of regularization objectives that need no 3D annotations and enforce view consistency, orientation coherence, and physical plausibility. Result: Trained purely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically plausible 3D motion, matching or exceeding fully 3D-supervised methods. Conclusion: Relaxing explicit 3D supervision helps the model understand structure and semantics and generalize better, offering a more scalable and data-efficient paradigm for 3D motion generation. Abstract: Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.

[167] Unsupervised Segmentation of Micro-CT Scans of Polyurethane Structures By Combining Hidden-Markov-Random Fields and a U-Net

Julian Grolig,Lars Griem,Michael Selzer,Hans-Ulrich Kauczor,Simon M. F. Triphan,Britta Nestler,Arnd Koeppe

Main category: cs.CV

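TL;DR: Proposes HMRF-UNet, an unsupervised segmentation method that combines Hidden Markov Random Fields (HMRF) with a convolutional neural network (CNN) to segment material images quickly and accurately without large labeled datasets, validated on μCT images of PU foam.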

Details

Motivation: Existing image segmentation methods fall short in accuracy, speed, or dependence on labeled data; a method is needed that avoids large labeled datasets while staying accurate and fast. Method: The unsupervised learning framework of HMRF is combined with a CNN (specifically a U-Net architecture) to form HMRF-UNet, with a new unsupervised HMRF loss; the authors study how different neighborhood terms affect segmentation and propose a pre-training strategy that reduces the need for ground-truth labels. Result: Without ground-truth labels, HMRF-UNet reaches high segmentation accuracy on a μCT image dataset of PU foam; segmentation is fast, and the pre-training strategy considerably reduces the amount of ground-truth data needed for training. Conclusion: HMRF-UNet is an effective unsupervised segmentation method that balances accuracy, speed, and dependence on labeled data, offering a practical way to extract digital material representations in materials science. Abstract: Extracting digital material representations from images is a necessary prerequisite for a quantitative analysis of material properties. Different segmentation approaches have been extensively studied in the past to achieve this task, but were often lacking accuracy or speed. With the advent of machine learning, supervised convolutional neural networks (CNNs) have achieved state-of-the-art performance for different segmentation tasks. However, these models are often trained in a supervised manner, which requires large labeled datasets. Unsupervised approaches do not require ground-truth data for learning, but suffer from long segmentation times and often worse segmentation accuracy. Hidden Markov Random Fields (HMRF) are an unsupervised segmentation approach that incorporates concepts of neighborhood and class distributions. We present a method that integrates HMRF theory and CNN segmentation, leveraging the advantages of both areas: unsupervised learning and fast segmentation times. We investigate the contribution of different neighborhood terms and components for the unsupervised HMRF loss. We demonstrate that the HMRF-UNet enables high segmentation accuracy without ground truth on a Micro-Computed Tomography (μCT) image dataset of Polyurethane (PU) foam structures. Finally, we propose and demonstrate a pre-training strategy that considerably reduces the required amount of ground-truth data when training a segmentation model.
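
To make the "neighborhood and class distributions" ingredients of an HMRF concrete, the sketch below evaluates a textbook HMRF energy for a hard labeling: a Gaussian data term per pixel plus a Potts-style penalty for differing 4-neighbors. This is a generic formulation assumed for illustration; the loss actually used to train the HMRF-UNet (which operates on network outputs and explores several neighborhood terms) is not specified in the abstract, and the function name and the weight `beta` are hypothetical.

```python
import math


def hmrf_energy(image, labels, means, sigmas, beta):
    """Energy of a labeling: Gaussian data term plus Potts smoothness term.

    image, labels: 2D lists of equal shape (intensities and class indices).
    means, sigmas: per-class Gaussian parameters.
    beta: assumed penalty for each pair of 4-neighbors with different labels.
    """
    height, width = len(image), len(image[0])
    energy = 0.0
    for y in range(height):
        for x in range(width):
            k = labels[y][x]
            # Negative log-likelihood of the pixel under its class Gaussian
            # (constant terms dropped).
            energy += (image[y][x] - means[k]) ** 2 / (2.0 * sigmas[k] ** 2)
            energy += math.log(sigmas[k])
            # Count each neighboring pair once (right and down neighbors).
            if x + 1 < width and labels[y][x + 1] != k:
                energy += beta
            if y + 1 < height and labels[y + 1][x] != k:
                energy += beta
    return energy


image = [[0.0, 0.0], [1.0, 1.0]]
split = hmrf_energy(image, [[0, 0], [1, 1]], means=[0.0, 1.0], sigmas=[1.0, 1.0], beta=0.25)
uniform = hmrf_energy(image, [[0, 0], [0, 0]], means=[0.0, 1.0], sigmas=[1.0, 1.0], beta=0.25)
```

In this toy case the labeling that follows the intensity edge (`split`) has lower energy than the uniform one, since the data term saved outweighs the two boundary penalties; raising `beta` shifts the balance toward smoother labelings.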

[168] Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis

Feng-Qi Cui,Jinyang Huang,Ziyu Jia,Xinyu Li,Xin Yan,Xiaokang Zhou,Meng Wang

Main category: cs.CV

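TL;DR: Proposes LSEF, a video emotion understanding framework based on the low-rank sparse principle that models emotional bases and transient fluctuations hierarchically, significantly improving robustness and dynamic discrimination.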

Details

Motivation: Existing video affective computing methods lack a mechanism to disentangle affective components (such as long-term emotional tone and short-term fluctuations), which causes model instability and representational degradation. Method: The Low-Rank Sparse Emotion Understanding Framework (LSEF) has three modules: the Stability Encoding Module (SEM) captures low-rank emotional bases, the Dynamic Decoupling Module (DDM) isolates sparse transient signals, and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence; a Rank Aware Optimization (RAO) strategy provides adaptive training. Result: Experiments on multiple datasets show that LSEF significantly improves robustness and dynamic discrimination in emotion recognition, validating the effectiveness and generality of hierarchical low-rank sparse modeling. Conclusion: By introducing a hierarchical low-rank sparse structure, LSEF disentangles stable bases from transient changes in affective dynamics, giving video affective computing a more stable and interpretable modeling paradigm. Abstract: Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone) and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules: the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.

[169] MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

Manyu Li,Ruian He,Chenxi Ma,Weimin Tan,Bo Yan

Main category: cs.CV

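TL;DR: This paper presents MicroVQA++, a large-scale, high-quality microscopy visual question answering (VQA) corpus built from the BIOMEDICA archive in three stages, which significantly improves scientific reasoning in biomedical image understanding.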

Details

Motivation: The use of multimodal large language models in biomedical imaging is limited by the lack of large-scale, high-quality training data, especially for scientific reasoning about microscopy. Method: A three-stage construction process: stage one bootstraps supervision from expert-validated figure-caption pairs; stage two introduces the HiCQA-Graph structure, which combines textual entailment, cross-modal alignment, and agent signals to filter inconsistent samples; stage three uses a multimodal large model to generate multiple-choice questions, followed by human screening. Result: The resulting MicroVQA++ dataset has a large training split and a human-checked test split whose share of high-cognitive-level hard samples exceeds that of the original MicroVQA benchmark; it lets a 4B-scale MLLM approach GPT-5 on microscopy reasoning and reach state of the art among open-source models. Conclusion: A carefully designed data construction pipeline that combines expert literature, graph-based consistency filtering, and human refinement can markedly improve the scientific reasoning of multimodal large models in specialized domains. Abstract: Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom's level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach microscopy reasoning performance competitive with models such as GPT-5 and achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.

Jiaxi Huang,Dongxu Wu,Hanwei Zhu,Lingyu Zhu,Jun Xing,Xu Wang,Baoliang Chen

Main category: cs.CV

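TL;DR: Proposes Q-Doc, a three-tiered evaluation framework that systematically assesses multimodal large language models on document image quality assessment (DIQA), finds problems such as inconsistent scoring and distortion misidentification, and shows that chain-of-thought (CoT) prompting substantially improves performance.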

Details

Motivation: Multimodal large language models (MLLMs) perform well on high-level vision tasks, but their potential for document image quality assessment (DIQA) is underexplored and no systematic evaluation method exists. Method: The Q-Doc framework evaluates at three tiers: coarse (correlation between quality scores and human annotations), middle (single-choice and multi-choice distortion-type identification), and fine (distortion-severity classification against human-annotated references); chain-of-thought (CoT) prompting is added to improve model performance. Result: Experiments show that MLLMs have nascent DIQA abilities but clear shortcomings, including inconsistent scoring, distortion-type misidentification, and severity misjudgment; CoT prompting substantially improves performance at every tier. Conclusion: Q-Doc provides an effective benchmark for the DIQA capabilities of MLLMs, reveals the weaknesses of current models in document quality perception, and indicates that reasoning mechanisms such as CoT offer a path to improvement. Abstract: The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.

[171] BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning

Lan Li,Tao Hu,Da-Wei Zhou,Han-Jia Ye,De-Chuan Zhan

Main category: cs.CV

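TL;DR: Proposes BOFA, a new framework for class-incremental learning (CIL) that performs orthogonal low-rank fusion in CLIP's cross-modal bridge layer, avoiding forgetting and improving vision-language fusion without adding parameters or inference cost.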

Details

Motivation: Existing methods that apply CLIP to class-incremental learning suffer from high model complexity and insufficient modality fusion, and are prone to forgetting old knowledge. Method: BOFA confines all adaptation to CLIP's existing cross-modal bridge layer and uses orthogonal low-rank fusion so that parameter updates lie in a low-rank safe subspace orthogonal to past task features; classification uses a cross-modal hybrid prototype. Result: Experiments on several standard benchmarks show that BOFA achieves higher classification accuracy than existing methods while remaining efficient. Conclusion: BOFA addresses model growth and knowledge forgetting in CIL, makes full use of the multi-modal representations of vision-language models, and achieves efficient, stable continual learning. Abstract: Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
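
The orthogonality constraint described in the abstract can be illustrated with a minimal sketch. This is not BOFA's Orthogonal Low-Rank Fusion itself: it only shows, at the level of a single vector, how an update can be projected so that it has no component along past task features. The helper names (`dot`, `orthonormal_basis`, `project_out`) and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update, past_features):
    """Remove from `update` every component lying in span(past_features)."""
    w = list(update)
    for b in orthonormal_basis(past_features):
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


# Toy example (assumed data): two past-task feature directions in 3-D.
past = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safe_update = project_out([3.0, 4.0, 5.0], past)
```

In this toy case only the third coordinate survives, so applying `safe_update` leaves the response to both past feature directions unchanged, which is the intuition behind updating within a subspace orthogonal to past task features.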

[172] Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment

Lukun Wu,Jie Li,Ziqi Ren,Kaifan Zhang,Xinbo Gao

Main category: cs.CV

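TL;DR: Proposes an adaptive teaching paradigm to address the asymmetry between visual and EEG signals, using a ShrinkAdapter module to adapt visual features to the EEG modality; it reaches 60.2% top-1 accuracy on zero-shot brain-to-image retrieval, surpassing the previous best method by 9.8%.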

Details

Motivation: Existing cross-modal alignment methods overlook the fundamental asymmetry between the visual and EEG modalities, namely the Fidelity Gap and the Semantic Gap, which leads to poor generalization. Method: An adaptive teaching paradigm treats vision as the "teacher" and EEG as the "student" and dynamically adjusts the teacher's knowledge structure under task guidance; the ShrinkAdapter module, with a residual-free design and a bottleneck structure, adapts visual features to the representational capacity of EEG. Result: The method reaches 60.2% top-1 accuracy on zero-shot brain-to-image retrieval, 9.8% above the previous state of the art; experiments validate the effectiveness and rationale of the paradigm. Conclusion: The work offers a new perspective on vision-EEG cross-modal alignment: the teacher modality should actively shrink and adapt to the student's capacity rather than force alignment between equals, which bridges the modality gap more effectively. Abstract: Decoding visual features from EEG signals is a central challenge in neuroscience, with cross-modal alignment as the dominant approach. We argue that the relationship between visual and brain modalities is fundamentally asymmetric, characterized by two critical gaps: a Fidelity Gap (stemming from EEG's inherent noise and signal degradation, vs. vision's high-fidelity features) and a Semantic Gap (arising from EEG's shallow conceptual representation, vs. vision's rich semantic depth). Previous methods often overlook this asymmetry, forcing alignment between the two modalities as if they were equal partners and thereby leading to poor generalization. To address this, we propose the adaptive teaching paradigm. This paradigm empowers the "teacher" modality (vision) to dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to match the "student" modality (EEG)'s capacity. We implement this paradigm with the ShrinkAdapter, a simple yet effective module featuring a residual-free design and a bottleneck structure. Through extensive experiments, we validate the underlying rationale and effectiveness of our paradigm. Our method achieves a top-1 accuracy of 60.2% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by a margin of 9.8%. Our work introduces a new perspective for asymmetric alignment: the teacher must shrink and adapt to bridge the vision-brain gap.

[173] Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

Francisco Nogueira,Alexandre Bernardino,Bruno Martins

Main category: cs.CV

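TL;DR: This paper proposes a multilingual referring expression comprehension (REC) approach: it builds a unified dataset covering 10 languages and introduces an attention-based neural architecture that uses multilingual SigLIP2 encoders for cross-lingual visual grounding; experiments show good performance and practical feasibility in multilingual settings.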

Details

Motivation: Existing REC research focuses mainly on English and cannot meet global deployment needs, so multilingual REC systems are required to improve cross-lingual visual understanding. Method: Twelve existing English REC benchmarks are expanded into a 10-language dataset through machine translation and context-based translation enhancement; an attention-anchored neural architecture uses multilingual SigLIP2 encoders to generate coarse spatial anchors, which are then refined through learned residuals. Result: The dataset contains about 8 million multilingual referring expressions covering 177,620 images and 336,882 annotated objects; the model reaches 86.9% accuracy (IoU@50) on the RefCOCO aggregate multilingual evaluation, close to the English-only result of 91.3%, and performs consistently across languages. Conclusion: The study demonstrates the feasibility of multilingual REC systems; the proposed architecture and dataset provide an effective basis for cross-lingual visual grounding and advance vision-language understanding in non-English settings. Abstract: Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g. achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at https://multilingual.franreno.com.
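
For readers unfamiliar with the IoU@50 metric quoted above, the following minimal sketch shows how such an accuracy is commonly computed for box predictions. It assumes boxes in `(x1, y1, x2, y2)` format and counts a prediction as correct when its IoU with the ground truth is at least 0.5; the authors' exact evaluation protocol is not given in the abstract, and the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)


# Toy example (assumed data): one exact match, one poor overlap (IoU = 1/7).
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
score = accuracy_at_iou(preds, targets)
```

Here the first prediction is a hit and the second is a miss, so `score` is 0.5.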

[174] WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

Wei Chow,Jiachun Pan,Yongyuan Liang,Mingze Zhou,Xue Song,Liyu Jia,Saining Zhang,Siliang Tang,Juncheng Li,Fengda Zhang,Weijia Wu,Hanwang Zhang,Tat-Seng Chua

Main category: cs.CV

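TL;DR: This paper presents WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation, comprising the large-scale dataset WEAVE-100k and the human-annotated benchmark WEAVEBench; it aims to advance research on unified multimodal models for multi-turn, context-aware image generation and editing.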

Details

Motivation: Existing datasets and benchmarks focus mainly on single-turn interactions and cannot capture the multi-turn, context-dependent nature of real-world image creation and editing, so a new framework supporting in-context interleaved comprehension and generation is needed. Method: The authors build WEAVE-100k, a dataset of 100K interleaved samples, and WEAVEBench, a benchmark based on 480 images, and design a hybrid VLM judge framework that uses both the reference image and the original image combined with editing instructions to assess multi-turn generation, visual memory, and world-knowledge reasoning. Result: Training on WEAVE-100k improves visual comprehension, image editing, and comprehension-generation collaboration, and visual-memory capabilities emerge; evaluation on WEAVEBench exposes the limitations of current methods on multi-turn, context-aware tasks. Conclusion: WEAVE offers a new perspective and foundation for studying in-context interleaved comprehension and generation in multimodal models and helps advance the multimodal field. Abstract: Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multi-modal community.

[175] The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models

Maria-Teresa De Rosa Palmini,Eva Cetinic

Main category: cs.CV

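TL;DR: This paper studies multimodal iconicity in text-to-image diffusion models and proposes an evaluation framework that separates recognizing a cultural reference from how it is realized, showing that models not only replicate but also transform cultural knowledge.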

Details

Motivation: Prior work concentrates on forgetting and neglects what models remember and how; this paper examines the balance between recognizing and reproducing cultural references. Method: A two-dimensional evaluation framework (recognition and realization) is applied to five diffusion models across 767 Wikidata-derived cultural references, with prompt perturbation experiments using synonym substitution and literal descriptions to test linguistic sensitivity. Result: The framework distinguishes replication from transformation more effectively; models tend to reproduce iconic visual structures even when the text prompt is altered; cultural alignment correlates with training data frequency, textual uniqueness, reference popularity, and creation date. Conclusion: The value of diffusion models lies in their ability to transform and recontextualize cultural knowledge, and evaluation should move beyond simple text-image matching toward richer contextual understanding. Abstract: Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition, whether a model identifies a reference, from realization, how it depicts it through replication or reinterpretation, quantified through measures capturing both dimensions. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also with textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.

[176] Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping

Guowei Zhang,Yun Zhao,Moein Khajehnejad,Adeel Razi,Levin Kuhlmann

Main category: cs.CV

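TL;DR: Proposes Hi-DREAM, a diffusion model based on the cortical hierarchy that groups fMRI signals into early, middle, and late visual streams and builds a multi-scale cortical pyramid, improving both the performance and the interpretability of natural image decoding.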

Details

Motivation: Existing diffusion-based brain-to-image decoders usually condition directly on fMRI features and ignore the hierarchical organization of visual information in the cortex, so the functional roles of different visual areas cannot be clearly separated. Method: An ROI adapter groups fMRI signals by visual pathway into a multi-scale cortical pyramid aligned with the U-Net depth; a lightweight, depth-matched ControlNet injects scale-specific cortical hints during denoising. Result: On the NSD dataset, Hi-DREAM reaches SOTA on high-level semantic metrics while maintaining good low-level fidelity. Conclusion: Organizing the conditioning signal by cortical hierarchy is an effective alternative to purely data-driven embeddings; it improves decoding quality and offers a new lens for studying the function of the visual cortex. Abstract: Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain's hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.

[177] VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Mingjie Xu,Jinpeng Chen,Yuzhi Zhao,Jason Chun Lok Li,Yue Qiu,Zekang Du,Mengyang Wu,Pingping Zhang,Kun Li,Hongzheng Yang,Wenao Ma,Jiaheng Wei,Qinbin Li,Kangcheng Liu,Wenqiang Lei

Main category: cs.CV

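TL;DR: This paper presents VP-Bench, the first benchmark for evaluating how well multimodal large language models (MLLMs) perceive and use visual prompts such as bounding boxes. It has two stages: stage 1 tests the ability to perceive visual prompts in natural scenes, and stage 2 evaluates their effectiveness on practical tasks. The authors evaluate 28 mainstream MLLMs and analyze the key factors that affect performance.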

Details

Motivation: There is no evaluation of how well MLLMs understand the visual prompts that humans use intuitively (such as boxed regions), and no systematic benchmark supports such research, so model performance under this natural form of interaction is unclear. Method: VP-Bench uses a two-stage evaluation framework: stage 1 tests VP perception with 30k visualized prompts spanning eight shapes and 355 attribute combinations; stage 2 evaluates the practical effect of VPs on downstream tasks. In total 28 MLLMs are tested, covering proprietary and open-source models. Result: Current MLLMs differ markedly in visual prompt understanding; some models use VPs effectively to improve performance, while most are sensitive to changes in VP attributes, and question arrangement and model scale also affect VP understanding. Conclusion: VP-Bench provides a systematic evaluation standard for visual prompt understanding in MLLMs, reveals shortcomings of current models in perceiving and using visual prompts, and encourages future work on interaction modes closer to human intuition. Abstract: Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.

[178] VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

Maximilian Rokuss,Moritz Langenberg,Yannick Kirchhoff,Fabian Isensee,Benjamin Hamm,Constantin Ulrich,Sebastian Regnery,Lukas Bauer,Efthimios Katsigiannopulos,Tobias Norajitra,Klaus Maier-Hein

Main category: cs.CV

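TL;DR: VoxTell is a vision-language model for text-prompted 3D medical image segmentation that achieves cross-modality zero-shot segmentation on unseen datasets and is robust to clinical language and linguistic variation.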

Details

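Motivation: Existing medical image segmentation models usually depend on fixed class labels or large amounts of annotated data and struggle with free-form clinical descriptions and new classes, so a universal model that understands natural language and supports zero-shot segmentation is needed. Method: VoxTell uses multi-stage vision-language fusion that aligns textual and visual features at multiple decoder levels, accepts free text from single words to full sentences, and is trained on more than 62K volumes covering over 1,000 anatomical and pathological classes. Result: VoxTell achieves state-of-the-art zero-shot performance on several unseen datasets, with strong cross-modality transfer, robustness to linguistic variation and clinical terminology, and accurate instance-level segmentation from real-world text. Conclusion: VoxTell demonstrates the feasibility of universal, text-prompted medical image segmentation and points to a new direction for zero-shot, cross-modality, clinically interpretable segmentation systems. Abstract: We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell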

[179] Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification

Qinghao Gao,Jianhai Qu,Yunsong Li,Weiqiang Dong

Main category: cs.CV

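TL;DR: Proposes Missing-aware Mixture-of-Loras (MaMOL), a framework that uses a dual-routing mechanism to counter the performance drop caused by missing modalities in multimodal remote sensing classification; it is efficient, robust, and transferable.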

Details

Motivation: Multimodal classification in remote sensing often loses modalities because of environmental interference, sensor failures, and similar causes; existing methods are computationally expensive and assume complete modalities during training, so they cope poorly with real-world incompleteness. Method: Modality missing is reformulated as a multi-task learning problem; the MaMOL framework contains a task-oriented dynamic router and a modality-specific-shared static router, enabling lightweight expert updates and expert sharing for parameter-efficient adaptation. Result: On several remote sensing benchmarks MaMOL shows superior robustness and generalization under different missing rates with small computational overhead; transfer experiments on natural image datasets confirm its cross-domain applicability and scalability. Conclusion: MaMOL is a general and efficient solution for incomplete multimodal learning that markedly improves classification performance and practicality when modalities are missing. Abstract: Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.

[180] Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents

Davide Napolitano,Luca Cagliero,Fabrizio Battiloro

Main category: cs.CV

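TL;DR: This paper presents the VRD-UQA benchmark for evaluating the robustness of visual large language models (VLLMs) to plausible yet unanswerable questions, particularly on multi-page visually rich documents.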

Details

Motivation: Although VLLMs perform well on visual question answering (VQA), their ability to detect unanswerable questions remains unclear, especially under semantic or layout perturbations. Method: Corrupted questions are generated by swapping entities of the same type within a document, their unanswerability is verified with a VLLM-as-a-judge approach, and 12 models are evaluated across corruption types and knowledge injection strategies. Result: Current VLLMs are limited in detecting unanswerable questions, performing poorly on page-level and document-level judgments in particular, and the corruption type significantly affects performance. Conclusion: VRD-UQA can serve as an effective framework for evaluating and improving the robustness of VLLMs on complex documents and indicates directions for building more resilient document question answering systems. Abstract: The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research examines the robustness of VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (Visually Rich Document Unanswerable Question Answering), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. Experiments, run on 12 models, analyze: (1) The VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) The effect of different types of corruption (NLP entity, document element, layout); (3) The effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.

[181] Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery

Yijie Kang,Xinliang Wang,Zhenyu Wu,Yifeng Shi,Hailong Zhu

Main category: cs.CV

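TL;DR: Proposes Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from satellite imagery; by generating at the level of individual buildings and combining OSM spatial priors, an appearance control mechanism, and an MLLM semantic-guided pipeline, it markedly improves the structural consistency and visual realism of generated cities.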

Details

Motivation: Existing 3D urban generation methods rely on large-scale 3D assets and semantic or height maps, which are hard to obtain and lack a connection to real-world appearance, limiting real-world alignment and generalization. Method: 1) OSM-based spatial priors for geometric generation from topology to building instances; 2) an appearance-guided controllable modeling mechanism that improves detail realism and style control; 3) an MLLM-driven semantic-to-geometry generation pipeline. Result: The method outperforms existing baselines in both quantitative and qualitative experiments, markedly improving structural consistency and appearance realism. Conclusion: Sat2RealCity reduces the dependence on large-scale 3D city data and achieves more realistic, controllable, real-world-aligned 3D urban generation, providing a new basis for applications such as digital twins. Abstract: Recent advances in generative modeling have substantially enhanced 3D urban generation, enabling applications in digital twins, virtual cities, and large-scale simulations. However, existing methods face two key challenges: (1) the need for large-scale 3D city assets for supervised training, which are difficult and costly to obtain, and (2) reliance on semantic or height maps, which are used exclusively for generating buildings in virtual worlds and lack connection to real-world appearance, limiting the realism and generalizability of generated cities. To address these limitations, we propose Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from real-world satellite imagery. Unlike previous city-level generation methods, Sat2RealCity builds generation upon individual building entities, enabling the use of rich priors and pretrained knowledge from 3D object generation while substantially reducing dependence on large-scale 3D city assets. Specifically, (1) we introduce the OSM-based spatial priors strategy to achieve interpretable geometric generation from spatial topology to building instances; (2) we design an appearance-guided controllable modeling mechanism for fine-grained appearance realism and style control; and (3) we construct an MLLM-powered semantic-guided generation pipeline, bridging semantic interpretation and geometric reconstruction. Extensive quantitative and qualitative experiments demonstrate that Sat2RealCity significantly surpasses existing baselines in structural consistency and appearance realism, establishing a strong foundation for real-world aligned 3D urban content creation. The code will be released soon.

[182] ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation

Kaishen Wang,Ruibo Chen,Tong Zheng,Heng Huang

Main category: cs.CV

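TL;DR: This paper presents ImAgent, a training-free unified multimodal agent framework that integrates reasoning, generation, and self-evaluation to dynamically improve the quality and semantic consistency of text-to-image generation at test time.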

Details

Motivation: Existing text-to-image models tend to produce random and inconsistent results when prompts are vague, and existing remedies depend on additional modules, which hurts test-time scaling efficiency. Method: ImAgent is a training-free multimodal agent in which, guided by a policy controller, multiple generation actions interact dynamically and self-organize within a single framework so that the generation process refines itself. Result: On image generation and editing tasks, ImAgent consistently outperforms the backbone model and still performs better in cases where the backbone fails, confirming its effectiveness under test-time scaling. Conclusion: ImAgent shows the potential of a unified agent that needs no additional training or external models to improve image quality and semantic alignment, offering a new direction for efficient adaptive generation. Abstract: Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.

[183] Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images

Roman Kinakh,Gonzalo R. Ríos-Muñoz,Arrate Muñoz-Barrutia

Main category: cs.CV

TL;DR: Proposes nnUNet-B, a Bayesian segmentation framework that infers PD-L1 expression directly from H&E-stained images and provides uncertainty estimates.

Details

Motivation: Existing immunohistochemistry (IHC) based PD-L1 assessment is resource-intensive, and a more efficient, scalable alternative is needed. Method: Built on nnUNet-v2, the method uses Multimodal Posterior Sampling (MPS), sampling multiple model checkpoints during cyclic training to approximate the posterior, which yields PD-L1 segmentation together with epistemic uncertainty estimates (via entropy and standard deviation). Result: On a lung squamous cell carcinoma dataset the mean Dice score is 0.805 and the mean IoU is 0.709, comparable to established baselines; the pixel-wise uncertainty maps correlate strongly with segmentation error, although calibration still has room for improvement. Conclusion: Uncertainty-aware PD-L1 prediction from H&E images is an important step toward scalable, interpretable clinical biomarker assessment. Abstract: Accurate assessment of PD-L1 expression is critical for guiding immunotherapy, yet current immunohistochemistry (IHC) based methods are resource-intensive. We present nnUNet-B: a Bayesian segmentation framework that infers PD-L1 expression directly from H&E-stained histology images using Multimodal Posterior Sampling (MPS). Built upon nnUNet-v2, our method samples diverse model checkpoints during cyclic training to approximate the posterior, enabling both accurate segmentation and epistemic uncertainty estimation via entropy and standard deviation. Evaluated on a dataset of lung squamous cell carcinoma, our approach achieves competitive performance against established baselines with mean Dice Score and mean IoU of 0.805 and 0.709, respectively, while providing pixel-wise uncertainty maps. Uncertainty estimates show strong correlation with segmentation error, though calibration remains imperfect. These results suggest that uncertainty-aware H&E-based PD-L1 prediction is a promising step toward scalable, interpretable biomarker assessment in clinical workflows.
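
The quantities named in the abstract (entropy and standard deviation across sampled checkpoints, Dice, IoU) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the nnUNet-B implementation: it treats a single pixel of a binary segmentation, takes one foreground probability per sampled checkpoint, and uses hypothetical function names and toy inputs.

```python
import math
from statistics import pstdev


def pixel_uncertainty(prob_samples):
    """Summarize K checkpoint predictions for one pixel.

    prob_samples: foreground probabilities, one per sampled checkpoint.
    Returns (mean probability, binary entropy of the mean, std across samples).
    """
    mean_p = sum(prob_samples) / len(prob_samples)
    eps = 1e-12  # avoids log(0) when the mean is exactly 0 or 1
    entropy = -(mean_p * math.log(mean_p + eps)
                + (1 - mean_p) * math.log(1 - mean_p + eps))
    return mean_p, entropy, pstdev(prob_samples)


def dice_and_iou(pred, target):
    """Dice and IoU for flat lists of 0/1 labels."""
    inter = sum(p & t for p, t in zip(pred, target))
    p_sum, t_sum = sum(pred), sum(target)
    union = p_sum + t_sum - inter
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Toy inputs (assumed data): checkpoints that disagree vs. agree on a pixel.
disagree = pixel_uncertainty([0.0, 1.0])
agree = pixel_uncertainty([0.9, 0.9])
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
```

When the checkpoints disagree, both the standard deviation and the entropy of the averaged prediction are high; when they agree on a confident value, both are low. That contrast is what a pixel-wise uncertainty map visualizes.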

[184] PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models

Nhat Hoang-Xuan,Minh Vu,My T. Thai,Manish Bhattarai

Main category: cs.CV

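TL;DR: This paper proposes PAS, a lightweight, training-free method for detecting object hallucinations in large vision-language models, which uses attention weights over previously generated text tokens to enable efficient intervention.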

Details

Motivation: Large vision-language models suffer from object hallucinations that make their image understanding unreliable; this paper aims to detect and mitigate the problem. Method: The Prelim Attention Score (PAS) measures the influence of previously generated tokens from attention weights, and conditional mutual information is used to analyze the dependence between the image and the predicted object. Result: PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, needs no additional forward passes, and can be computed in real time. Conclusion: PAS is an effective training-free signal for detecting and filtering object hallucinations in LVLMs in real time, improving model reliability. Abstract: Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.

[185] OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning

Xiaoyu Zheng,Xu Chen,Awais Rauf,Qifan Fu,Benedetta Monosi,Felice Rivellese,Myles J. Lewis,Shaogang Gong,Gregory Slabaugh

Main category: cs.CV

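TL;DR: This paper presents OpenUS, the first reproducible, open-source ultrasound foundation model built on large-scale public data; it uses a vision Mamba backbone and a self-adaptive masking framework that combines contrastive learning with masked image modeling to improve feature extraction and pre-training effectiveness.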

Details

Motivation: Ultrasound image interpretation is highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types; together with speckle noise, low contrast, and the lack of standardized annotations, this limits the development of generalizable, label-efficient AI models. Method: OpenUS uses a vision Mamba backbone and introduces a self-adaptive masking framework that combines contrastive learning with masked image modeling, using the teacher network's attention map and the student network's reconstruction loss to adjust the masking of clinically relevant regions; a dynamic learning schedule progressively increases pre-training difficulty. Result: The authors compile the largest public ultrasound dataset to date, with 308K images from 42 public datasets covering diverse anatomical regions, institutions, devices, and disease types; the pre-trained OpenUS model shows good label efficiency and adaptability on downstream tasks. Conclusion: OpenUS is the first open-source ultrasound foundation model; with self-adaptive masking and a dynamic learning strategy it learns effective features from diverse ultrasound data and provides a reproducible foundation for label-efficient ultrasound AI applications. Abstract: Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher's attention map with student reconstruction loss, adaptively refining clinically-relevant masking to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest to-date public ultrasound dataset comprising over 308K images from 42 publicly available datasets, covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at https://github.com/XZheng0427/OpenUS.

[186] CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation

Luthira Abeykoon,Ved Patel,Gawthaman Senthilvelan,Darshan Kasundra

Main category: cs.CV

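TL;DR: This paper proposes CVChess, a deep learning framework that converts smartphone photos of a physical chessboard into FEN and passes the result to an online chess engine to suggest the best move.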

Details

Motivation: Online chess platforms became widely popular during the pandemic and viewership grew sharply, but physical boards lack comparable assistance tools, leaving a gap between digital and physical chess experiences that this work aims to close. Method: A convolutional neural network (CNN) with residual layers performs piece recognition. The pipeline consists of image preprocessing and edge detection with the Hough Line Transform, a projective transform to obtain a top-down view, segmentation of the board into 64 squares, and classification of each square into 13 classes (6 white pieces, 6 black pieces, and empty) with the residual CNN. Result: The system was trained and evaluated on the ChessReD dataset, which contains 10,800 annotated images covering varied lighting conditions and camera angles. The model can accurately generate FEN strings, which can be fed to a chess engine to obtain the best move. Conclusion: CVChess automates the conversion of real-world chessboard images into FEN that a chess engine can parse, providing intelligent assistance for physical chess games and narrowing the gap between physical and digital chess experiences. Abstract: Chess has experienced a large increase in viewership since the pandemic, driven largely by the accessibility of online learning platforms. However, no equivalent assistance exists for physical chess games, creating a divide between analog and digital chess experiences. This paper presents CVChess, a deep learning framework for converting chessboard images to Forsyth-Edwards Notation (FEN), which is then fed into online chess engines to suggest the best next move. Our approach employs a convolutional neural network (CNN) with residual layers to perform piece recognition from smartphone camera images. The system processes RGB images of a physical chess board through a multistep process: image preprocessing using the Hough Line Transform for edge detection, projective transform to achieve a top-down board alignment, segmentation into 64 individual squares, and piece classification into 13 classes (6 unique white pieces, 6 unique black pieces and an empty square) using the residual CNN. Residual connections help retain low-level visual features while enabling deeper feature extraction, improving accuracy and stability during training. We train and evaluate our model using the Chess Recognition Dataset (ChessReD), containing 10,800 annotated smartphone images captured under diverse lighting conditions and angles. The resulting classifications are encoded as an FEN string, which can be fed into a chess engine to generate the optimal move.
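The last step of the pipeline in the abstract, encoding per-square classifications as FEN, can be illustrated with a minimal sketch. It is not the authors' code: the function name `board_to_fen_placement`, the label set `LABELS`, and the use of "." for an empty square are assumptions made for illustration. The sketch produces only the piece-placement field of FEN; the remaining fields (side to move, castling rights, en passant square, move counters) cannot be read from a single image and would have to be supplied separately.

```python
# Hypothetical label set: the 12 FEN piece letters plus "." for an empty square (13 classes).
LABELS = set("prnbqkPRNBQK.")


def board_to_fen_placement(board):
    """Encode an 8x8 grid of labels as the FEN piece-placement field.

    `board` is assumed to list ranks from 8 down to 1, each from file a to file h,
    which matches a top-down view from White's side.
    """
    if len(board) != 8 or any(len(rank) != 8 for rank in board):
        raise ValueError("board must be 8x8")
    ranks = []
    for rank in board:
        parts, empty = [], 0
        for label in rank:
            if label not in LABELS:
                raise ValueError(f"unknown label: {label!r}")
            if label == ".":
                empty += 1          # count a run of empty squares
                continue
            if empty:
                parts.append(str(empty))  # flush the run as a digit
                empty = 0
            parts.append(label)
        if empty:
            parts.append(str(empty))
        ranks.append("".join(parts))
    return "/".join(ranks)


START = [
    "rnbqkbnr",
    "pppppppp",
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",
    "RNBQKBNR",
]

print(board_to_fen_placement(START))
# rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
```

Runs of empty squares are collapsed into a single digit per rank, so a misclassified empty square changes the digit counts as well as the piece letters.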

[187] Bridging Hidden States in Vision-Language Models

Benjamin Fein-Ashley,Jacob Fein-Ashley

Main category: cs.CV

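TL;DR: The paper proposes BRIDGE, a lightweight fusion module that adds bidirectional cross-attention layers near the top of the encoders to directly align the hidden states of the vision and language modalities, improving cross-modal understanding while retaining bi-encoder efficiency.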

Details

Motivation: Existing vision-language models typically use early or late fusion and often depend on an autoregressive decoder, which makes it hard to exploit the structure each modality already carries (spatial layout in vision; syntax and semantics in text). A more natural way to align the modalities is therefore needed. Method: A lightweight fusion module made of a few cross-only, bidirectional attention layers is placed near the top of both encoders. Each layer projects the vision and text hidden states into a shared space, attends across modalities, and returns updates through gated residual connections, with simple stabilizers to improve alignment. The encoders stay non-causal, and generation is decoupled through an optional decoder. Result: On standard retrieval, visual question answering (VQA), and visual reasoning benchmarks, BRIDGE outperforms comparable vision-language models while retaining the bi-encoder efficiency of contrastive models. Conclusion: By directly aligning the hidden states of the vision and language modalities, BRIDGE achieves more effective cross-modal fusion, performs well across tasks, and balances model efficiency with generation flexibility. Abstract: Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
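To make the layer described in the abstract concrete (project both hidden-state sequences into a shared space, attend across modalities in both directions, send gated residual updates back), here is a minimal pure-Python sketch. It is not the BRIDGE implementation. Everything in it is an assumption for illustration: the names `bridge_layer` and `cross_attend`, single-head attention, square projection matrices so the shared space has the same width as the hidden states, reuse of the projected context as both keys and values, and a fixed scalar gate per modality. The "simple stabilizers" mentioned in the abstract are not specified there and are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over all context rows."""
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, [list(col) for col in zip(*context)])  # queries @ context^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, context)


def bridge_layer(vision, text, w_v, w_t, gate_v, gate_t):
    """One cross-only, bidirectional layer with gated residual updates (hypothetical)."""
    v_shared = matmul(vision, w_v)  # project vision states into the shared space
    t_shared = matmul(text, w_t)    # project text states into the shared space
    v_update = cross_attend(v_shared, t_shared)  # vision attends to text
    t_update = cross_attend(t_shared, v_shared)  # text attends to vision
    new_v = [[h + gate_v * u for h, u in zip(hs, us)] for hs, us in zip(vision, v_update)]
    new_t = [[h + gate_t * u for h, u in zip(hs, us)] for hs, us in zip(text, t_update)]
    return new_v, new_t


identity = [[1.0, 0.0], [0.0, 1.0]]
vision = [[1.0, 0.0], [0.0, 1.0]]  # two vision tokens
text = [[2.0, 3.0]]                # one text token
new_v, new_t = bridge_layer(vision, text, identity, identity, gate_v=1.0, gate_t=1.0)
print(new_v)
print(new_t)
```

There is no self-attention inside the layer, which is what "cross-only" means here, and a gate of 0 leaves the encoder states unchanged, so in this sketch the module acts as an additive correction on top of the encoders.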

[188] LARM: A Large Articulated-Object Reconstruction Model

Sylvia Yuan,Ruoxi Shi,Xinyue Wei,Xiaoshuai Zhang,Hao Su,Minghua Liu

Main category: cs.CV

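TL;DR: LARM is a unified feedforward framework that jointly recovers detailed geometry, realistic textures, and accurate joint structures of 3D articulated objects from sparse-view images, improving on existing methods in both efficiency and reconstruction quality.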

Details

Motivation: Existing methods for reconstructing 3D articulated objects either require dense multi-view inputs and expensive per-instance optimization, which limits scalability, or are faster feedforward methods that often produce coarse geometry, lack texture reconstruction, and rely on complex, brittle multi-stage pipelines. An efficient, high-quality, and scalable unified framework is therefore needed. Method: LARM builds on LVSM, a recent novel view synthesis (NVS) method for static 3D objects, and extends it to articulated objects with a transformer-based architecture that jointly reasons over camera pose and articulation variation. It also produces auxiliary outputs such as depth maps and part masks to support explicit 3D mesh extraction and joint estimation, and the pipeline requires no dense supervision. Result: Experiments show that LARM outperforms state-of-the-art methods in novel view synthesis, state synthesis, and 3D articulated object reconstruction, producing high-quality meshes that closely match the input images across diverse object categories. Conclusion: LARM is an efficient, unified feedforward framework that reconstructs high-quality 3D articulated objects from sparse-view inputs, capturing geometric detail, realistic texture, and kinematic structure, with good scalability and practical potential. Abstract: Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: https://sylviayuan-sy.github.io/larm-site/

Table of Contents

cs.CL

[1] Unsupervised Cycle Detection in Agentic Applications

[2] Data Analysis and Performance Evaluation of Simulation Deduction Based on LLMs

[3] Cognitively-Inspired Episodic Memory Architectures for Accurate and Efficient Character AI

[4] Hybrid Quantum Transformer for Language Generation

[5] Empirical Characterization of Temporal Constraint Processing in LLMs

[6] Spectral Neuro-Symbolic Reasoning II: Semantic Node Merging, Entailment Filtering, and Knowledge Graph Alignment

[7] Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models

[8] Patent Representation Learning via Self-supervision

[9] Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

[10] Information Extraction From Fiscal Documents Using LLMs

[11] Test-Time Steering for Lossless Text Compression via Weighted Product of Experts

[12] Bayesian Evaluation of Large Language Model Behavior

[13] Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish

[14] Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models

[15] Evaluating LLM Understanding via Structured Tabular Decision Simulations

[16] Forecasting Spoken Language Development in Children with Cochlear Implants Using Preimplantation MRI

[17] Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

[18] Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

[19] Large language models in materials science and the need for open-source approaches

[20] Continual Learning of Domain Knowledge from Human Feedback in Text-to-SQL

[21] Learn to Select: Exploring Label Distribution Divergence for In-Context Demonstration Selection in Text Classification

[22] Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models

[23] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI

[24] A methodological analysis of prompt perturbations and their effect on attack success rates

[25] Modeling and Predicting Multi-Turn Answer Instability in Large Language Models

[26] Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data

[27] Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

[28] Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

[29] Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate

[30] Where does an LLM begin computing an instruction?

[31] "As Eastern Powers, I will veto." : An Investigation of Nation-level Bias of Large Language Models in International Relations

[32] $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

[33] Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs

[34] TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English

[35] Sabiá: Um Chatbot de Inteligência Artificial Generativa para Suporte no Dia a Dia do Ensino Superior

[36] LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

[37] Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

[38] Reinforcing Stereotypes of Anger: Emotion AI on African American Vernacular English

[39] Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs

[40] From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

[41] ICX360: In-Context eXplainability 360 Toolkit

[42] A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge

[43] MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking

[44] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

[45] Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering

[46] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

[47] Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy

[48] Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D

[49] CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology

[50] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

[51] When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets

[52] Automata-Based Steering of Large Language Models for Diverse Structured Generation

[53] Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

[54] Can LLMs Detect Their Own Hallucinations?

[55] Analysing Personal Attacks in U.S. Presidential Debates

[56] AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

[57] Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion

[58] Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

[59] PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases

[60] Adverbs Revisited: Enhancing WordNet Coverage of Adverbs with a Supersense Taxonomy

[61] LANE: Lexical Adversarial Negative Examples for Word Sense Disambiguation

[62] KGQuest: Template-Driven QA Generation from Knowledge Graphs with LLM-Based Refinement

[63] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

[64] destroR: Attacking Transfer Models with Obfuscous Examples to Discard Perplexity

[65] LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models

[66] NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery

[67] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

[68] M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text

[69] Studies with impossible languages falsify LMs as models of human language

[70] MajinBook: An open catalogue of digital world literature with likes

[71] Proactive Hearing Assistants that Isolate Egocentric Conversations

[72] W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search

[73] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

cs.CV

[74] A Mathematical Framework for AI Singularity: Conditions, Bounds, and Control of Recursive Improvement

[75] Semantic VLM Dataset for Safe Autonomous Driving

[76] Fast Data Attribution for Text-to-Image Models

[77] Expert Consensus-based Video-Based Assessment Tool for Workflow Analysis in Minimally Invasive Colorectal Surgery: Development and Validation of ColoWorkflow