cs.CL

[1] Evaluating LLMs' Reasoning Over Ordered Procedural Steps

Adrita Anika,Md Messal Monem Miah

Main category: cs.CL

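TL;DR: Studies how well large language models can reconstruct ordered procedural sequences from shuffled steps under zero-shot and few-shot settings, using a recipe dataset and evaluating with ranking and sequence-alignment metrics.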

Details

Motivation: To explore the capabilities of large language models in procedural reasoning, particularly on tasks where the order of steps affects the outcome. Method: Evaluates several LLMs on a curated recipe dataset under zero-shot and few-shot settings, using Kendall's Tau, NLCS, and NED for a comprehensive evaluation. Result: Model performance declines as sequence length increases, and degrades further as step displacement in the input grows. Conclusion: Current LLMs are limited in handling longer and more heavily shuffled procedural sequences. Abstract: Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall's Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
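
To make the three ordering metrics named in this abstract concrete, here is a minimal Python sketch. It assumes each recipe step has a unique id and that a prediction is a permutation of the gold order. The function names and the normalizations (LCS divided by the gold length, edit distance divided by the longer sequence length) are illustrative assumptions, since the abstract does not state the paper's exact definitions.

```python
from itertools import combinations


def kendall_tau(pred, gold):
    """Kendall's Tau between two orderings of the same unique step ids."""
    pos = {step: i for i, step in enumerate(pred)}
    n = len(gold)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for a, b in combinations(gold, 2):  # a precedes b in the gold order
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def nlcs(pred, gold):
    """LCS length normalized by the gold length (assumed normalization)."""
    return lcs_length(pred, gold) / max(len(gold), 1)


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def ned(pred, gold):
    """Edit distance normalized by the longer length (assumed normalization)."""
    return edit_distance(pred, gold) / max(len(pred), len(gold), 1)
```

For a gold order `[0, 1, 2, 3]` and a prediction `[0, 2, 1, 3]` that swaps two adjacent steps, this sketch gives a Tau of about 0.67, an NLCS of 0.75, and an NED of 0.5. The three scores penalize the same error differently, which is one way to read the abstract's remark that they capture complementary aspects of ordering quality.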

[2] Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

Peiyu Li,Xiuxiu Tang,Si Chen,Ying Cheng,Ronald Metoyer,Ting Hua,Nitesh V. Chawla

Main category: cs.CL

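TL;DR: ATLAS is an adaptive testing framework based on Item Response Theory (IRT). Using Fisher information-guided item selection, it estimates LLM ability with high precision from only a small number of items, substantially reducing evaluation cost, and it exposes uneven item quality and ranking bias in conventional static benchmarks.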

Details

Motivation: Conventional LLM evaluation computes average accuracy over a fixed item set, ignoring differences in item quality and informativeness, which makes evaluation inefficient and vulnerable to annotation errors. A more efficient and precise dynamic evaluation method is needed. Method: Proposes ATLAS, which combines Item Response Theory (IRT) with Fisher information maximization to adaptively select the most informative test items; analyzes five major benchmarks to identify low-quality items; and validates the approach on datasets such as HellaSwag. Result: ATLAS achieves 90% item reduction while maintaining measurement precision; on HellaSwag, 42 items (out of 5,608) match full-benchmark estimates with an MAE of 0.154. Item exposure rates stay below 10% and test overlap is 16-27%. Between 3% and 6% of items exhibit negative discrimination, indicating annotation errors. IRT ranks differ from accuracy ranks, with 23-31% of models shifting by more than 10 rank positions. Conclusion: ATLAS substantially reduces the number of items needed for evaluation while maintaining precision, and it exposes item-quality problems and ranking bias in static evaluation, offering a more principled and economical approach to large-scale model evaluation. Abstract: Large language model evaluation requires thousands of benchmark items, making evaluations expensive and slow. Existing methods compute average accuracy across fixed item sets, treating all items equally despite varying quality and informativeness. We present ATLAS, an adaptive testing framework using Item Response Theory (IRT) to estimate model ability through Fisher information-guided item selection. Our analysis of five major benchmarks reveals that 3-6% of items exhibit negative discrimination, indicating annotation errors that corrupt static evaluation. ATLAS achieves 90% item reduction while maintaining measurement precision: on HellaSwag (5,608 items), we match full-benchmark estimates using only 42 items with 0.154 MAE. Our framework maintains item exposure rates below 10% and test overlap at 16-27%, compared to static benchmarks where every model sees all items (100% exposure). Among 4,000+ tested models, IRT ranks differ from accuracy ranks: models with the same accuracy get different IRT scores, and 23-31% of all models shift by more than 10 rank positions. Code and calibrated item banks are available at https://github.com/Peiyu-Georgia-Li/ATLAS.git.
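
The idea of Fisher information-guided item selection can be sketched in a few lines of Python. This is an illustration only and is not taken from the ATLAS code base: it assumes a two-parameter logistic (2PL) IRT model, which the abstract does not specify, and the item bank, function names, and parameter values are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a model of ability theta answers the item correctly.

    a is the item's discrimination, b its difficulty (assumed 2PL form).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, items, administered):
    """Pick the not-yet-administered item with maximal information at theta.

    items maps an item id to its (a, b) parameters.
    """
    candidates = [i for i in items if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


# Hypothetical calibrated item bank: id -> (discrimination, difficulty).
ITEM_BANK = {
    "easy": (1.0, -2.0),
    "mid": (1.0, 0.0),
    "hard": (1.0, 2.0),
    "flawed": (-0.5, 0.0),  # negative discrimination
}
```

With this bank, `select_next_item(0.0, ITEM_BANK, set())` picks `"mid"`, because a 2PL item is most informative when its difficulty is close to the current ability estimate. Note that the information term depends on the square of the discrimination, so it stays positive for the `"flawed"` item; in this sketch, items with negative discrimination would have to be screened out separately.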

[3] SARC: Sentiment-Augmented Deep Role Clustering for Fake News Detection

Jingqing Wang,Jiaxing Shang,Rong Xu,Fei Hao,Tianjin Huang,Geyong Min

Main category: cs.CL

TL;DR: Proposes SARC, a sentiment-augmented role clustering framework, to improve fake news detection performance.

Details

Motivation: Existing methods typically treat sentiment features as auxiliary signals and overlook the fact that the same sentiment polarity may originate from users with distinct roles, which limits detection effectiveness. Method: Generates user features by jointly representing comment text with a BiGRU and an attention mechanism and combining this with sentiment encoding; builds a differentiable deep clustering module to categorize user roles automatically; and adopts a joint objective that optimizes role clustering and fake news detection together. Result: Experiments on two benchmark datasets, RumourEval-19 and Weibo-comp, show that SARC outperforms baseline models on all metrics. Conclusion: SARC improves fake news detection through sentiment-augmented role clustering, supporting the importance of accounting for differences in user roles. Abstract: Fake news detection has been a long-standing research focus in social networks. Recent studies suggest that incorporating sentiment information from both news content and user comments can enhance detection performance. However, existing approaches typically treat sentiment features as auxiliary signals, overlooking role differentiation, that is, the same sentiment polarity may originate from users with distinct roles, thereby limiting their ability to capture nuanced patterns for effective detection. To address this issue, we propose SARC, a Sentiment-Augmented Role Clustering framework which utilizes sentiment-enhanced deep clustering to identify user roles for improved fake news detection. The framework first generates user features through joint comment text representation (with BiGRU and Attention mechanism) and sentiment encoding. It then constructs a differentiable deep clustering module to automatically categorize user roles. Finally, unlike existing approaches which take the fake news label as the only supervision signal, we propose a joint optimization objective integrating role clustering and fake news detection to further improve the model performance. Experimental results on two benchmark datasets, RumourEval-19 and Weibo-comp, demonstrate that SARC achieves superior performance across all metrics compared to baseline models. The code is available at: https://github.com/jxshang/SARC.

[4] Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng,Vidhisha Balachandran,Chan Young Park,Faeze Brahman,Sachin Kumar

Main category: cs.CL

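TL;DR: Reframes instruction hierarchy resolution as a reasoning task. By constructing the VerIH dataset and training with lightweight reinforcement learning, the model learns to prioritize higher-priority instructions, improving the controllability and robustness of LLMs in complex scenarios.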

Details

Motivation: In real-world decision-making, LLMs must handle conflicting instructions from different sources. Ensuring that higher-priority instructions (such as system safety policies) override lower-priority requests (such as user inputs) is critical to model reliability and controllability. Method: Treats instruction hierarchy resolution as a reasoning task in which the model first reasons about the relationship between the user prompt and higher-priority system instructions; constructs the VerIH dataset, which contains both aligned and conflicting instructions; and fine-tunes with lightweight reinforcement learning. Result: The fine-tuned models perform better on instruction-following and instruction-hierarchy benchmarks, and the reasoning ability generalizes to safety-critical settings beyond the training distribution, improving robustness against jailbreak and prompt injection attacks. Conclusion: Reasoning over instruction hierarchies is a practical path to reliable, controllable LLMs, where updates to system prompts yield controllable and robust changes in model behavior. Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

[5] EncouRAGe: Evaluating RAG Local, Fast, and Reliable

Jan Strich,Adeline Scharfenberg,Chris Biemann,Martin Semmann

Main category: cs.CL

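TL;DR: EncouRAGe is a Python framework that streamlines the development and evaluation of Retrieval-Augmented Generation (RAG) systems built on large language models and embedding models. It comprises five modular components and supports flexible experimentation and extensible development.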

Details

Motivation: To improve the reproducibility, evaluation diversity, and local deployability of RAG systems, and to let researchers efficiently assess datasets within RAG workflows. Method: Designs a framework with five modules (Type Manifest, RAG Factory, Inference, Vector Store, and Metrics), and implements and evaluates it on multiple benchmark datasets. Result: RAG still underperforms the Oracle Context; Hybrid BM25 achieves the best results on all four datasets; reranking yields only marginal performance gains while increasing response latency. Conclusion: EncouRAGe offers a modular, extensible, and reproducible framework for developing and evaluating RAG systems, supporting further RAG research. Abstract: We introduce EncouRAGe, a comprehensive Python framework designed to streamline the development and evaluation of Retrieval-Augmented Generation (RAG) systems using Large Language Models (LLMs) and Embedding Models. EncouRAGe comprises five modular and extensible components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics, facilitating flexible experimentation and extensible development. The framework emphasizes scientific reproducibility, diverse evaluation metrics, and local deployment, enabling researchers to efficiently assess datasets within RAG workflows. This paper presents implementation details and an extensive evaluation across multiple benchmark datasets, including 25k QA pairs and over 51k documents. Our results show that RAG still underperforms compared to the Oracle Context, while Hybrid BM25 consistently achieves the best results across all four datasets. We further examine the effects of reranking, observing only marginal performance improvements accompanied by higher response latency.

[6] multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

K M Sajjadul Islam,John Fields,Praveen Madiraju

Main category: cs.CL

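TL;DR: Proposes multiMentalRoBERTa, a fine-tuned RoBERTa model for multiclass detection of common mental health conditions (such as depression, anxiety, PTSD, and suicidal ideation) from social media text. It achieves superior performance in both the six-class and five-class setups, uses explainability methods to analyze key lexical features, emphasizes fairness and human-in-the-loop safety mechanisms, and is presented as lightweight, robust, and deployable.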

Details

Motivation: Early identification of mental health problems on social media is critical for timely intervention and support, but semantic overlap and class confusion among different mental states call for more precise, interpretable, and reliable automatic detection models. Method: Builds and fine-tunes a RoBERTa model (multiMentalRoBERTa) for multiclass classification on multiple curated datasets; analyzes class overlaps through data exploration; compares performance against traditional machine learning methods, domain-specific models (such as MentalBERT), and LLM prompting approaches; and applies Layer Integrated Gradients and KeyBERT for explainability analysis to identify key lexical cues. Result: multiMentalRoBERTa reaches macro F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup, outperforming the compared methods. Depression and suicidal ideation are strongly correlated, as are anxiety and PTSD, while stress is a broad, overlapping category. The explainability analysis identifies lexical cues that distinguish depression from suicidal ideation. Conclusion: Fine-tuned transformers perform well and interpretably in mental health text detection. multiMentalRoBERTa is presented as a lightweight, robust, and deployable solution for mental health support platforms, but it should be combined with bias mitigation and human-in-the-loop safety mechanisms for reliable real-world use. Abstract: The early detection of mental health disorders from social media text is critical for enabling timely support, risk assessment, and referral to appropriate resources. This work introduces multiMentalRoBERTa, a fine-tuned RoBERTa model designed for multiclass classification of common mental health conditions, including stress, anxiety, depression, post-traumatic stress disorder (PTSD), suicidal ideation, and neutral discourse. Drawing on multiple curated datasets, data exploration is conducted to analyze class overlaps, revealing strong correlations between depression and suicidal ideation as well as anxiety and PTSD, while stress emerges as a broad, overlapping category. Comparative experiments with traditional machine learning methods, domain-specific transformers, and prompting-based large language models demonstrate that multiMentalRoBERTa achieves superior performance, with macro F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup (excluding stress), outperforming both fine-tuned MentalBERT and baseline classifiers. Beyond predictive accuracy, explainability methods, including Layer Integrated Gradients and KeyBERT, are applied to identify lexical cues that drive classification, with a particular focus on distinguishing depression from suicidal ideation.
The findings emphasize the effectiveness of fine-tuned transformers for reliable and interpretable detection in sensitive contexts, while also underscoring the importance of fairness, bias mitigation, and human-in-the-loop safety protocols. Overall, multiMentalRoBERTa is presented as a lightweight, robust, and deployable solution for enhancing support in mental health platforms.

[7] Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

Haneen Al-Homoud,Asma Ibrahim,Murtadha Al-Jubran,Fahad Al-Otaibi,Yazeed Al-Harbi,Daulet Toibazar,Kesen Wang,Pedro J. Moreno

Main category: cs.CL

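TL;DR: Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for OCR and document understanding. It contains over 2.5 million samples spanning text, tables, and charts, and yields consistent improvements on multimodal tasks.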

Details

Motivation: To address the lack of high-quality annotated data for Arabic in Optical Character Recognition (OCR) and Document Understanding (DU). Method: Generates a synthetic dataset of text, tables, and charts through a rendering pipeline that uses authentic scanned backgrounds, bilingual text layouts, and diacritic-aware fonts. Result: Fine-tuning Qwen-2.5-VL on the corpus improves Word Error Rate (WER) and Character Error Rate (CER) on multiple Arabic benchmarks, and the TEDS and CharTeX scores for table and chart parsing improve as well. Conclusion: SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis. Abstract: Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU). The dataset comprises over 2.5 million samples, including 1.5 million text samples, 270K fully annotated tables, and hundreds of thousands of charts based on real data. Our pipeline leverages authentic scanned backgrounds, bilingual layouts, and diacritic-aware fonts to capture the typographic and structural complexity of Arabic documents. In addition to text, the corpus includes a variety of rendered styles for charts and tables. Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in Word Error Rate (WER) and Character Error Rate (CER) for OCR across multiple public Arabic benchmarks; Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharTeX) improve as well for the other modalities. SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis.

[8] Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

Song Wang,Zihan Chen,Peng Wang,Zhepei Wei,Zhen Tan,Yu Meng,Cong Shen,Jundong Li

Main category: cs.CL

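TL;DR: WinnowRAG is a model-agnostic retrieval-augmented generation framework that requires no model fine-tuning. It combines query-aware clustering with LLM agents and a critic LLM that winnows documents, filtering out noisy documents while retaining valuable information to improve the accuracy of generated responses.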

Details

Motivation: In conventional RAG, retrieving more documents tends to introduce substantial noise, which makes generated answers less accurate. A mechanism is needed that separates relevant from irrelevant documents so that retrieved information is used more effectively. Method: WinnowRAG has two stages. Stage I performs query-aware clustering, grouping similar documents and assigning each cluster to an LLM agent that generates an answer. Stage II introduces a critic LLM that evaluates the agents' outputs and iteratively separates useful documents from noisy ones, using two merging strategies to retain useful documents when agents are discarded. Result: Experiments on several realistic datasets show that WinnowRAG outperforms state-of-the-art baselines. Conclusion: Through systematic winnowing, WinnowRAG addresses the noise problem in multi-document RAG and offers an adaptable RAG framework that requires no fine-tuning. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources to address their limitations in accessing up-to-date or specialized information. A natural strategy to increase the likelihood of retrieving relevant information is to expand the number of retrieved documents. However, involving more documents could introduce significant noise, as many documents may be irrelevant or misleading, thereby reducing the overall accuracy of the generated responses. To overcome the challenge associated with handling a larger number of documents, we propose WinnowRAG, a novel RAG framework designed to systematically filter out noisy documents while preserving valuable content -- a process we refer to as winnowing. WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters. Each cluster is assigned to an LLM agent for generating a unique answer. In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones. To retain useful documents when discarding agents, we propose two strategic merging techniques to ensure that only relevant knowledge is used for generating the final response. Crucially, WinnowRAG is model-agnostic and does not require any model fine-tuning, making it easily adaptable to various tasks. Extensive experiments on various realistic datasets demonstrate the effectiveness of WinnowRAG over state-of-the-art baselines.

[9] Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Andrew M. Bean,Ryan Othniel Kearns,Angelika Romanou,Franziska Sofia Hafner,Harry Mayne,Jan Batzner,Negar Foroutan,Chris Schmitz,Karolina Korgul,Hunar Batra,Oishi Deb,Emma Beharry,Cornelius Emde,Thomas Foster,Anna Gausen,María Grandury,Simeng Han,Valentin Hofmann,Lujain Ibrahim,Hazel Kim,Hannah Rose Kirk,Fangru Lin,Gabrielle Kaili-May Liu,Lennart Luettgau,Jabez Magomere,Jonathan Rystrøm,Anna Sotnikova,Yushi Yang,Yilun Zhao,Adel Bibi,Antoine Bosselut,Ronald Clark,Arman Cohan,Jakob Foerster,Yarin Gal,Scott A. Hale,Inioluwa Deborah Raji,Christopher Summerfield,Philip H. S. Torr,Cozmin Ududec,Luc Rocher,Adam Mahdi

Main category: cs.CL

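TL;DR: A systematic review of 445 LLM benchmarks finds shortcomings in the construct validity of existing evaluations and offers eight recommendations for improvement.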

Details

Motivation: Accurately assessing the capabilities, safety, and robustness of LLMs requires measures with strong construct validity, yet current benchmarks fall short on validity. Method: With a team of 29 expert reviewers, conducts a systematic review of 445 LLM benchmarks from leading natural language processing and machine learning conferences, analyzing the validity of the measured phenomena, tasks, and scoring metrics. Result: Identifies patterns in the measured phenomena, tasks, and scoring metrics that undermine the validity of the resulting claims. Conclusion: Provides eight key recommendations and detailed, actionable guidance for researchers and practitioners developing LLM benchmarks. Abstract: Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.

[10] POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

Tingyue Yang,Junchi Yao,Yuhui Guo,Chang Liu

Main category: cs.CL

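TL;DR: Introduces POLIS-Bench, the first systematic evaluation suite for LLMs in governmental bilingual policy scenarios, featuring an up-to-date bilingual corpus, scenario-grounded task design, and a dual-metric evaluation framework. A large-scale evaluation shows the advantage of reasoning models, and lightweight open-source POLIS series models fine-tuned on the benchmark match or surpass proprietary models.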

Details

Motivation: Existing benchmarks offer limited evaluation of governmental bilingual policy scenarios: they lack up-to-date data, tasks grounded in real scenarios, and comprehensive metrics, which makes it hard to measure how LLMs perform in actual government applications. Method: Constructs a large-scale, up-to-date bilingual policy corpus; designs three scenario-grounded tasks (Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment); proposes a dual-metric evaluation framework combining semantic similarity with accuracy rate; evaluates more than 10 state-of-the-art LLMs at scale; and fine-tunes a lightweight open-source model on the benchmark. Result: The evaluation reveals a clear performance hierarchy in which reasoning models maintain superior cross-task stability and accuracy, and it highlights the difficulty of compliance tasks. The lightweight POLIS series models fine-tuned on POLIS-Bench match or surpass strong proprietary baselines on multiple policy subtasks. Conclusion: POLIS-Bench provides a more comprehensive and rigorous evaluation standard for governmental bilingual policy scenarios. It reveals performance differences and open challenges among current models, and shows that targeted fine-tuning is a feasible path to cost-effective, compliant deployment. Abstract: We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy where reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.

[11] GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

Hari Mohan Pandey,Anshul Gupta,Subham Sarkar,Minakshi Tomer,Schneider Johannes,Yan Gong

Main category: cs.CL

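TL;DR: Proposes GEMMA-SQL, a lightweight and efficient text-to-SQL model built on the Gemma 2B architecture. With resource-efficient fine-tuning and multiple prompting strategies, it outperforms several strong baselines on the SPIDER benchmark.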

Details

Motivation: To provide a natural language interface for interacting with databases without specialized knowledge, and to address the high resource consumption and deployment cost of existing large models. Method: Builds on the open-source Gemma 2B architecture, applies resource-efficient, iterative fine-tuning, and combines multiple prompting strategies, including few-shot learning, with instruction tuning. Result: GEMMA-SQL Instruct achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy on the SPIDER benchmark, outperforming models such as IRNet, RYANSQL, and CodeXDavinci. Conclusion: Effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability, making GEMMA-SQL a practical, open text-to-SQL solution. Abstract: Text-to-SQL systems enable users to interact with structured databases using natural language, eliminating the need for specialized programming knowledge. In this work, we introduce GEMMA-SQL, a lightweight and efficient text-to-SQL model built upon the open-source Gemma 2B architecture. Unlike many large language models (LLMs), GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Leveraging the SPIDER benchmark for training and evaluation, GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to enhance SQL query generation accuracy. The instruction-tuned variant, GEMMA-SQL Instruct, achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming several state-of-the-art baselines such as IRNet, RYANSQL, and CodeXDavinci. The proposed approach demonstrates that effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability. These results position GEMMA-SQL as a practical, open-source alternative for robust and accessible text-to-SQL systems.

[12] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

Dmytro Vitel,Anshuman Chhabra

Main category: cs.CL

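TL;DR: Studies training-sample influence estimation in large language models (LLMs) and challenges the conclusion of Yeh et al. (2022) that the first (embedding) layers are best suited for computing influence. The authors present theoretical and empirical evidence that the so-called cancellation effect is unreliable and that middle attention layers are better influence estimators. They also examine how to aggregate influence scores across layers, showing that ranking and vote-based methods outperform standard averaging. Finally, they propose ways to evaluate influence score efficacy without retraining the model and introduce the Noise Detection Rate (NDR) metric, which shows stronger predictive capability than the cancellation effect. Experiments across LLMs of varying types and scales show that the first layers are not necessarily better than the last.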

Details

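Motivation: Accurately estimating how training samples influence LLM decisions is essential for interpreting model behavior and auditing data. Because of the sheer number of parameters, however, existing methods usually compute influence on only a subset of layers and rely on assumptions that may be unreliable, such as the cancellation effect. More reliable ways to identify the key layers and to estimate and evaluate influence are needed. Method: Shows through theoretical analysis that the cancellation effect is unreliable, and runs experiments on LLMs of varying types and scales to compare how different layers (embedding layers, middle attention layers, and final layers) perform for influence estimation. Also explores several cross-layer aggregation strategies (such as ranking and voting) and proposes a new evaluation metric, the Noise Detection Rate (NDR), to assess influence score quality without expensive model retraining. Result: Middle attention layers estimate training-sample influence more accurately than the first or last layers; standard averaging is outperformed by ranking and vote-based alternatives; and NDR shows stronger predictive capability than the cancellation effect, giving a more reliable assessment of influence score efficacy. Conclusion: The first layer should not be the default choice for influence estimation. Attention should instead go to middle attention layers, together with better aggregation and evaluation strategies, which improves the accuracy and trustworthiness of influence estimation in LLMs. Abstract: Identifying how training samples influence/impact Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, owing to the large model sizes of today consisting of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we propose theoretical and empirical evidence demonstrating how the cancellation effect is unreliable, and that middle attention layers are better estimators for influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance.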
Finally, we propose better methods for evaluating influence score efficacy in LLMs without undertaking model retraining, and propose a new metric known as the Noise Detection Rate (NDR) that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.
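
To illustrate why rank-based or vote-based aggregation across layers can behave differently from plain averaging, here is a small Python sketch. The abstract does not specify the paper's aggregation rules, so the mean-rank and top-k voting schemes below, the function names, and the toy scores are all assumptions made for illustration.

```python
def mean_score(layer_scores):
    """Standard averaging: mean influence score per training sample."""
    n = len(layer_scores[0])
    return [sum(s[i] for s in layer_scores) / len(layer_scores) for i in range(n)]


def ranks_desc(scores):
    """Rank samples within one layer; rank 0 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def mean_rank(layer_scores):
    """Rank-based aggregation: average per-layer rank (lower is more influential)."""
    per_layer = [ranks_desc(s) for s in layer_scores]
    n = len(layer_scores[0])
    return [sum(r[i] for r in per_layer) / len(per_layer) for i in range(n)]


def top_k_votes(layer_scores, k):
    """Vote-based aggregation: each layer votes for its k highest-scoring samples."""
    n = len(layer_scores[0])
    votes = [0] * n
    for s in layer_scores:
        for i in sorted(range(n), key=lambda i: -s[i])[:k]:
            votes[i] += 1
    return votes


# Toy influence scores: one list per layer, one entry per training sample.
# The third layer has a much larger scale and a single outlier.
LAYERS = [
    [0.9, 0.5, 0.1],
    [0.8, 0.6, 0.2],
    [0.3, 50.0, 0.1],
]
```

On these toy scores, `mean_score` ranks sample 1 first because one layer's large-scale outlier dominates the average, while `mean_rank` and `top_k_votes(LAYERS, 1)` both favor sample 0, which two of the three layers rank highest. Rank and vote schemes discard per-layer scale, which is one plausible reason they could be more robust than averaging.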

[13] Learning to reason about rare diseases through retrieval-augmented agents

Ha Young Kim,Jun Li,Ana Beatriz Solana,Carolin M. Pirkl,Benedikt Wiestler,Julia A. Schnabel,Cosmin I. Bercea

Main category: cs.CL

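TL;DR: Proposes RADAR, a retrieval-augmented diagnostic reasoning agent system for rare disease detection in brain MRI. The system integrates case reports and literature using sentence transformers and a FAISS index, improving a model's recognition of rare diseases without additional training.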

Details

Motivation: AI models often perform poorly on rare diseases in medical imaging because representative data is scarce. Inspired by radiologists' habit of consulting case reports and literature, the authors aim to build a system that uses external knowledge for diagnostic reasoning. Method: Uses AI agents that embed case reports and literature with sentence transformers and index them with FAISS for efficient similarity search. At inference time the agent retrieves relevant clinical evidence to guide diagnostic decisions, and it can be integrated with a variety of large language models. Result: On the NOVA dataset, which comprises 280 distinct rare diseases, RADAR achieves up to a 10.2% performance gain, with the strongest improvements for open-source models such as DeepSeek, and it provides literature-grounded, interpretable explanations. Conclusion: Retrieval-augmented reasoning is an effective paradigm for low-prevalence conditions in medical imaging. As a model-agnostic reasoning module, RADAR improves both the accuracy and the interpretability of rare pathology recognition. Abstract: Rare diseases represent the long tail of medical imaging, where AI models often fail due to the scarcity of representative training data. In clinical workflows, radiologists frequently consult case reports and literature when confronted with unfamiliar findings. Following this line of reasoning, we introduce RADAR, Retrieval Augmented Diagnostic Reasoning Agents, an agentic system for rare disease detection in brain MRI. Our approach uses AI agents with access to external medical knowledge by embedding both case reports and literature using sentence transformers and indexing them with FAISS to enable efficient similarity search. The agent retrieves clinically relevant evidence to guide diagnostic decision making on unseen diseases, without the need for additional training. Designed as a model-agnostic reasoning module, RADAR can be seamlessly integrated with diverse large language models, consistently improving their rare pathology recognition and interpretability. On the NOVA dataset comprising 280 distinct rare diseases, RADAR achieves up to a 10.2% performance gain, with the strongest improvements observed for open-source models such as DeepSeek. Beyond accuracy, the retrieved examples provide interpretable, literature-grounded explanations, highlighting retrieval-augmented reasoning as a powerful paradigm for low-prevalence conditions in medical imaging.

[14] Surprisal reveals diversity gaps in image captioning and different scorers change the story

Nikolai Ilinykh,Simon Dobnik

Main category: cs.CL

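TL;DR: Proposes surprisal variance as a diversity measure for image captioning and finds that the relationship between human and model caption diversity reverses depending on which language model is used as the scorer, so diversity evaluation should combine several scorers to keep conclusions robust.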

Details

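Motivation: To measure linguistic diversity in image captioning more accurately; existing approaches may be biased, so a new and more reliable diversity metric is needed. Method: Proposes surprisal variance as a diversity measure, defined as the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, compares five state-of-the-art vision-and-language LLMs with human captions, scoring the captions with both a caption-trained n-gram language model and a general-language model. Result: Under the caption-trained n-gram language model, human captions show roughly twice the surprisal variance of model captions; rescoring with a general-language model reverses the pattern. A single scorer can therefore lead to the opposite conclusion. Conclusion: Diversity evaluation for image captioning should not rely on a single language model; surprisal should be reported under several scorers to make the evaluation reliable. Abstract: We quantify linguistic diversity in image captioning with surprisal variance - the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; thus, robust diversity evaluation must report surprisal under several scorers.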
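
As a rough illustration of the metric described in this abstract, the Python sketch below pools the token-level surprisals of a caption set and takes their variance. The scorer here is a toy add-one-smoothed unigram model standing in for the paper's n-gram and general-language scorers; the whitespace tokenization, the function names, and the choice of population variance in bits are assumptions.

```python
import math
from collections import Counter
from statistics import pvariance


def train_unigram(corpus):
    """Return a toy scorer: add-one-smoothed unigram probability of a token."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def surprisal_variance(captions, prob):
    """Variance of token-level surprisal (-log2 p) pooled over a caption set."""
    surprisals = [-math.log2(prob(tok)) for cap in captions for tok in cap.split()]
    return pvariance(surprisals)
```

Passing a different `prob` function rescoring the same captions changes the result: a scorer that assigns every token the same probability yields zero variance for any caption set. That dependence on the scorer is the effect the abstract warns about, shown here in a deliberately extreme form.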

[15] Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

Chenxi Liu,Junjie Liang,Yuqi Jia,Bochuan Cao,Yang Bai,Heng Huang,Xun Chen

Main category: cs.CL

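TL;DR: Proposes the ERPO framework, which encourages exploration on residual prompts to reactivate their training signals and improve the reasoning ability of large language models.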

Details

Motivation: As models train longer and scale larger, more training prompts become residual prompts (prompts whose rewards have zero variance), which reduces the effective training signal and the diversity of training. Method: Proposes the Explore Residual Prompts in Policy Optimization (ERPO) framework, which maintains a history for each prompt and adaptively raises the sampling temperature for residual prompts that previously produced all-correct responses, encouraging the model to generate more diverse reasoning traces. Result: Experiments on the Qwen2.5 series show that ERPO consistently outperforms strong baselines across multiple mathematical reasoning benchmarks. Conclusion: ERPO makes effective use of residual prompts by reviving their training signals, improving training effectiveness and reasoning performance of LLMs under RLVR. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts, those with zero-variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
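
The per-prompt history tracker with adaptive temperature described in this abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the class name, the linear temperature schedule, the cap, the reset rule, and the convention that a reward of 1 means a correct response are all assumptions.

```python
class ResidualPromptTracker:
    """Tracks, per prompt, how often all sampled responses were correct.

    Hypothetical schedule: temperature rises linearly with the number of
    consecutive all-correct rollouts and is capped at max_temp.
    """

    def __init__(self, base_temp=1.0, step=0.1, max_temp=1.5):
        self.base_temp = base_temp
        self.step = step
        self.max_temp = max_temp
        self.streak = {}  # prompt id -> consecutive all-correct rollouts

    def update(self, prompt_id, rewards):
        """Record one rollout group; rewards are 1 (correct) or 0 (incorrect)."""
        if rewards and all(r == 1 for r in rewards):
            # Zero reward variance with every response correct: no training signal.
            self.streak[prompt_id] = self.streak.get(prompt_id, 0) + 1
        else:
            # Assumed rule: any incorrect response resets the temperature.
            self.streak[prompt_id] = 0

    def temperature(self, prompt_id):
        """Sampling temperature to use for this prompt's next rollout."""
        raised = self.base_temp + self.step * self.streak.get(prompt_id, 0)
        return min(raised, self.max_temp)
```

A prompt that keeps producing all-correct groups is sampled at a progressively higher temperature until some response is wrong, at which point the group's rewards have nonzero variance again and contribute a training signal. In this sketch the temperature then returns to its base value; how ERPO itself schedules and resets the temperature is not stated in the abstract.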

[16] Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

Preetum Nakkiran,Arwen Bradley,Adam Goliński,Eugene Ndiaye,Michael Kirchhof,Sinead Williamson

Main category: cs.CL

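TL;DR: Studies the semantic calibration of large language models (LLMs) when generating answers. Base LLMs are found to be well calibrated semantically on open-domain question answering, whereas RL instruction-tuning and chain-of-thought reasoning break this calibration. The paper offers a theoretical explanation grounded in the next-token prediction mechanism.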

Details

Motivation: LLMs often lack meaningful confidence estimates for their outputs. Although they exhibit calibration at the token level, it is unclear whether they have a comparable ability at the semantic level. The paper therefore asks whether LLMs can assess confidence in the meaning of their responses. Method: Proposes a sampling-based definition of semantic calibration and introduces the notion of "B-calibration". Drawing on a recent theoretical connection between calibration and local loss optimality, it analyzes how semantic calibration arises as a byproduct of next-token prediction and validates the theoretical predictions experimentally. Result: Base LLMs are semantically calibrated across question-answering tasks; RL instruction-tuning systematically breaks this calibration; chain-of-thought reasoning also breaks calibration; the theoretical predictions are confirmed. Conclusion: The paper gives what the authors describe as the first principled explanation of when and why semantic calibration emerges in LLMs, showing it to be an implicit product of the next-token prediction training objective. Abstract: Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.

[17] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

Matthew Bozoukov,Matthew Nguyen,Shubkarman Singh,Bart Bussmann,Patrick Leask

Main category: cs.CL

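TL;DR: Shows that behavioral self-awareness can emerge under minimal conditions when instruction-tuned LLMs are fine-tuned with low-rank adapters (LoRA). It appears as a domain-specific, linearly separable feature whose main behavioral effect can be captured by a single steering vector.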

Details

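Motivation: Investigate the minimal conditions and mechanisms under which behavioral self-awareness (the ability to accurately describe one's own behavior) emerges in LLMs, in order to address potential safety risks such as models concealing their true abilities during evaluation. Method: Runs controlled fine-tuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), analyzing the conditions that induce self-awareness and how it is represented in activation space. Result: 1) A single rank-1 LoRA adapter is enough to reliably induce self-awareness; 2) the learned self-aware behavior is largely captured by a single steering vector in activation space, reproducing nearly all of the behavioral change; 3) self-awareness is non-universal and task-local, with independent representations across tasks. Conclusion: Behavioral self-awareness is a task-specific linear feature that is easily induced and modulated. Its emergence does not depend on complex structure, which suggests the capability may exist in models through a simple mechanism. Abstract: Recent studies have revealed that LLMs can exhibit behavioral self-awareness: the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled finetuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune's behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.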
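
Why a rank-1 adapter and a single steering vector are closely related can be seen from the algebra alone: a rank-1 update `b aᵀ` to a weight matrix changes the layer output by `(a·x) b`, which always points along the fixed direction `b`. The sketch below checks this on a random toy layer. The dimensions and values are arbitrary, and nothing here reproduces the paper's experiments.

```python
import random

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def outer(col, row):
    return [[c * r for r in row] for c in col]

def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

random.seed(0)
d_in, d_out = 4, 3
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [random.gauss(0, 1) for _ in range(d_in)]   # rank-1 "A" factor
b = [random.gauss(0, 1) for _ in range(d_out)]  # rank-1 "B" factor
x = [random.gauss(0, 1) for _ in range(d_in)]   # one input activation

base_out = matvec(W, x)
lora_out = matvec(add(W, outer(b, a)), x)

# Equivalent view: add the fixed direction b, scaled by the scalar a.x
scale = sum(ai * xi for ai, xi in zip(a, x))
steered_out = [y + scale * bi for y, bi in zip(base_out, b)]
```

The only difference from a constant steering vector is that the scale `a·x` depends on the input; if that scalar varies little across inputs, a fixed multiple of `b` would approximate the adapter's effect.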

[18] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Jaehoon Lee,Sohyun Kim,Wanggeun Park,Geon Lee,Seungkyung Kim,Minyoung Lee

Main category: cs.CL

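TL;DR: This paper introduces SDS KoPub VDR, the first large-scale visual document retrieval benchmark for Korean public documents. It comprises 361 real-world documents (40,781 pages) and 600 human-verified query triples, supports text-only and multimodal retrieval tasks, and exposes the weaknesses of existing models in cross-modal reasoning.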

Details

Motivation: Existing visual document retrieval benchmarks overlook non-English languages and the structural complexity of official publications, and evaluation resources for Korean and for documents with complex layouts are lacking. Method: Builds a large-scale benchmark from 361 real-world Korean public documents, generates 600 query-page-answer triples with multimodal models, and verifies them through human review; designs a dual-task evaluation framework covering text-only and multimodal retrieval. Result: Establishes a Korean document benchmark of 40,781 pages covering complex elements such as tables, charts, and multi-column layouts; experiments show that current state-of-the-art models have substantial performance gaps on tasks requiring cross-modal reasoning. Conclusion: SDS KoPub VDR provides a reliable benchmark for Korean document understanding and multimodal retrieval, exposes the limitations of existing models in complex real-world settings, and gives a clear direction for multimodal AI in document intelligence. Abstract: Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this critical gap, we introduce SDS KoPub VDR, the first large-scale, publicly available benchmark for retrieving and understanding Korean public documents. The benchmark is built upon a corpus of 361 real-world documents (40,781 pages), including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a challenging and reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent a rigorous human verification and refinement process to ensure factual accuracy and contextual relevance. The queries span six major public domains and are systematically categorized by the reasoning modality required: text-based, visual-based (e.g., chart interpretation), and cross-modal. We evaluate SDS KoPub VDR on two complementary tasks that reflect distinct retrieval paradigms: (1) text-only retrieval, which measures a model's ability to locate relevant document pages based solely on textual signals, and (2) multimodal retrieval, which assesses retrieval performance when visual features (e.g., tables, charts, and layouts) are jointly leveraged alongside text. This dual-task evaluation reveals substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models.
As a foundational resource, SDS KoPub VDR not only enables rigorous and fine-grained evaluation across textual and multimodal retrieval tasks but also provides a clear roadmap for advancing multimodal AI in complex, real-world document intelligence.

[19] BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models

Chandra Vamsi Krishna Alla,Harish Naidu Gaddam,Manohar Kommi

Main category: cs.CL

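TL;DR: BudgetMem is a memory-augmented architecture that uses a selective memory mechanism and feature-based salience scoring to process long contexts efficiently under strict memory limits, substantially reducing memory use while largely preserving performance.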

Details

Motivation: Large language models face compute and memory bottlenecks when processing long contexts, and existing methods incur excessive resource overhead when extending the context window, which makes deployment in resource-constrained environments difficult. Method: Proposes BudgetMem, which combines selective memory policies with salience scoring based on features such as entity density, TF-IDF, discourse markers, and position bias, and uses learned gating mechanisms together with BM25 sparse retrieval to decide which information is worth storing under a limited memory budget. Result: In experiments on 700 question-answer pairs with Llama-3.2-3B-Instruct, long documents lose only 1.0% F1 while saving 72.4% memory, and the benefit grows with document length. Conclusion: BudgetMem offers a practical route to deploying efficient long-context language models on low-resource hardware, helping broaden access to advanced language understanding. Abstract: Large Language Models (LLMs) face significant computational and memory constraints when processing long contexts, despite growing demand for applications requiring reasoning over extensive documents, multi-session dialogues, and book length texts. While recent advances have extended context windows to 100K-1M tokens, such approaches incur prohibitive costs for resource constrained deployments. We propose BudgetMem, a novel memory augmented architecture that learns what to remember rather than remembering everything. Our system combines selective memory policies with feature based salience scoring (entity density, TF-IDF, discourse markers, position bias) to decide which information merits storage under strict budget constraints. Unlike existing retrieval augmented generation (RAG) systems that store all chunks, BudgetMem employs learned gating mechanisms coupled with BM25 sparse retrieval for efficient information access. Through comprehensive experiments on 700 question answer pairs across short (237 tokens) and long (5K-10K tokens) documents with Llama-3.2-3B-Instruct, we demonstrate that BudgetMem achieves remarkable results on long documents: only 1.0% F1 score degradation while saving 72.4% memory compared to baseline RAG. We validate our approach through budget sensitivity analysis (testing 7 budget ratios), naive baseline comparisons, and document length analysis, showing that BudgetMem's benefits increase with document length. Our work provides a practical pathway for deploying capable long context systems on modest hardware, democratizing access to advanced language understanding capabilities.
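
The idea of storing only the most salient chunks under a budget can be sketched as follows. This is a hand-weighted toy, not BudgetMem itself: the abstract describes learned gating, whereas the weights, the marker list, and the capitalization-based entity proxy below are all hypothetical, and the TF-IDF feature is omitted for brevity.

```python
import re

# Hypothetical marker list; the paper does not specify one.
DISCOURSE_MARKERS = ("however", "therefore", "importantly", "in summary")

def salience(chunk, index, n_chunks):
    """Fixed-weight salience score from three crude features."""
    words = re.findall(r"[A-Za-z']+", chunk)
    if not words:
        return 0.0
    # Entity density proxy: capitalized words after the first word.
    entity_density = sum(1 for w in words[1:] if w[0].isupper()) / len(words)
    has_marker = 1.0 if any(m in chunk.lower() for m in DISCOURSE_MARKERS) else 0.0
    # Position bias: earlier chunks score higher.
    position = 1.0 - index / max(n_chunks - 1, 1)
    return 0.5 * entity_density + 0.3 * has_marker + 0.2 * position

def select_under_budget(chunks, budget_ratio):
    """Keep the top-scoring fraction of chunks, in document order."""
    keep = max(1, int(len(chunks) * budget_ratio))
    ranked = sorted(range(len(chunks)),
                    key=lambda i: salience(chunks[i], i, len(chunks)),
                    reverse=True)
    return [chunks[i] for i in sorted(ranked[:keep])]

chunks = [
    "the cat sat.",
    "However, Alice met Bob in Paris.",
    "it was fine.",
    "nothing else happened here.",
]
memory = select_under_budget(chunks, budget_ratio=0.25)
```

Only the retained chunks would then be indexed for retrieval (BM25 in the paper), which is where the memory saving comes from.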

[20] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Yu Li,Lehui Li,Qingmin Liao,Fengli Xu,Yong Li

Main category: cs.CL

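TL;DR: Proposes a collective-perception-enhanced retrieval framework for recommending baselines and datasets in research. Through automated data collection, an enhanced embedding model, and reasoning-augmented reranking, it markedly improves the accuracy and interpretability of automated experiment design.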

Details

Motivation: Existing methods fall short in data coverage and recommendation relevance: they rely mainly on public resources and miss datasets actually used in papers, and they over-rely on content similarity while neglecting experimental suitability. Method: 1) Builds an automated data-collection pipeline that links roughly one hundred thousand papers to the baselines and datasets they actually used; 2) proposes a collective-perception-enhanced retriever that combines self-descriptions with citation-context representations and fine-tunes an embedding model for efficient recall; 3) develops a reasoning-augmented reranker that uses interaction chains to build explicit reasoning chains and fine-tunes a large language model to generate interpretable justifications and refined rankings. Result: The curated dataset covers 85% of the baselines and datasets used at top AI conferences over the past five years; the method improves Recall@20 by +5.85% and HitRate@5 by +8.30% over the strongest baseline. Conclusion: The method substantially advances the reliability and interpretability of automated experiment design and supports the use of LLM agents in scientific exploration. Abstract: Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents for facilitating scientific quest. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases models toward superficial similarity and overlooks experimental suitability. Harnessing collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings.
The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
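
For readers unfamiliar with the two reported metrics, the sketch below uses their common definitions: Recall@K is the fraction of relevant items that appear in the top K, and HitRate@K is 1 if at least one relevant item appears there. The abstract does not spell out its exact formulas, so these are assumed, and the dataset names are purely illustrative.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items found in the top-k of the ranking."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item is in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0

# Hypothetical recommendation for one paper.
ranked = ["ImageNet", "CIFAR-10", "MNIST", "COCO"]
relevant = {"MNIST", "COCO", "SQuAD"}

r_at_3 = recall_at_k(ranked, relevant, 3)    # one of three relevant items
h_at_2 = hit_rate_at_k(ranked, relevant, 2)  # no relevant item in top 2
```

Both metrics are typically averaged over all query papers in the test set.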

[21] Diagnosing and Mitigating Semantic Inconsistencies in Wikidata's Classification Hierarchy

Shixiong Zhao,Hideaki Takeda

Main category: cs.CL

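TL;DR: This study proposes a new validation method for detecting classification errors, over-generalized subclass links, and redundant connections in specific domains of Wikidata, and introduces a new evaluation criterion for deciding whether such issues warrant correction.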

Details

Motivation: Wikidata's relatively loose editorial policy has led to taxonomic inconsistencies, so an effective method is needed to identify and correct these structural problems. Method: Building on prior work, proposes and applies a novel validation method, combined with a newly designed evaluation criterion, and develops a system for inspecting the taxonomic relationships of arbitrary Wikidata entities. Result: Confirms the presence of classification errors, over-generalization, and redundant links in specific domains of Wikidata, and enables effective review of these issues through the new system. Conclusion: The method effectively reveals structural quality problems in Wikidata and uses the platform's crowdsourced nature to support continued improvement of data quality. Abstract: Wikidata is currently the largest open knowledge graph on the web, encompassing over 120 million entities. It integrates data from various domain-specific databases and imports a substantial amount of content from Wikipedia, while also allowing users to freely edit its content. This openness has positioned Wikidata as a central resource in knowledge graph research and has enabled convenient knowledge access for users worldwide. However, its relatively loose editorial policy has also led to a degree of taxonomic inconsistency. Building on prior work, this study proposes and applies a novel validation method to confirm the presence of classification errors, over-generalized subclass links, and redundant connections in specific domains of Wikidata. We further introduce a new evaluation criterion for determining whether such issues warrant correction and develop a system that allows users to inspect the taxonomic relationships of arbitrary Wikidata entities, leveraging the platform's crowdsourced nature to its full potential.

[22] LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Wei Shao,Lingchao Zheng,Pengyu Wang,Peizhen Zheng,Jun Li,Yuwei Fan

Main category: cs.CL

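TL;DR: Proposes LoPT, a lossless parallel tokenization framework that uses character-position-based matching and dynamic chunk length adjustment to resolve boundary artifacts in parallel tokenization of long text, guaranteeing results identical to sequential tokenization while significantly improving speed.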

Details

Motivation: In long-context inference, tokenization is an overlooked bottleneck, and parallel tokenization methods produce inconsistent results because of boundary artifacts introduced at merge time. A method that is both fast and consistent is needed. Method: Proposes the LoPT framework, which uses a character-position-based matching strategy and a dynamic chunk length adjustment mechanism to align and merge tokenized segments precisely, ensuring output identical to standard sequential tokenization. Result: Experiments on a variety of long-text datasets show that LoPT significantly accelerates processing while guaranteeing lossless tokenization; consistency is proven theoretically and the robustness of the method is validated. Conclusion: LoPT effectively resolves the boundary problem in parallel tokenization and achieves efficient, lossless tokenization of long text, providing reliable support for long-sequence inference in large models. Abstract: Long context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference through operators, model architectures, and system frameworks, tokenization remains an overlooked bottleneck. Existing parallel tokenization methods accelerate processing through text segmentation and multi-process tokenization, but they suffer from inconsistent results due to boundary artifacts that occur after merging. To address this, we propose LoPT, a novel Lossless Parallel Tokenization framework that ensures output identical to standard sequential tokenization. Our approach employs character-position-based matching and dynamic chunk length adjustment to align and merge tokenized segments accurately. Extensive experiments across diverse long-text datasets demonstrate that LoPT achieves significant speedup while guaranteeing lossless tokenization. We also provide theoretical proof of consistency and comprehensive analytical studies to validate the robustness of our method.
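
The boundary artifact mentioned above, and the general idea of merging at a shared character position, can be shown on a toy example. This sketch is not LoPT's algorithm: it assumes a greedy longest-match tokenizer over a made-up vocabulary (real tokenizers such as BPE behave differently), handles only two chunks, and falls back to sequential tokenization when no shared boundary exists, whereas the abstract describes dynamic chunk length adjustment for that case.

```python
# Toy vocabulary (hypothetical).
VOCAB = {"a", "b", "c", "d", "ab", "abc", "cd"}
MAX_LEN = max(len(t) for t in VOCAB)

def tokenize(text, offset=0):
    """Greedy longest-match; returns (token, start, end) in absolute positions."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in VOCAB:
                tokens.append((piece, offset + i, offset + i + n))
                i += n
                break
        else:
            raise ValueError(f"cannot tokenize at position {offset + i}")
    return tokens

def naive_parallel(text, split):
    """Tokenize two chunks independently and concatenate."""
    return tokenize(text[:split]) + tokenize(text[split:], offset=split)

def merge_with_overlap(text, split, overlap):
    """Tokenize overlapping chunks, then stitch at a shared token boundary."""
    left_end = min(len(text), split + overlap)
    right_start = max(0, split - overlap)
    left = tokenize(text[:left_end])
    right = tokenize(text[right_start:], offset=right_start)
    # A left token is trusted only if its match window never hit the chunk end.
    safe_ends = {end for _, start, end in left
                 if start + MAX_LEN <= left_end or left_end == len(text)}
    right_starts = {start for _, start, _ in right}
    common = safe_ends & right_starts
    if not common:
        return tokenize(text)  # fallback: sequential tokenization
    p = max(common)
    return [t for t in left if t[2] <= p] + [t for t in right if t[1] >= p]

text = "abcdabcdab"
sequential = tokenize(text)
naive = naive_parallel(text, split=6)
merged = merge_with_overlap(text, split=6, overlap=3)
```

Here the naive split cuts the second "abc" into "ab" and "cd", so the concatenated result differs from the sequential one, while the overlap-based merge recovers the sequential tokens. The trust condition relies on greedy matching depending only on the current position, which is specific to this toy tokenizer.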

[23] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Zihao Yi,Qingxuan Jiang,Ruotian Ma,Xingyu Chen,Qu Yang,Mengru Wang,Fanghua Ye,Ying Shen,Zhaopeng Tu,Xiaolong Li,Linus

Main category: cs.CL

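TL;DR: This study examines how large language models perform when role-playing morally ambiguous or villainous characters and finds that safety alignment significantly reduces role-playing fidelity, especially for traits such as deceitfulness and manipulativeness.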

Details

Motivation: Modern LLMs are trained with safety alignment and may struggle to authentically simulate non-prosocial or villainous characters, which limits their use in creative generation. This paper aims to expose that conflict. Method: Introduces the Moral RolePlay benchmark, with a four-level moral alignment scale and a balanced test set, and evaluates several advanced LLMs on role-playing tasks ranging from moral paragons to extreme villains. Result: Role-playing fidelity declines monotonically as character morality decreases; more strongly safety-aligned models perform worse at villain role-play and tend to substitute superficial aggression for nuanced malevolence. Conclusion: There is a fundamental tension between safety alignment and creative role-play, and more nuanced, context-aware alignment methods are needed to balance the two. Abstract: Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

[24] Acquiring Common Chinese Emotional Events Using Large Language Model

Ya Wang,Guangzheng Zhu,Cungen Cao,Jingjing Li,He Li,Xin Huang

Main category: cs.CL

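TL;DR: This paper proposes a method for generating and filtering high-quality common emotional events with a Chinese large language model, and builds the first large-scale Chinese commonsense knowledge base of emotional events (102,218 entries) annotated with sentiment polarity, demonstrating its potential for emotion cause extraction.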

Details

Motivation: Knowledge of emotional events matters for many applications, but context-independent, general Chinese emotional events are hard to acquire, and the lack of such resources holds back Chinese emotion understanding tasks. Method: First collects Chinese emotional event indicators and uses them to prompt a Chinese large language model to generate emotional events; then trains a filter to remove invalid results and applies different techniques to classify events as positive or negative. Result: Builds a knowledge base of 102,218 high-quality common Chinese emotional events, each labeled with sentiment polarity; intrinsic evaluation shows the method effectively acquires common emotional events, and an extrinsic use case shows its value for emotion cause extraction. Conclusion: The method efficiently builds a large-scale Chinese emotional event knowledge base, provides a valuable resource for Chinese sentiment analysis, and shows potential for practical tasks. Abstract: Knowledge about emotional events is an important kind of knowledge which has been applied to improve the effectiveness of different applications. However, emotional events cannot be easily acquired, especially common or generalized emotional events that are context-independent. The goal of this paper is to obtain common emotional events in the Chinese language such as "win a prize" and "be criticized". Our approach begins by collecting a comprehensive list of Chinese emotional event indicators. Then, we generate emotional events by prompting a Chinese large language model (LLM) using these indicators. To ensure the quality of these emotional events, we train a filter to discard invalid generated results. We also classify these emotional events as positive or negative events using different techniques. Finally, we harvest a total of 102,218 high-quality common emotional events with sentiment polarity labels, which is the only large-scale commonsense knowledge base of emotional events in the Chinese language. Intrinsic evaluation results show that the proposed method can be effectively used to acquire common Chinese emotional events. An extrinsic use case also demonstrates the strong potential of common emotional events in the field of emotion cause extraction (ECE). Related resources including emotional event indicators and emotional events will be released after the publication of this paper.

[25] Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies

Prasoon Varshney,Makesh Narsimhan Sreedhar,Liwei Jiang,Traian Rebedea,Christopher Parisien

Main category: cs.CL

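TL;DR: This paper presents PBSUITE, a dynamic evaluation suite for assessing how well large language models adhere to diverse behavioral policies in multi-turn interactions, and shows that adherence to pluralistic alignment specifications drops sharply in adversarial multi-turn conversations.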

Details

Motivation: Real-world LLM deployments must adapt to the policies, regulations, and values of different organizations, and conventional alignment methods struggle to meet this pluralistic need, so a systematic evaluation of models under diverse alignment goals is required. Method: Builds a dataset of 300 realistic behavioral policies (covering 30 industries) and a dynamic evaluation framework that tests adherence to custom behavioral policies in multi-turn adversarial conversations. Result: Leading open- and closed-source LLMs have failure rates below 4% in single-turn settings but up to 84% in multi-turn adversarial scenarios, indicating that existing alignment methods have limited effect in complex interactions. Conclusion: Current LLM alignment techniques fall short in supporting pluralistic, context-specific applications; PBSUITE provides data and a framework for future research toward more robust, context-aware alignment methods. Abstract: Large language models (LLMs) are typically aligned to a universal set of safety and usage principles intended for broad public acceptability. Yet, real-world applications of LLMs often take place within organizational ecosystems shaped by distinctive corporate policies, regulatory requirements, use cases, brand guidelines, and ethical commitments. This reality highlights the need for rigorous and comprehensive evaluation of LLMs with pluralistic alignment goals, an alignment paradigm that emphasizes adaptability to diverse user values and needs. In this work, we present PLURALISTIC BEHAVIOR SUITE (PBSUITE), a dynamic evaluation suite designed to systematically assess LLMs' capacity to adhere to pluralistic alignment specifications in multi-turn, interactive conversations. PBSUITE consists of (1) a diverse dataset of 300 realistic LLM behavioral policies, grounded in 30 industries; and (2) a dynamic evaluation framework for stress-testing model compliance with custom behavioral specifications under adversarial conditions. Using PBSUITE, we find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings (less than 4% failure rates), but their compliance weakens substantially in multi-turn adversarial interactions (up to 84% failure rates). These findings highlight that existing model alignment and safety moderation methods fall short in coherently enforcing pluralistic behavioral policies in real-world LLM interactions. Our work contributes both the dataset and analytical framework to support future research toward robust and context-aware pluralistic alignment techniques.

[26] UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

Mykyta Syromiatnikov,Victoria Ruvinskaya

Main category: cs.CL

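TL;DR: This paper introduces UA-Code-Bench, a new benchmark for evaluating large language models' code generation and competitive programming problem-solving abilities in Ukrainian, and shows that even state-of-the-art models still face major challenges in a low-resource language.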

Details

Motivation: 现有基准多集中于从英语翻译的任务或仅评估简单的语言理解，缺乏对低资源语言中模型真实能力的全面评估。 Method: 构建了一个包含500个来自Eolymp平台的问题的基准，涵盖五个难度等级，并在专用Eolymp环境中对13种主流闭源和开源模型生成的Python解决方案进行隐藏测试评估。 Result: 即使是表现最好的模型（如OpenAI o3和GPT-5）也仅能解决约一半的问题，且研究还分析了不同难度下的性能、解的唯一性以及运行时间和内存消耗等效率指标。 Conclusion: 竞争性编程基准对于评估大语言模型在低资源语言中的能力具有重要价值，该工作为多语言代码生成和增强推理模型的研究提供了基础。 Abstract: Evaluating the real capabilities of large language models in low-resource languages still represents a challenge, as many existing benchmarks focus on widespread tasks translated from English or evaluate only simple language understanding. This paper introduces UA-Code-Bench, a new open-source benchmark established for a thorough evaluation of language models' code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. A diverse set of 13 leading proprietary and open-source models, generating Python solutions based on a one-shot prompt, was evaluated via the dedicated Eolymp environment against hidden tests, ensuring code correctness. The obtained results reveal that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, highlighting the challenge of code generation in low-resource natural language. Furthermore, this research presents a comprehensive analysis of performance across various difficulty levels, as well as an assessment of solution uniqueness and computational efficiency, measured by both elapsed time and memory consumption of the generated solutions. In conclusion, this work demonstrates the value of competitive programming benchmarks in evaluating large language models, especially in underrepresented languages. It also paves the way for future research on multilingual code generation and reasoning-enhanced models. 
The benchmark, data parsing, preparation, code generation, and evaluation scripts are available at https://huggingface.co/datasets/NLPForUA/ua-code-bench.

[27] Order-Level Attention Similarity Across Language Models: A Latent Commonality

Jinglin Liang,Jin Zhong,Shuangping Huang,Yunqing Hu,Huiyuan Zhang,Huifang Li,Lixin Fan,Hanlin Gu

Main category: cs.CL

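TL;DR: This paper introduces Order-Level Attention (OLA), shows that OLA at the same order is strikingly similar across different language models, and finds an implicit mapping between OLA and syntactic knowledge. Building on this, the authors propose the Transferable OLA Adapter (TOA), a training-free cross-model adaptation method that treats OLA as a unified syntactic feature representation so that an adapter generalizes to unseen models.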

Details

Motivation: 现有研究多关注单个语言模型或注意力头的上下文聚合模式，缺乏跨多个语言模型的系统性分析。本文旨在探索不同语言模型间上下文聚合模式的共性，以加深对语言模型的理解并促进跨模型知识迁移。 Method: 引入基于Attention Rollout的逐阶分解得到的Order-Level Attention (OLA)，分析多个语言模型间的OLA相似性，并探究其与句法知识的关系；在此基础上设计无需训练的跨模型适配器TOA，利用OLA作为输入进行统一建模。 Result: 实验表明，不同语言模型在相同阶数的OLA上表现出高度相似性，且OLA与句法结构存在隐式对应；所提出的TOA方法在无需参数更新的情况下，在未见过的语言模型上仍能有效提升性能。 Conclusion: 语言模型在上下文聚合过程中存在共性的Order-Level Attention模式，该模式可作为统一的句法特征用于构建无需训练的跨模型适配器，为模型泛化和知识迁移提供了新途径。 Abstract: In this paper, we explore an important yet previously neglected question: Do context aggregation patterns across Language Models (LMs) share commonalities? While some works have investigated context aggregation or attention weights in LMs, they typically focus on individual models or attention heads, lacking a systematic analysis across multiple LMs to explore their commonalities. In contrast, we focus on the commonalities among LMs, which can deepen our understanding of LMs and even facilitate cross-model knowledge transfer. In this work, we introduce the Order-Level Attention (OLA) derived from the order-wise decomposition of Attention Rollout and reveal that the OLA at the same order across LMs exhibits significant similarities. Furthermore, we discover an implicit mapping between OLA and syntactic knowledge. Based on these two findings, we propose the Transferable OLA Adapter (TOA), a training-free cross-LM adapter transfer method. Specifically, we treat the OLA as a unified syntactic feature representation and train an adapter that takes OLA as input. Due to the similarities in OLA across LMs, the adapter generalizes to unseen LMs without requiring any parameter updates. Extensive experiments demonstrate that TOA's cross-LM generalization effectively enhances the performance of unseen LMs. Code is available at https://github.com/jinglin-liang/OLAS.

Manan Sharma,Arya Suneesh,Manish Jain,Pawan Kumar Rajpoot,Prasanna Devadiga,Bharatdeep Hazarika,Ashish Shrivastava,Kishan Gurumurthy,Anshuman B Suresh,Aditya U Baliga

Main category: cs.CL

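TL;DR: Proposes a multilingual claim normalization method based on decomposing posts with Who, What, Where, When, Why, and How questions. Trained on English data only, it transfers across languages and generalizes well across 20 languages.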

Details

Motivation: Address claim normalization for multilingual misinformation detection on social media: turning noisy, informal posts into clear, verifiable statements to support cross-lingual fact-checking. Method: Fine-tunes Qwen3-14B with LoRA, combined with intra-post deduplication and token-level recall filtering for semantic alignment, and adds retrieval-augmented few-shot learning with contextual examples at inference time. Posts are systematically decomposed along six elements (Who, What, Where, When, Why, How) to enable cross-lingual transfer. Result: A 41.3% relative improvement in METEOR over baseline configurations; an English score of 41.16; third place for English and fourth place for Dutch and Punjabi; good cross-lingual generalization for Romance and Germanic languages, but weaker performance on Marathi (15.21). Conclusion: The method achieves effective multilingual claim normalization without labeled target-language data, supporting the potential of question-decomposition-based structured modeling for cross-lingual misinformation detection. Abstract: We address claim normalization for multilingual misinformation detection: transforming noisy social media posts into clear, verifiable statements across 20 languages. The key contribution demonstrates how systematic decomposition of posts using Who, What, Where, When, Why and How questions enables robust cross-lingual transfer despite training exclusively on English data. Our methodology incorporates finetuning Qwen3-14B using LoRA with the provided dataset after intra-post deduplication, token-level recall filtering for semantic alignment and retrieval-augmented few-shot learning with contextual examples during inference. Our system achieves METEOR scores ranging from 41.16 (English) to 15.21 (Marathi), securing third rank on the English leaderboard and fourth rank for Dutch and Punjabi. The approach shows 41.3% relative improvement in METEOR over baseline configurations and substantial gains over existing methods. Results demonstrate effective cross-lingual generalization for Romance and Germanic languages while maintaining semantic coherence across diverse linguistic structures.
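
One plausible reading of the "token-level recall filtering" step is sketched below: keep a (post, normalized claim) training pair only if enough of the claim's tokens also occur in the post. The direction of the recall, the whitespace tokenization, and the 0.5 threshold are assumptions, since the abstract gives no details.

```python
from collections import Counter

def token_recall(post, claim):
    """Fraction of claim tokens (with multiplicity) that also occur in the post."""
    post_counts = Counter(post.lower().split())
    claim_counts = Counter(claim.lower().split())
    total = sum(claim_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(n, post_counts[tok]) for tok, n in claim_counts.items())
    return overlap / total

def filter_pairs(pairs, threshold=0.5):
    """Drop pairs whose claim is poorly grounded in the post text."""
    return [(p, c) for p, c in pairs if token_recall(p, c) >= threshold]

# Hypothetical training pairs.
pairs = [
    ("breaking the mayor resigned today omg", "the mayor resigned"),
    ("breaking the mayor resigned today omg", "aliens landed in ohio"),
]
kept = filter_pairs(pairs)
```

A filter of this kind removes training targets that drift away from the post's content, at the cost of also penalizing legitimate paraphrases.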

[29] On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class

P. Bilha Githinji,Aikaterini Meilliou,Peiwu Qin

Main category: cs.CL

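TL;DR: This study evaluates two general-purpose large language models (Mistral 24B and QWen2.5 32B) on biomedical text simplification and finds that the instruction-tuned Mistral strikes a better balance between readability gains and discourse fidelity.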

Details

Motivation: 公众对生物医学信息的需求增加，亟需可扩展的自动化文本简化方案，以将复杂科学内容转化为通俗语言。 Method: 采用对比分析方法，评估Mistral 24B和QWen2.5 32B在可读性、话语保真度等多指标上的表现，并进行21项指标的相关性分析以获取机制洞察。 Result: Mistral 24B在SARI得分（均值42.46）和BERTScore（0.91）上优于QWen（BERTScore 0.89），显示出更好的词汇简化策略和话语保留能力；同时发现五种可读性指数存在功能冗余。 Conclusion: 指令调优架构可能更具优势，Mistral 24B是当前文本简化的有力候选模型，且词汇支持是领域适配的关键挑战。 Abstract: The increasing health-seeking behavior and digital consumption of biomedical information by the general public necessitate scalable solutions for automatically adapting complex scientific and technical documents into plain language. Automatic text simplification solutions, including advanced large language models, however, continue to face challenges in reliably arbitrating the tension between optimizing readability performance and ensuring preservation of discourse fidelity. This report empirically assesses the performance of two major classes of general-purpose LLMs, demonstrating their linguistic capabilities and foundational readiness for the task compared to a human benchmark. Using a comparative analysis of the instruction-tuned Mistral 24B and the reasoning-augmented QWen2.5 32B, we identify a potential architectural advantage in the instruction-tuned LLM. Mistral exhibits a tempered lexical simplification strategy that enhances readability across a suite of metrics and the simplification-specific formula SARI (mean 42.46), while preserving human-level discourse with a BERTScore of 0.91. QWen also attains enhanced readability performance, but its operational strategy shows a disconnect in balancing between readability and accuracy, reaching a statistically significantly lower BERTScore of 0.89. Additionally, a comprehensive correlation analysis of 21 metrics spanning readability, discourse fidelity, content safety, and underlying distributional measures for mechanistic insights, confirms strong functional redundancies among five readability indices. 
This empirical evidence tracks baseline performance of the evolving LLMs for the task of text simplification, identifies the instruction-tuned Mistral 24B for simplification, provides necessary heuristics for metric selection, and points to lexical support as a primary domain-adaptation issue for simplification.

[30] Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

Grigory Kovalev,Mikhail Tikhomirov

Main category: cs.CL

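TL;DR: Proposes an improved iterative distillation method based on ShortGPT that evaluates and removes the less important Transformer layers and then fine-tunes with a joint loss combining KL divergence and mean squared error. It compresses Qwen2.5-3B from 36 layers to 28 or even 24, with quality losses of only 9.7% and 18% respectively, while markedly reducing parameter count, supporting both the redundancy of middle layers and the effectiveness of the compression method.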

Details

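Motivation: Effective compression is needed to shrink large language models for deployment in resource-constrained environments while preserving performance. Existing distillation techniques assess layer importance insufficiently or use suboptimal training strategies, so a finer-grained iterative distillation strategy is needed to improve compression efficiency. Method: Building on ShortGPT, proposes an improved iterative distillation framework: at each step, layer importance is quantified by removing a single Transformer layer and measuring the performance drop on representative datasets; non-critical layers are pruned progressively by importance; the reduced model is then further trained with a joint loss of KL divergence and mean squared error to recover performance. Result: Experiments on Qwen2.5-3B show that the model can be compressed from 36 layers to 28 (2.47B parameters) and to 24, with quality losses of only 9.7% and 18% respectively; the middle Transformer layers contribute less to inference and show high redundancy. Conclusion: Combining iterative layer-importance evaluation with joint-loss fine-tuning compresses large language models effectively, substantially reducing parameters while largely preserving performance, and is suitable for efficient deployment in resource-limited settings. Abstract: This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.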
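
The two ingredients described above, iterative layer removal by measured importance and a joint KL plus MSE loss, can be sketched in plain Python. This is an illustration under assumptions rather than the authors' implementation: the `evaluate` callback, the weighting `alpha`, and applying the MSE to hidden states are all hypothetical choices that the abstract does not specify.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def joint_loss(teacher_logits, student_logits,
               teacher_hidden, student_hidden, alpha=0.5):
    """alpha * KL(teacher || student) + (1 - alpha) * MSE(hidden states)."""
    kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return alpha * kl + (1 - alpha) * mse(teacher_hidden, student_hidden)

def iterative_prune(layers, evaluate, n_remove):
    """Repeatedly drop the layer whose removal hurts the score least.

    `evaluate` (hypothetical) maps a list of layers to a quality score,
    higher being better; importance is re-measured after every removal.
    """
    layers = list(layers)
    for _ in range(n_remove):
        least_important = max(
            range(len(layers)),
            key=lambda i: evaluate(layers[:i] + layers[i + 1:]),
        )
        del layers[least_important]
    return layers

# Toy stand-in: each "layer" is its contribution to the score.
pruned = iterative_prune([5, 1, 3, 0.5], evaluate=sum, n_remove=2)
```

Re-evaluating importance after each removal is what distinguishes the iterative scheme from a one-shot ranking, since removing one layer can change how much the remaining layers matter.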

[31] A Toolbox for Improving Evolutionary Prompt Search

Daniel Grießhaber,Maximilian Kimmich,Johannes Maucher,Ngoc Thang Vu

Main category: cs.CL

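TL;DR: Proposes an improved evolutionary prompt optimization approach that decomposes the evolution into steps, introduces an LLM-based judge, integrates human feedback, and uses more efficient evaluation strategies, improving both optimization quality and efficiency.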

Details

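Motivation: Existing evolutionary prompt optimization methods lack robust operators and efficient evaluation mechanisms, which limits their effectiveness and applicability for optimizing LLM prompts. Method: Decomposes the evolutionary process into distinct steps, introduces an LLM-based judge to verify the evolutions, uses human feedback to refine the evolutionary operator, and designs more efficient evaluation strategies to reduce computational overhead. Result: The approach improves the quality and efficiency of prompt optimization while maintaining performance, and the code is released to support new tasks and further research. Conclusion: The improved evolutionary prompt optimization framework enhances the controllability, accuracy, and efficiency of prompt optimization and generalizes reasonably well. Abstract: Evolutionary prompt optimization has demonstrated effectiveness in refining prompts for LLMs. However, existing approaches lack robust operators and efficient evaluation mechanisms. In this work, we propose several key improvements to evolutionary prompt optimization that can partially generalize to prompt optimization in general: 1) decomposing evolution into distinct steps to enhance the evolution and its control, 2) introducing an LLM-based judge to verify the evolutions, 3) integrating human feedback to refine the evolutionary operator, and 4) developing more efficient evaluation strategies that maintain performance while reducing computational overhead. Our approach improves both optimization quality and efficiency. We release our code, enabling prompt optimization on new tasks and facilitating further research in this area.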

[32] ManufactuBERT: Efficient Continual Pretraining for Manufacturing

Robin Armingaud,Romaric Besançon

Main category: cs.CL

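TL;DR: This paper introduces ManufactuBERT, a RoBERTa model continually pretrained on a large-scale manufacturing-domain corpus. A carefully designed deduplication pipeline improves model performance and reduces training time and computational cost by 33%.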

Details

Motivation: General-purpose Transformer models perform poorly in specialized domains such as manufacturing because they lack exposure to domain terminology and semantics, so a language model tailored to the manufacturing domain is needed. Method: Builds a large manufacturing-domain corpus using a data processing pipeline with domain filtering and multi-stage deduplication, then continually pretrains RoBERTa on it. Result: ManufactuBERT sets a new state of the art on several manufacturing-related NLP tasks, and training on the deduplicated data speeds up convergence by 33%, significantly lowering training cost. Conclusion: The study demonstrates the effectiveness of domain-specific continual pretraining, and the proposed data processing pipeline offers a reproducible example for developing high-performing encoders in other specialized domains. Abstract: While large general-purpose Transformer-based encoders excel at general language understanding, their performance diminishes in specialized domains like manufacturing due to a lack of exposure to domain-specific terminology and semantics. In this paper, we address this gap by introducing ManufactuBERT, a RoBERTa model continually pretrained on a large-scale corpus curated for the manufacturing domain. We present a comprehensive data processing pipeline to create this corpus from web data, involving an initial domain-specific filtering step followed by a multi-stage deduplication process that removes redundancies. Our experiments show that ManufactuBERT establishes a new state-of-the-art on a range of manufacturing-related NLP tasks, outperforming strong specialized baselines. More importantly, we demonstrate that training on our carefully deduplicated corpus significantly accelerates convergence, leading to a 33% reduction in training time and computational cost compared to training on the non-deduplicated dataset. The proposed pipeline offers a reproducible example for developing high-performing encoders in other specialized domains. We will release our model and curated corpus at https://huggingface.co/cea-list-ia.

[33] Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

Jan-Thorsten Peter,David Vilar,Tobias Domhan,Dan Malkin,Markus Freitag

Main category: cs.CL

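TL;DR: This paper studies cross-lingual performance differences of multilingual LLMs on math tasks and finds that previously reported language gaps stem mainly from translation errors in the data and inconsistent answer extraction. With an automatic quality assurance method and improved answer extraction, the gap largely disappears; the corrected dataset is released.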

Details

Motivation: To examine whether the performance differences that current LLMs show across languages are real, and in particular what causes the differences between high- and low-resource languages in the math domain. Method: The authors analyze translation errors in the standard multilingual math benchmark MGSM, propose an automatic quality assurance method to detect and correct such errors, and recommend standardized answer extraction to make evaluation more consistent. Result: Several translation errors and non-standardized answer extraction are found to substantially distort the measured model performance; with the corrected setup, the previously reported cross-lingual gap mostly disappears. Conclusion: The previously observed cross-lingual gap is caused mainly by data quality issues rather than by differences in model capability, which underlines the importance of high-quality, standardized evaluation for multilingual models. Abstract: Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g. German, Chinese, French), as well as low-resource ones (e.g. Swahili, Telugu). In addition they have also shown impressive capabilities in different domains, like coding, science and math. In this short paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. We hope that these results influence further research into cross-lingual capability generalization for next generation LLMs. If it weren't for the fact that they are false! By analyzing one of the standard multilingual math benchmarks (MGSM), we determine that several translation errors are present in the data. Furthermore, the lack of standardized answer extraction from LLM outputs further influences the final results. We propose a method for automatic quality assurance to address the first issue at scale, and give recommendations to address the second one. Combining these two approaches we show that the aforementioned language gap mostly disappears, leading to completely different conclusions from our research. We additionally release the corrected dataset to the community.
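
To illustrate why answer extraction matters for the comparison above, here is a minimal Python sketch of one possible standardized extraction rule. It is not the authors' procedure: the function name `extract_final_number` and the rule "take the last number in the output and ignore thousands separators" are assumptions made for illustration only.

```python
import re

# Hypothetical rule (an assumption, not taken from the paper):
# the final answer is the last number that appears in the model output.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(output: str):
    """Return the last number in `output` as a float, or None if there is none."""
    matches = _NUMBER.findall(output)
    if not matches:
        return None
    # Drop thousands separators so "1,250" and "1250" compare equal.
    return float(matches[-1].replace(",", ""))
```

Applying one such rule to every language keeps formatting differences (for example "1,250" versus "1250") from being counted as wrong answers in some languages but not in others.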

[34] Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

Cong-Thanh Do,Rama Doddipatla,Kate Knill

Main category: cs.CL

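TL;DR: This paper studies how effective chain-of-thought (CoT) data is for improving the reasoning of small language models through knowledge distillation; experiments show that CoT improves the distilled models' performance on complex natural language reasoning tasks.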

Details

Motivation: To improve the reasoning capability of small language models by exploring how chain-of-thought data can transfer the reasoning capability of larger models in white-box knowledge distillation. Method: White-box knowledge distillation is carried out with LLMs from the Qwen and Llama2 families, using CoT data from the CoT-Collection dataset; the distilled models are evaluated on tasks from the BIG-Bench-Hard (BBH) benchmark. Result: Using CoT in knowledge distillation improves the average performance of the small models on the BBH natural language reasoning and understanding tasks. Conclusion: CoT effectively transfers the reasoning capability of larger models in white-box knowledge distillation and strengthens small models on complex reasoning tasks. Abstract: Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.

[35] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

Zilong Li,Jie Cao

Main category: cs.CL

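TL;DR: This paper proposes an LLM-based annotation pipeline to address the low-resource problem in translating classical Chinese into Japanese, and improves sequence tagging performance by introducing auxiliary Chinese NLP tasks.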

Details

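Motivation: Annotated translation data from classical Chinese into Japanese is scarce, so effective methods are needed to improve such translation systems in a low-resource setting. Method: The historical annotation process is abstracted as sequence tagging tasks, an LLM-based annotation pipeline is built, and a new dataset is constructed from digitized open-source translation data; auxiliary Chinese NLP tasks are introduced to improve training on the main task. Result: In the low-resource setting, auxiliary Chinese NLP tasks improve the training of the sequence tagging tasks; large language models score well on direct translation but perform poorly on character-level annotation. Conclusion: The proposed method can serve as a supplement to large language models for annotating and translating classical Chinese, especially in low-resource scenarios. Abstract: Ancient people translated classical Chinese into Japanese by annotating around each character. We abstract this process as sequence tagging tasks and fit them into modern language technologies. Research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that under the low-resource setting, introducing auxiliary Chinese NLP tasks has a promoting effect on the training of sequence tagging tasks. We also evaluate the performance of large language models. They achieve high scores in direct machine translation, but they are confused when asked to annotate characters. Our method could work as a supplement to LLMs.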

[36] Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Teqi Hao,Xioayu Tan,Shaojie Shi,Yinghui Xu,Xihe Qiu

Main category: cs.CL

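TL;DR: The paper proposes the Reflective Personalization Optimization (RPO) framework, which decouples content generation from personalization alignment: a generic response is generated first and then rewritten by an external reflection module to match user preferences. RPO significantly outperforms existing methods.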

Details

Motivation: Existing LLM personalization methods based on context injection face a trade-off between generating accurate content and preserving the user's style, which makes it hard to ensure output quality and precise control at the same time. Method: RPO has two stages: a base model first generates a high-quality generic response, and an external reflection module then explicitly rewrites it to match the user's preferences. The reflection module is first trained by supervised fine-tuning on structured rewriting trajectories and then refined with reinforcement learning to improve the quality of the personalized output. Result: Experiments on the LaMP benchmark show that RPO significantly outperforms state-of-the-art baselines, supporting explicit response shaping over implicit context injection. Conclusion: RPO provides an efficient, model-agnostic personalization layer that can be integrated with any base model, opening a new direction for user-centric generation. Abstract: The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user's preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. 
Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.

[37] Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

Shreya Gupta,Ojasva Saxena,Arghodeep Nandi,Sarah Masud,Kiran Garimella,Tanmoy Chakraborty

Main category: cs.CL

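TL;DR: This paper proposes a new fine-grained narrative frame labeling method that uses a fine-tuned BERT model to link narrative frames to specific entities in the conversation, enabling more accurate analysis of podcast narrative structure and revealing systematic relationships between topics and the way they are framed.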

Details

Motivation: The unscripted, multi-themed, and conversational nature of podcasts makes them an important resource for understanding contemporary discourse, but existing large language models struggle to capture the subtle cues humans use to identify narrative frames, so automated analysis performs poorly. Method: The authors develop and evaluate a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in the conversation, and correlate these fine-grained frame labels with high-level topics. Result: The proposed method identifies podcast narrative frames in closer agreement with human judgment than existing models and reveals systematic associations between topics and framing. Conclusion: The study offers a more robust framework for analyzing messy conversational text and improves understanding of how influence operates in digital media. Abstract: Podcasts have become a central arena for shaping public opinion, making them a vital source for understanding contemporary discourse. Their typically unscripted, multi-themed, and conversational style offers a rich but complex form of data. To analyze how podcasts persuade and inform, we must examine their narrative structures -- specifically, the narrative frames they employ. The fluid and conversational nature of podcasts presents a significant challenge for automated analysis. We show that existing large language models, typically trained on more structured text such as news articles, struggle to capture the subtle cues that human listeners rely on to identify narrative frames. As a result, current approaches fall short of accurately analyzing podcast narratives at scale. To solve this, we develop and evaluate a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in the conversation, effectively grounding the abstract frame in concrete details. Our approach then uses these granular frame labels and correlates them with high-level topics to reveal broader discourse trends. The primary contributions of this paper are: (i) a novel frame-labeling methodology that more closely aligns with human judgment for messy, conversational data, and (ii) a new analysis that uncovers the systematic relationship between what is being discussed (the topic) and how it is being presented (the frame), offering a more robust framework for studying influence in digital media.

[38] What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

Klára Bendová,Tomáš Knap,Jan Černý,Vojtěch Pour,Jaromir Savelka,Ivana Kvapilíková,Jakub Drápal

Main category: cs.CL

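TL;DR: This paper studies the feasibility of extracting descriptions of criminal behavior from publicly available Slovak court decisions, comparing regular expressions with large language models (LLMs). Both advanced regular expressions and LLMs significantly outperform the baseline, and combining them works best.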

Details

Motivation: Criminal justice administrative data record little information about the offense itself, whereas court decisions in continental European systems contain detailed descriptions of criminal behavior; effective methods are needed to extract and use them systematically. Method: Two approaches are used to extract the descriptions from verdicts: advanced regular expressions that build on "sparing" and its normalization, and prompting the Gemini Flash 2.0 model with predefined instructions. Both are compared with a baseline regular expression method. Result: The baseline identified descriptions in only 40.5% of verdicts, compared with 97% for advanced regular expressions, 98.75% for LLMs, and 99.5% for the two combined. In the human evaluation, both advanced methods matched human annotations in about 90% of cases versus 34.5% for the baseline; LLMs fully matched the human-labeled descriptions in 91.75% of cases and the combined method in 92%. Conclusion: Advanced regular expressions and large language models can extract descriptions of criminal behavior from court decisions efficiently and accurately, and combining them improves performance further, providing a practical route for criminal justice data analysis. Abstract: Criminal justice administrative data contain only a limited amount of information about the committed offense. However, there is an unused source of extensive information in continental European courts' decisions: descriptions of criminal behaviors in verdicts by which offenders are found guilty. In this paper, we study the feasibility of extracting these descriptions from publicly available court decisions from Slovakia. We use two different approaches for retrieval: regular expressions and large language models (LLMs). Our baseline was a simple method employing regular expressions to identify typical words occurring before and after the description. The advanced regular expression approach further focused on "sparing" and its normalization (insertion of spaces between individual letters), typical for delineating the description. The LLM approach involved prompting the Gemini Flash 2.0 model to extract the descriptions using predefined instructions. Although the baseline identified descriptions in only 40.5% of verdicts, both methods significantly outperformed it, achieving 97% with advanced regular expressions and 98.75% with LLMs, and 99.5% when combined. Evaluation by law students showed that both advanced methods matched human annotations in about 90% of cases, compared to just 34.5% for the baseline. LLMs fully matched human-labeled descriptions in 91.75% of instances, and a combination of advanced regular expressions with LLMs reached 92%.
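
The normalization step mentioned above can be pictured with a small Python sketch. It assumes that normalization means collapsing letter-spaced words (single letters separated by single spaces) back into ordinary words before the delimiting phrases are matched; the function name, the three-letter minimum, and the sample string are illustrative assumptions, not the authors' expressions.

```python
import re

# A run of at least three single word characters separated by single spaces,
# e.g. "r o z h o d o l". The three-character minimum is an assumption chosen
# so that ordinary pairs of one-letter words are left alone.
_SPACED = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_letter_spacing(text: str) -> str:
    """Join letter-spaced words such as 'r o z h o d o l' into 'rozhodol'."""
    return _SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)
```

One limitation of this sketch: two letter-spaced words separated by only a single space would be merged into one token, so a real pipeline would need an extra rule for word boundaries.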

[39] Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Firoj Ahmmed Patwary,Abdullah Al Noman

Main category: cs.CL

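TL;DR: This paper presents BengaliBPE, a Byte Pair Encoding (BPE) tokenizer designed specifically for Bengali. Using Unicode normalization, grapheme-level initialization, and morphology-aware merge rules, it improves tokenization for a morphologically rich language, yielding finer-grained segmentation and better morphological interpretability at a slightly higher computational cost.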

Details

Motivation: Existing subword tokenizers are designed mainly for Latin or multilingual corpora and perform poorly on morphologically rich languages such as Bengali, so a tokenization method optimized for the Bengali script is needed. Method: BengaliBPE applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules, and is compared with Whitespace, SentencePiece BPE, and HuggingFace BPE on a large-scale Bengali news classification dataset. Result: BengaliBPE gives the most detailed segmentation and the best morphological interpretability; its computational cost is slightly higher, and its downstream classification performance is competitive. Conclusion: Language-specific tokenization matters for morphologically rich languages, and BengaliBPE provides a strong foundation for future Bengali NLP systems, including large-scale pretrained models. Abstract: Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are mostly designed for Latin or multilingual corpora and do not perform well on languages with rich morphology such as Bengali. To address this limitation, we present BengaliBPE, a Byte Pair Encoding (BPE) tokenizer specifically developed for the Bengali script. BengaliBPE applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. We use a large-scale Bengali news classification dataset to compare BengaliBPE with three baselines: Whitespace, SentencePiece BPE, and HuggingFace BPE. The evaluation considers tokenization granularity, encoding speed, and downstream classification accuracy. While all methods perform reasonably well, BengaliBPE provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost. These findings highlight the importance of language-aware tokenization for morphologically rich scripts and establish BengaliBPE as a strong foundation for future Bengali NLP systems, including large-scale pretraining of contextual language models.
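
As a rough picture of what Unicode normalization followed by grapheme-level initialization could look like, the Python sketch below splits Bengali text into approximate grapheme clusters that BPE merges could then start from. It is a simplified assumption, not the BengaliBPE implementation: it uses only the standard `unicodedata` module, treats every combining mark as part of the preceding cluster, joins a letter to a preceding hasanta (virama), and ignores zero-width joiners.

```python
import unicodedata

VIRAMA = "\u09cd"  # Bengali hasanta, which joins consonants into conjuncts


def bengali_graphemes(text: str) -> list[str]:
    """Split text into approximate grapheme clusters after NFC normalization."""
    clusters: list[str] = []
    for ch in unicodedata.normalize("NFC", text):
        category = unicodedata.category(ch)
        # Vowel signs, nukta, anusvara, and hasanta are combining marks (M*).
        is_mark = category.startswith("M")
        # A letter that follows a hasanta continues the same conjunct.
        joins_previous = (
            category == "Lo" and bool(clusters) and clusters[-1].endswith(VIRAMA)
        )
        if clusters and (is_mark or joins_previous):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

Starting merges from clusters like these, rather than from raw code points or bytes, means a vowel sign or conjunct is never split away from its base consonant, which is one plausible reading of "preserving subword integrity".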

[40] A multimodal multiplex of the mental lexicon for multilingual individuals

Maria Huynh,Wilder C. Rodrigues

Main category: cs.CL

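TL;DR: This study examines the structure of the mental lexicon in multilingual individuals, in particular how a heritage language influences the learning of a new language, and extends existing frameworks with a multimodal model that adds visual input.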

Details

Motivation: Bilingualism was traditionally thought to add cognitive load, but recent research finds that multilinguals perform better on linguistic and cognitive tasks. Understanding how the multilingual word recognition system works requires a more precise model of the mental lexicon. Method: Building on the multiplex network model of Stella et al. (2018) and the BIA+ framework of Dijkstra and van Heuven (2002), the study uses a multilayer network approach and adds a visual input layer to build a multimodal model of the multilingual mental lexicon, comparing performance in a translation task under visual-plus-text and text-only conditions. Result: No experimental results are reported yet; the work is a research proposal that presents a new multimodal model of the multilingual mental lexicon and a way to validate it. Conclusion: The study is expected to clarify how visual input affects multilingual lexical processing and what role a heritage language plays in learning a new language, providing new theoretical support for models of multilingual cognition. Abstract: Historically, bilingualism was often perceived as an additional cognitive load that could hinder linguistic and intellectual development. However, over the last three decades, this view has changed considerably. Numerous studies have aimed to model and understand the architecture of the bilingual word recognition system (Dijkstra and van Heuven, 2002), investigating how parallel activation operates in the brain and how one language influences another (Kroll et al., 2015). Increasingly, evidence suggests that multilinguals, individuals who speak three or more languages, can perform better than monolinguals in various linguistic and cognitive tasks, such as learning an additional language (Abu-Rabia and Sanitsky, 2010). This research proposal focuses on the study of the mental lexicon and how it may be structured in individuals who speak multiple languages. Building on the work of Stella et al. (2018), who investigated explosive learning in humans using a multiplex model of the mental lexicon, and the Bilingual Interactive Activation (BIA+) framework proposed by Dijkstra and van Heuven (2002), the present study applies the same multilayer network principles introduced by Kivela et al. (2014). Our experimental design extends previous research by incorporating multimodality into the multiplex model, introducing an additional layer that connects visual inputs to their corresponding lexical representations across the multilingual layers of the mental lexicon. In this research, we aim to explore how a heritage language influences the acquisition of another language. 
Specifically, we ask: Does the presence of visual input in a translation task influence participants' proficiency and accuracy compared to text-only conditions?

[41] Large Language Models for Explainable Threat Intelligence

Tiago Dinis,Miguel Correia,Roger Tavares

Main category: cs.CL

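TL;DR: This paper presents RAGRecon, a large language model system with retrieval-augmented generation (RAG) for obtaining threat intelligence, and improves explainability by generating a knowledge graph for each reply.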

Details

Motivation: Traditional security mechanisms struggle to keep up with increasingly complex cyber threats, so more intelligent and explainable methods are needed to support threat intelligence analysis. Method: The RAGRecon system combines a large language model with retrieval-augmented generation (RAG) and generates a visual knowledge graph for every reply to make the answers explainable. Result: In experiments on two datasets with seven different LLMs, the responses of the best combinations matched the reference responses more than 91% of the time. Conclusion: RAGRecon combines real-time retrieval with domain knowledge to give accurate and explainable answers to questions about cybersecurity threats, improving the transparency and practical value of AI in cybersecurity. Abstract: As cyber threats continue to grow in complexity, traditional security mechanisms struggle to keep up. Large language models (LLMs) offer significant potential in cybersecurity due to their advanced capabilities in text processing and generation. This paper explores the use of LLMs with retrieval-augmented generation (RAG) to obtain threat intelligence by combining real-time information retrieval with domain-specific data. The proposed system, RAGRecon, uses an LLM with RAG to answer questions about cybersecurity threats. Moreover, it makes this form of Artificial Intelligence (AI) explainable by generating and visually presenting to the user a knowledge graph for every reply. This increases the transparency and interpretability of the reasoning of the model, allowing analysts to better understand the connections made by the system based on the context recovered by the RAG system. We evaluated RAGRecon experimentally with two datasets and seven different LLMs and the responses matched the reference responses more than 91% of the time for the best combinations.

[42] Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning

Yahui Fu,Zi Haur Pang,Tatsuya Kawahara

Main category: cs.CL

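TL;DR: The paper proposes a unified framework that models user satisfaction preferences at both the individual and the group level, using personalized reasoning chains and unsupervised clustering to improve satisfaction estimation for minority user groups in dialogue systems.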

Details

Motivation: Existing alignment methods usually overlook the differing intents and preferences of minority users, which makes user satisfaction estimation less accurate. Method: The authors propose Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences, an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M2PC) algorithm for unsupervised user grouping, and a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes individual and group preferences. Result: Experiments on an emotional support dialogue dataset show consistent improvements in user satisfaction estimation over existing methods, with the largest gains for minority or underrepresented user groups. Conclusion: The framework accounts for both individual and group preferences, improves satisfaction prediction for diverse users, and makes the model fairer and more adaptive. Abstract: User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M2PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.

[43] Steering Language Models with Weight Arithmetic

Constanza Fierro,Fabien Roger

Main category: cs.CL

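TL;DR: The paper proposes contrastive weight steering, a post-training method that edits model parameters with weight arithmetic to achieve stronger behavioral control from narrow training data and to reduce undesired behavioral drift during fine-tuning.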

Details

Motivation: Feedback on a narrow distribution can lead to unintended generalization, while high-quality feedback on diverse data is expensive, so a method is needed that makes better use of narrow data to control model behavior precisely. Method: The weight deltas of two small fine-tunes, one inducing the target behavior and one inducing its opposite, are subtracted to obtain a behavior direction; this direction is then added to or removed from the model weights to steer behavior. Result: Weight steering often generalizes further than activation steering, giving stronger out-of-distribution behavioral control before general capabilities degrade; it is applied both to mitigate sycophancy and to induce misalignment. In task-specific fine-tuning it partially mitigates undesired behavioral drift while preserving task performance gains. Preliminary evidence suggests that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction. Conclusion: Contrastive weight steering is an effective post-training model editing method that allows fine-grained control of model behavior from limited data, and it may help monitor and detect rare misaligned behaviors that never manifest during training or evaluation. Abstract: Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
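
The weight arithmetic described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' code: weights are plain dictionaries mapping a parameter name to a list of floats, and the scaling coefficient `alpha` is a hypothetical knob that the abstract does not specify.

```python
def behavior_direction(base, positive, negative):
    """Subtract the two fine-tuning deltas: (positive - base) - (negative - base)."""
    return {
        name: [
            (p - b) - (n - b)
            for b, p, n in zip(base[name], positive[name], negative[name])
        ]
        for name in base
    }


def steer(weights, direction, alpha):
    """Add the direction (alpha > 0) or remove it (alpha < 0) from the weights."""
    return {
        name: [w + alpha * d for w, d in zip(weights[name], direction[name])]
        for name in weights
    }
```

Because the base weights cancel in the subtraction, the direction depends only on how the two fine-tunes differ from each other, which is why parameters that both fine-tunes changed in the same way contribute nothing to it.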

[44] MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis

Yuexin Wu,Shiqi Wang,Vasile Rus

Main category: cs.CL

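TL;DR: This paper introduces the MIMIC-SR-ICD11 dataset and the LL-Rank re-ranking framework for diagnosis from clinical reports; LL-Rank consistently outperforms a generative baseline.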

Details

Motivation: Electronic health records often omit important details found in patients' self-reports, and these details matter for diagnosis, so more effective diagnostic modeling is needed. Method: The authors build MIMIC-SR-ICD11, a dataset aligned to ICD-11, and propose LL-Rank, which re-ranks labels by computing a length-normalized joint likelihood of each label given the report context and subtracting the label's report-free prior likelihood. Result: Across seven model backbones, LL-Rank consistently outperforms the GenMap baseline; ablations show that the gains come mainly from the PMI-based scoring. Conclusion: LL-Rank improves multi-label diagnosis from self-report text, mitigates label frequency bias, and has potential for clinical use. Abstract: Disease diagnosis is a central pillar of modern healthcare, enabling early detection and timely intervention for acute conditions while guiding lifestyle adjustments and medication regimens to prevent or slow chronic disease. Self-reports preserve clinically salient signals that templated electronic health record (EHR) documentation often attenuates or omits, especially subtle but consequential details. To operationalize this shift, we introduce MIMIC-SR-ICD11, a large English diagnostic dataset built from EHR discharge notes and natively aligned to WHO ICD-11 terminology. We further present LL-Rank, a likelihood-based re-ranking framework that computes a length-normalized joint likelihood of each label given the clinical report context and subtracts the corresponding report-free prior likelihood for that label. Across seven model backbones, LL-Rank consistently outperforms a strong generation-plus-mapping baseline (GenMap). Ablation experiments show that LL-Rank's gains primarily stem from its PMI-based scoring, which isolates semantic compatibility from label frequency bias.
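
The scoring rule described above can be sketched as follows in Python. The sketch assumes that per-token log-probabilities for each label have already been obtained from a language model, once with the report as context and once without it; the function names, the input format, and the choice to length-normalize the prior as well are assumptions, not details confirmed by the abstract.

```python
def ll_rank_score(cond_logprobs, prior_logprobs):
    """Length-normalized log-likelihood given the report, minus the report-free prior."""
    cond = sum(cond_logprobs) / len(cond_logprobs)
    prior = sum(prior_logprobs) / len(prior_logprobs)
    return cond - prior


def rerank(candidates):
    """Sort labels by score, highest first.

    `candidates` maps each label to a pair (cond_logprobs, prior_logprobs).
    """
    return sorted(
        candidates,
        key=lambda label: ll_rank_score(*candidates[label]),
        reverse=True,
    )
```

In a toy case where a frequent label has a higher raw likelihood than a rare one, the rare label can still rank first if the report raises its likelihood far above its prior, which is the PMI-style effect the ablation credits for the gains.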

cs.CV [Back]

[45] IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Ali Faraz,Akash,Shaharukh Khan,Raja Kolla,Akshat Patidar,Suranjan Goswami,Abhinav Ravi,Chandra Khatri,Shubham Agarwal

Main category: cs.CV

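TL;DR: This paper introduces IndicVisionBench, the first large-scale multilingual vision-language benchmark centered on the Indian subcontinent. It covers 13 cultural topics, 10 Indian languages plus English, and about 5K images with 37K+ QA pairs, and is used to evaluate vision-language models in culturally diverse and multilingual settings.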

Details

Motivation: Most evaluation benchmarks for vision-language models are Western-centric and do not adequately assess cultural diversity or multilingual settings, so a benchmark focused on a non-Western culture is needed to expose model limitations in realistic, diverse environments. Method: The benchmark comprises three tasks, Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covers 10 Indian languages and English with 6 question types, and is released together with a parallel corpus of annotations across the 10 Indic languages. Result: Experiments on 8 models of different types show substantial performance gaps for current vision-language models in Indian languages and cultural contexts, especially for low-resource languages. Conclusion: IndicVisionBench provides a reproducible framework for evaluating vision-language models in culturally diverse and multilingual settings and supports more inclusive multimodal research. Abstract: Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

[46] Knowledge-based anomaly detection for identifying network-induced shape artifacts

Rucha Deshpande,Tahsin Rahman,Miguel Lago,Adarsh Subbaswamy,Jana G. Delfino,Ghada Zamzmi,Elim Thompson,Aldo Badano,Seyed Kahaki

Main category: cs.CV

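TL;DR: The paper proposes a knowledge-based anomaly detection method for finding network-induced shape artifacts in synthetic images. It performs well on two synthetic mammography datasets, with AUC values of 0.97 and 0.91, and is validated in a study with human readers.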

Details

Motivation: Synthetic data can ease data scarcity when training machine learning models, but it may introduce artifacts and unrealistic features that harm model performance and clinical utility, so effective quality assessment is needed. Method: A two-stage framework first builds a specialized feature space by analyzing the per-image distribution of angle gradients along anatomical boundaries and then applies an isolation forest for anomaly detection. Result: On the CSAW-syn and VMLO-syn datasets, the method concentrates artifacts in the most anomalous 1% of samples, with AUC values of 0.97 and 0.91; a reader study shows agreement between human and algorithmic judgments, with mean agreement rates of 66% and 68% and Kendall-Tau correlations of 0.45 and 0.43. Conclusion: The method supports responsible use of synthetic data by letting developers check synthetic images against known anatomical constraints and pinpoint and address specific issues to improve the overall quality of a synthetic dataset. Abstract: Synthetic data provides a promising approach to address data scarcity for training machine learning models; however, adoption without proper quality assessments may introduce artifacts, distortions, and unrealistic features that compromise model performance and clinical utility. This work introduces a novel knowledge-based anomaly detection method for detecting network-induced shape artifacts in synthetic images. The introduced method utilizes a two-stage framework comprising (i) a novel feature extractor that constructs a specialized feature space by analyzing the per-image distribution of angle gradients along anatomical boundaries, and (ii) an isolation forest-based anomaly detector. We demonstrate the effectiveness of the method for identifying network-induced shape artifacts in two synthetic mammography datasets from models trained on CSAW-M and VinDr-Mammo patient datasets respectively. Quantitative evaluation shows that the method successfully concentrates artifacts in the most anomalous partition (1st percentile), with AUC values of 0.97 (CSAW-syn) and 0.91 (VMLO-syn). In addition, a reader study involving three imaging scientists confirmed that images identified by the method as containing network-induced shape artifacts were also flagged by human readers with mean agreement rates of 66% (CSAW-syn) and 68% (VMLO-syn) for the most anomalous partition, approximately 1.5-2 times higher than the least anomalous partition. Kendall-Tau correlations between algorithmic and human rankings were 0.45 and 0.43 for the two datasets, indicating reasonable agreement despite the challenging nature of subtle artifact detection. 
This method is a step forward in the responsible use of synthetic data, as it allows developers to evaluate synthetic images for known anatomic constraints and pinpoint and address specific issues to improve the overall quality of a synthetic dataset.

[47] CPO: Condition Preference Optimization for Controllable Image Generation

Zonglin Lyu,Ming Li,Xinxin Liu,Chen Chen

Main category: cs.CV

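TL;DR: This paper proposes Condition Preference Optimization (CPO), which performs preference learning over control signals rather than generated images to optimize controllability in text-to-image generation. Compared with ControlNet++ and DPO, it improves controllability across several control types at lower computational cost.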

Details

Motivation: ControlNet++ optimizes only low-noise timesteps, ignoring high-noise timesteps and introducing approximation errors, while DPO struggles to separate controllability from other factors because of uncertainty in the generated images; a more effective, low-variance way to optimize controllability is needed. Method: CPO constructs winning and losing control signal pairs ($\mathbf{c}^{w}$, $\mathbf{c}^{l}$) and performs preference learning over the control conditions so that the model prefers the winning signal, avoiding confounding factors in the generated images. Result: CPO has theoretically lower contrastive loss variance than DPO and empirically outperforms ControlNet++ on segmentation, human pose, edge, and depth control, with error rate reductions of over 10% in segmentation, 70-80% in human pose, and 2-5% in edge and depth maps. Conclusion: CPO is an efficient, low-variance method for optimizing controllability; preference learning over control conditions improves the controllability of text-to-image generation while reducing the computation and storage needed for dataset curation. Abstract: To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., $t < 200$) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images ($I^{w}$) over less controllable ones ($I^{l}$). However, due to uncertainty in generative models, it is difficult to ensure that win-lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, $\mathbf{c}^{w}$ and $\mathbf{c}^{l}$, and train the model to prefer $\mathbf{c}^{w}$. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. 
Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over 10% error rate reduction in segmentation, 70-80% in human pose, and consistent 2-5% reductions in edge and depth maps.

[48] DARN: Dynamic Adaptive Regularization Networks for Efficient and Robust Foundation Model Adaptation

Dhenenjay Yadav,Rohan Sawai

Main category: cs.CV

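TL;DR: This paper presents DARN, a new decoder architecture for satellite image analysis that adjusts its regularization dynamically to make foundation models more adaptable and robust on geospatial tasks.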

Details

Motivation: Existing adaptation methods for foundation models use fixed regularization strategies, which cope poorly with the strong heterogeneity of satellite imagery and limit performance on complex geospatial tasks. Method: The DARN architecture has three core components, a Task Complexity Predictor (TCP), Adaptive Dropout Modulation (ADM), and Dynamic Capacity Gating (DCG), which together provide regularization and capacity control based on per-sample difficulty. Result: DARN performs strongly under both full fine-tuning and efficient adaptation: 86.66% mIoU on GeoBench (+5.56 percentage points) and 90.5% mIoU on Sen1Floods11, with better out-of-distribution generalization, robustness to corruption, and minority-class performance than existing methods. Conclusion: Through its dynamic adaptive mechanisms, DARN offers a more intelligent, robust, and efficient way to adapt models for geospatial analysis, with good potential for real-world deployment. Abstract: Foundation models (FMs) offer powerful representations for geospatial analysis, but adapting them effectively remains challenging. Standard adaptation methods, whether full fine-tuning or efficient frozen-backbone approaches, typically employ decoders with fixed regularization strategies, failing to account for the significant heterogeneity in satellite imagery. We introduce Dynamic Adaptive Regularization Networks (DARN), a novel decoder architecture designed to address this limitation. DARN integrates three key innovations: (1) a lightweight Task Complexity Predictor (TCP) that estimates per-sample difficulty, (2) Adaptive Dropout Modulation (ADM), dynamically adjusting dropout rates (from 0.1 to 0.5) based on predicted complexity, and (3) Dynamic Capacity Gating (DCG) that modulates channel activation. We provide theoretical justifications linking DARN's optimization to stationary point convergence and its mechanism to adaptive information bottlenecks. Empirically, DARN demonstrates exceptional performance across both major adaptation paradigms. In full fine-tuning (unfrozen backbone), DARN achieves a new state-of-the-art on the multi-task GeoBench benchmark (86.66% mIoU, +5.56 pp over prior SOTA). In efficient adaptation (frozen backbone), DARN achieves SOTA-competitive accuracy (90.5% mIoU on Sen1Floods11) while delivering substantial advantages crucial for real-world deployment: superior out-of-distribution (OOD) generalization (+9.5 pp mIoU on AI4SmallFarms), enhanced robustness (17% relative reduction in corruption error), and improved performance on minority classes. 
DARN offers a more intelligent, robust, and efficient approach to leveraging FMs in critical geospatial applications.
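
The Adaptive Dropout Modulation idea can be illustrated with a minimal sketch. The abstract only states that dropout rates vary from 0.1 to 0.5 with predicted complexity; the linear mapping, the assumption that harder samples get more dropout, and the name `adaptive_dropout_rate` are all hypothetical choices made here for illustration, not the authors' implementation.

```python
def adaptive_dropout_rate(complexity: float,
                          p_min: float = 0.1,
                          p_max: float = 0.5) -> float:
    """Map a predicted per-sample complexity score to a dropout rate.

    Assumptions (not from the paper): the score lies in [0, 1], the
    mapping is linear, and higher complexity means a higher rate.
    """
    # Clamp the score so out-of-range predictions stay within [p_min, p_max].
    c = min(max(complexity, 0.0), 1.0)
    return p_min + (p_max - p_min) * c


if __name__ == "__main__":
    for score in (0.0, 0.5, 1.0):
        print(score, adaptive_dropout_rate(score))
```

In a real decoder the score would come from a learned predictor and the rate would be applied per sample inside the dropout layer; this sketch only shows the range mapping.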

[49] Global 3D Reconstruction of Clouds & Tropical Cyclones

Shirin Ermis,Cesar Aybar,Lilli Freischem,Stella Girtsou,Kyriaki-Margarita Bintsi,Emiliano Diaz Salas-Porras,Michael Eisinger,William Jones,Anna Jungbluth,Benoit Tremblay

Main category: cs.CV

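TL;DR: Proposes a machine learning method built on a pre-training and fine-tuning framework that, for the first time, generates global instantaneous 3D cloud maps from multi-satellite data and accurately reconstructs the 3D structure of intense tropical cyclones.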

Details

Motivation: Accurate forecasting of tropical cyclones (TCs) is challenging because observations are limited and cloud properties are hard to resolve; existing 3D cloud reconstruction methods are poorly validated in TC-prone regions and for intense storms. Method: Uses a pre-training and fine-tuning framework that learns from multi-satellite data with global coverage to translate 2D satellite imagery into 3D cloud maps of key cloud properties, and evaluates it on a custom-built TC dataset. Result: Generates global instantaneous 3D cloud maps for the first time, accurately reconstructs the 3D structure of intense storms, and provides estimates where observations are missing. Conclusion: The model extends the reach of satellite observations and provides an important tool for understanding TC intensification and improving forecasts. Abstract: Accurate forecasting of tropical cyclones (TCs) remains challenging due to limited satellite observations probing TC structure and difficulties in resolving cloud properties involved in TC intensification. Recent research has demonstrated the capabilities of machine learning methods for 3D cloud reconstruction from satellite observations. However, existing approaches have been restricted to regions where TCs are uncommon, and are poorly validated for intense storms. We introduce a new framework, based on a pre-training--fine-tuning pipeline, that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps of relevant cloud properties. We apply our model to a custom-built TC dataset to evaluate performance in the most challenging and relevant conditions. We show that we can - for the first time - create global instantaneous 3D cloud maps and accurately reconstruct the 3D structure of intense storms. Our model not only extends available satellite observations but also provides estimates when observations are missing entirely. This is crucial for advancing our understanding of TC intensification and improving forecasts.

[50] EETnet: a CNN for Gaze Detection and Tracking for Smart-Eyewear

Andrea Aspesi,Andrea Simpsi,Aaron Tognoli,Simone Mentasti,Luca Merigo,Matteo Matteucci

Main category: cs.CV

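TL;DR: Presents EETnet, a convolutional neural network designed for eye tracking from event-based data that can run on resource-constrained microcontrollers, together with a methodology for training, evaluating, and quantizing the network.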

Details

Motivation: Most existing event-camera eye-tracking solutions depend on high-performance GPUs and have not been deployed on real embedded devices, which limits their use in low-power settings. Method: Designs two EETnet architectures: a classification model that detects the pupil on a grid over the image, and a regression model that operates at the pixel level; both are trained, evaluated, and quantized using a public dataset. Result: EETnet runs efficiently on resource-constrained microcontrollers and suits low-power, low-latency eye-tracking applications. Conclusion: EETnet achieves purely event-driven eye tracking and, while keeping latency low and efficiency high, is successfully deployed on embedded devices, advancing its use in practical scenarios. Abstract: Event-based cameras are becoming a popular solution for efficient, low-power eye tracking. Due to the sparse and asynchronous nature of event data, they require less processing power and offer latencies in the microsecond range. However, many existing solutions are limited to validation on powerful GPUs, with no deployment on real embedded devices. In this paper, we present EETnet, a convolutional neural network designed for eye tracking using purely event-based data, capable of running on microcontrollers with limited resources. Additionally, we outline a methodology to train, evaluate, and quantize the network using a public dataset. Finally, we propose two versions of the architecture: a classification model that detects the pupil on a grid superimposed on the original image, and a regression model that operates at the pixel level.

[51] 3D Gaussian Point Encoders

Jim James,Ben Wilson,Simon Lucey,James Hays

Main category: cs.CV

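TL;DR: Proposes the 3D Gaussian Point Encoder, an explicit per-point embedding built on mixtures of learned 3D Gaussians that is more efficient, faster, and uses fewer parameters than PointNet.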

Details

Motivation: To overcome the efficiency and parameter overhead of implicit representations such as PointNet in 3D recognition tasks, and to explore the advantages of explicit geometric representations. Method: Uses natural gradients and knowledge distillation to learn a 3D Gaussian basis from PointNet, building an explicit 3D Gaussian Point Encoder that can reconstruct PointNet activations, and adopts filtering techniques from 3D Gaussian Splatting for further acceleration. Result: The 3D Gaussian Point Encoder runs 2.7 times faster than PointNet with 46% less memory and 88% fewer FLOPs; within Mamba3D it runs 1.27 times faster with memory and FLOPs reduced by 42% and 54% respectively, and it reaches high frame rates on CPU-only devices. Conclusion: The 3D Gaussian Point Encoder is an efficient, lightweight, explicit 3D representation suited to real-time applications and clearly outperforms traditional PointNet. Abstract: In this work, we introduce the 3D Gaussian Point Encoder, an explicit per-point embedding built on mixtures of learned 3D Gaussians. This explicit geometric representation for 3D recognition tasks is a departure from widely used implicit representations such as PointNet. However, it is difficult to learn 3D Gaussian encoders in end-to-end fashion with standard optimizers. We develop optimization techniques based on natural gradients and distillation from PointNets to find a Gaussian Basis that can reconstruct PointNet activations. The resulting 3D Gaussian Point Encoders are faster and more parameter efficient than traditional PointNets. As in the 3D reconstruction literature where there has been considerable interest in the move from implicit (e.g., NeRF) to explicit (e.g., Gaussian Splatting) representations, we can take advantage of computational geometry heuristics to accelerate 3D Gaussian Point Encoders further. We extend filtering techniques from 3D Gaussian Splatting to construct encoders that run 2.7 times faster than a comparable-accuracy PointNet while using 46% less memory and 88% fewer FLOPs. Furthermore, we demonstrate the effectiveness of 3D Gaussian Point Encoders as a component in Mamba3D, running 1.27 times faster and achieving a reduction in memory and FLOPs by 42% and 54% respectively. 3D Gaussian Point Encoders are lightweight enough to achieve high framerates on CPU-only devices.

[52] Data Efficiency and Transfer Robustness in Biomedical Image Segmentation: A Study of Redundancy and Forgetting with Cellpose

Shuo Zhao,Jianxu Chen

Main category: cs.CV

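TL;DR: Using Cellpose as a case study, this work systematically analyzes training data redundancy in biomedical image segmentation and the effect of cross-domain transfer on model retention. It proposes a dataset quantization (DQ) strategy, finds that performance saturates with only 10% of the data, uses DQ-based replay to mitigate catastrophic forgetting, and shows that the ordering of training domains can improve generalization.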

Details

Motivation: To examine training data redundancy and forgetting under cross-domain transfer in generalist biomedical image segmentation models, with the aim of improving training efficiency while preserving multi-domain performance. Method: Proposes a dataset quantization (DQ) strategy for building compact yet diverse training subsets, with latent space analysis using MAE embeddings and t-SNE; evaluates catastrophic forgetting through cross-domain fine-tuning experiments and mitigates it with selective DQ-based replay and different training domain orders. Result: On the Cyto dataset, segmentation performance saturates with only 10% of the data; DQ-selected samples show greater feature diversity; replaying 5-10% of the source data restores source-domain performance; a well-chosen domain order improves generalization and reduces forgetting. Conclusion: Biomedical image segmentation calls for data-centric design: efficient training depends not only on compact datasets but also on retention-aware learning strategies and informed domain ordering. Abstract: Generalist biomedical image segmentation models such as Cellpose are increasingly applied across diverse imaging modalities and cell types. However, two critical challenges remain underexplored: (1) the extent of training data redundancy and (2) the impact of cross domain transfer on model retention. In this study, we conduct a systematic empirical analysis of these challenges using Cellpose as a case study. First, to assess data redundancy, we propose a simple dataset quantization (DQ) strategy for constructing compact yet diverse training subsets. Experiments on the Cyto dataset show that image segmentation performance saturates with only 10% of the data, revealing substantial redundancy and potential for training with minimal annotations. Latent space analysis using MAE embeddings and t-SNE confirms that DQ selected patches capture greater feature diversity than random sampling. Second, to examine catastrophic forgetting, we perform cross domain finetuning experiments and observe significant degradation in source domain performance, particularly when adapting from generalist to specialist domains. We demonstrate that selective DQ based replay reintroducing just 5-10% of the source data effectively restores source performance, while full replay can hinder target adaptation. Additionally, we find that training domain sequencing improves generalization and reduces forgetting in multi stage transfer. Our findings highlight the importance of data centric design in biomedical image segmentation and suggest that efficient training requires not only compact subsets but also retention aware learning strategies and informed domain ordering. The code is available at https://github.com/MMV-Lab/biomedseg-efficiency.

[53] An Active Learning Pipeline for Biomedical Image Instance Segmentation with Minimal Human Intervention

Shuo Zhao,Yu Zhou,Jianxu Chen

Main category: cs.CV

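TL;DR: Proposes a data-centric AI workflow that combines active learning and pseudo-labeling for biomedical image segmentation, substantially reducing the need for manual annotation while maintaining good performance.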

Details

Motivation: Traditional methods struggle with noisy data, deep learning models such as nnU-Net need substantial annotated data for cross-validation, and foundation models generalize zero-shot but can underperform on specific datasets; a method is needed that reduces reliance on manual annotation while improving task-specific performance. Method: Generates pseudo-labels with a foundation model and uses them for nnU-Net's self-configuration; selects a representative core-set for a small amount of manual annotation, which is then used to fine-tune nnU-Net, combining active learning with pseudo-labeling. Result: The method reduces the amount of manual annotation while maintaining competitive segmentation performance, and suits biomedical image analysis settings where labeled data are scarce. Conclusion: The data-centric workflow combines the zero-shot capability of foundation models with the accuracy of nnU-Net, offering a practical, low-annotation-cost solution for biomedical image segmentation. Abstract: Biomedical image segmentation is critical for precise structure delineation and downstream analysis. Traditional methods often struggle with noisy data, while deep learning models such as U-Net have set new benchmarks in segmentation performance. nnU-Net further automates model configuration, making it adaptable across datasets without extensive tuning. However, it requires a substantial amount of annotated data for cross-validation, posing a challenge when only raw images but no labels are available. Large foundation models offer zero-shot generalizability, but may underperform on specific datasets with unique characteristics, limiting their direct use for analysis. This work addresses these bottlenecks by proposing a data-centric AI workflow that leverages active learning and pseudo-labeling to combine the strengths of traditional neural networks and large foundation models while minimizing human intervention. The pipeline starts by generating pseudo-labels from a foundation model, which are then used for nnU-Net's self-configuration. Subsequently, a representative core-set is selected for minimal manual annotation, enabling effective fine-tuning of the nnU-Net model. This approach significantly reduces the need for manual annotations while maintaining competitive performance, providing an accessible solution for biomedical researchers to apply state-of-the-art AI techniques in their segmentation tasks. The code is available at https://github.com/MMV-Lab/AL_BioMed_img_seg.
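
To make the core-set selection step more concrete, here is a minimal sketch of greedy k-center selection, one common way to pick a representative subset from feature vectors. The abstract does not say which core-set algorithm the pipeline uses, so the choice of k-center greedy, the function name `k_center_greedy`, and the toy 2D features are assumptions for illustration only.

```python
import math


def k_center_greedy(points, k, first=0):
    """Pick k indices so that every point is close to a selected one.

    Hypothetical illustration: `points` are per-image feature vectors,
    and the selected indices are the images sent for manual annotation.
    """
    selected = [first]
    # Distance from every point to its nearest selected point so far.
    dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(k, len(points)):
        # The next pick is the point farthest from the current selection.
        nxt = max(range(len(points)), key=lambda i: dist[i])
        selected.append(nxt)
        dist = [min(d, math.dist(p, points[nxt]))
                for d, p in zip(dist, points)]
    return selected


if __name__ == "__main__":
    feats = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
    print(k_center_greedy(feats, 3))
```

The greedy rule spreads the annotation budget across distinct regions of feature space instead of spending it on near-duplicate images.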

[54] Geometry Denoising with Preferred Normal Vectors

Manuel Weiß,Lukas Baumgärtner,Roland Herzog,Stephan Schmidt

Main category: cs.CV

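TL;DR: Proposes a new geometry denoising method based on prior knowledge about the surface normal vector, which segments by the similarity of normals to a set of label vectors and combines total variation regularization with a split Bregman optimization scheme.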

Details

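Motivation: Traditional geometry denoising methods make little use of prior information about surface structure, which limits their ability to preserve geometric features under complex noise. Method: Introduces a predefined set of label vectors as a prior on the normal vector; couples denoising with a segmentation problem based on the similarity between normal vectors and label vectors; regularizes with a total variation term and solves the optimization problem with a split Bregman (ADMM) scheme, with the vertex update based on second-order shape calculus. Result: The method removes noise effectively while preserving salient geometric features, and improves on traditional methods in normal consistency and surface quality. Conclusion: A denoising framework that combines normal vector priors with segmentation is effective, and the split Bregman algorithm with shape-derivative updates improves the accuracy of geometry recovery. Abstract: We introduce a new paradigm for geometry denoising using prior knowledge about the surface normal vector. This prior knowledge comes in the form of a set of preferred normal vectors, which we refer to as label vectors. A segmentation problem is naturally embedded in the denoising process. The segmentation is based on the similarity of the normal vector to the elements of the set of label vectors. Regularization is achieved by a total variation term. We formulate a split Bregman (ADMM) approach to solve the resulting optimization problem. The vertex update step is based on second-order shape calculus.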

[55] Self-Supervised Implicit Attention Priors for Point Cloud Reconstruction

Kyle Fogarty,Chenyue Cai,Jing Yang,Zhilin Guo,Cengiz Öztireli

Main category: cs.CV

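TL;DR: Proposes an implicit self-prior approach that distills a shape-specific prior from the input point cloud itself and embeds it in an implicit neural representation to achieve high-quality surface reconstruction.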

Details

Motivation: Recovering a high-quality surface from an irregular point cloud is ill-posed without strong geometric priors. Method: Jointly trains a small dictionary of learnable embeddings with an implicit distance field; at each query location, cross-attention captures and reuses repeating structures and long-range correlations within the shape. The model is optimized with self-supervised point cloud reconstruction losses, and the densely sampled points and normals are fused in a RIMLS formulation. Result: Without any external training data, the method outperforms both classical and learning-based approaches in detail preservation and robustness to common data degradations. Conclusion: The hybrid strategy uses the learned prior to regularize sparse regions while preserving the fine geometric detail of the input, clearly improving surface reconstruction quality. Abstract: Recovering high-quality surfaces from irregular point clouds is ill-posed unless strong geometric priors are available. We introduce an implicit self-prior approach that distills a shape-specific prior directly from the input point cloud itself and embeds it within an implicit neural representation. This is achieved by jointly training a small dictionary of learnable embeddings with an implicit distance field; at every query location, the field attends to the dictionary via cross-attention, enabling the network to capture and reuse repeating structures and long-range correlations inherent to the shape. Optimized solely with self-supervised point cloud reconstruction losses, our approach requires no external training data. To effectively integrate this learned prior while preserving input fidelity, the trained field is then sampled to extract densely distributed points and analytic normals via automatic differentiation. We integrate the resulting dense point cloud and corresponding normals into a robust implicit moving least squares (RIMLS) formulation. We show this hybrid strategy preserves fine geometric details in the input data, while leveraging the learned prior to regularize sparse regions. Experiments show that our method outperforms both classical and learning-based approaches in generating high-fidelity surfaces with superior detail preservation and robustness to common data degradations.

[56] Clinical-ComBAT: a diffusion-weighted MRI harmonization method for clinical applications

Gabriel Girard,Manon Edde,Félix Dumais,Yoan David,Matthieu Dumont,Guillaume Theaud,Jean-Christophe Houde,Arnaud Boré,Maxime Descoteaux,Pierre-Marc Jodoin

Main category: cs.CV

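TL;DR: Proposes Clinical-ComBAT, a new method for harmonizing multi-site diffusion-weighted MRI data in real-world clinical settings that overcomes the limitations of the traditional ComBAT method.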

Details

Motivation: The traditional ComBAT method relies on linear covariate relationships, homogeneous populations, a fixed number of sites, and well-populated sites, which limits its clinical use. A more flexible method is needed that handles small cohorts and sites added over time. Method: Clinical-ComBAT uses a non-linear polynomial data model, performs site-specific harmonization referenced to a normative site, introduces variance priors adaptable to small cohorts, and supports hyperparameter tuning and a goodness-of-fit metric for assessing harmonization. Each site is harmonized independently, so new data and new clinical sites can be added flexibly. Result: Clinical-ComBAT is validated on simulated and real data, showing better alignment of diffusion metrics, improved consistency across sites, and enhanced applicability for normative modeling. Conclusion: Clinical-ComBAT is a DW-MRI harmonization method suited to real clinical environments, offering greater flexibility and practicality, especially for dynamic, small-cohort, multi-site clinical studies. Abstract: Diffusion-weighted magnetic resonance imaging (DW-MRI) derived scalar maps are effective for assessing neurodegenerative diseases and microstructural properties of white matter in a large number of brain conditions. However, DW-MRI inherently limits the combination of data from multiple acquisition sites without harmonization to mitigate scanner-specific biases. While the widely used ComBAT method reduces site effects in research, its reliance on linear covariate relationships, homogeneous populations, fixed site numbers, and well populated sites constrains its clinical use. To overcome these limitations, we propose Clinical-ComBAT, a method designed for real-world clinical scenarios. Clinical-ComBAT harmonizes each site independently, enabling flexibility as new data and clinics are introduced. It incorporates a non-linear polynomial data model, site-specific harmonization referenced to a normative site, and variance priors adaptable to small cohorts. It further includes hyperparameter tuning and a goodness-of-fit metric for harmonization assessment. We demonstrate its effectiveness on simulated and real data, showing improved alignment of diffusion metrics and enhanced applicability for normative modeling.

[57] Validating Vision Transformers for Otoscopy: Performance and Data-Leakage Effects

James Ndubuisi,Fernando Auat,Marta Vallejo

Main category: cs.CV

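TL;DR: This study compares Swin Transformers with traditional CNNs for diagnosing ear diseases. Initial accuracy reached 100%, but after a data leakage issue was found and mitigated it fell to 83%, underscoring the importance of rigorous data handling in medical machine learning.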

Details

Motivation: With a reported misdiagnosis rate of 27% among specialist otolaryngologists, better diagnostic accuracy for ear diseases is needed; the study asks whether vision transformers outperform traditional convolutional networks. Method: Uses a real-world otoscopic video dataset from a Chilean university hospital, selects frames using Laplacian and Shannon entropy thresholds and removes blank frames, compares the classification accuracy of Swin v1, Swin v2, and ResNet, and identifies and corrects a data leakage issue. Result: Before correction, Swin v1 and Swin v2 reached 100% and 99.1% accuracy and ResNet reached 99.5%; after the leakage was mitigated, accuracy fell to 83% for both Swin models and 82% for ResNet. Conclusion: Vision transformers show promise, but data leakage severely distorts performance evaluation; medical AI research must control data preprocessing rigorously and balance model architecture against data quality to build reliable systems. Abstract: This study evaluates the efficacy of vision transformer models, specifically Swin transformers, in enhancing the diagnostic accuracy of ear diseases compared to traditional convolutional neural networks. With a reported 27% misdiagnosis rate among specialist otolaryngologists, improving diagnostic accuracy is crucial. The research utilised a real-world dataset from the Department of Otolaryngology at the Clinical Hospital of the Universidad de Chile, comprising otoscopic videos of ear examinations depicting various middle and external ear conditions. Frames were selected based on the Laplacian and Shannon entropy thresholds, with blank frames removed. Initially, Swin v1 and Swin v2 transformer models achieved accuracies of 100% and 99.1%, respectively, marginally outperforming the ResNet model (99.5%). These results surpassed metrics reported in related studies. However, the evaluation uncovered a critical data leakage issue in the preprocessing step, affecting both this study and related research using the same raw dataset. After mitigating the data leakage, model performance decreased significantly. Corrected accuracies were 83% for both Swin v1 and Swin v2, and 82% for the ResNet model. This finding highlights the importance of rigorous data handling in machine learning studies, especially in medical applications. The findings indicate that while vision transformers show promise, it is essential to find an optimal balance between the benefits of advanced model architectures and those derived from effective data preprocessing. This balance is key to developing a reliable machine learning model for diagnosing ear diseases.
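
The abstract does not say what form the data leakage took. One common form in datasets built from video frames is that near-identical frames from the same recording end up in both the training and test sets. The sketch below shows a group-wise split that avoids this particular problem; it is an illustration under that assumption, not a description of the authors' fix, and `split_by_group` and the video identifiers are hypothetical.

```python
import random


def split_by_group(frame_groups, test_fraction=0.2, seed=0):
    """Split frame indices so that no group (e.g. source video) is shared.

    `frame_groups[i]` is the group id of frame i. Whole groups are
    assigned to either train or test, never both.
    """
    groups = sorted(set(frame_groups))
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out at least one whole group for testing.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(frame_groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(frame_groups) if g in test_groups]
    return train_idx, test_idx


if __name__ == "__main__":
    videos = ["v1", "v1", "v2", "v2", "v3", "v3", "v4", "v5"]
    train, test = split_by_group(videos)
    print("train:", train, "test:", test)
```

Splitting at the frame level instead would let the model see almost the same image at training and test time, which can inflate accuracy in the way the abstract describes.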

[58] Beta Distribution Learning for Reliable Roadway Crash Risk Assessment

Ahmad Elallaf,Nathan Jacobs,Xinyue Ye,Mei Chen,Gongbo Liang

Main category: cs.CV

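TL;DR: Proposes a geospatial deep learning framework based on satellite imagery that estimates a full Beta probability distribution over fatal crash risk, giving accurate, uncertainty-aware predictions with markedly better recall and calibration.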

Details

Motivation: Traditional traffic safety studies often analyze risk factors in isolation, overlooking the spatial complexity and contextual interactions of the built environment; existing neural network models lack uncertainty quantification, which limits their use in critical decision-making. Method: Uses satellite imagery as a comprehensive spatial input to a new geospatial deep learning model that outputs a Beta probability distribution over fatal crash risk instead of a single deterministic value, giving uncertainty-aware risk predictions. Result: The model improves recall by 17-23% over baselines, is better calibrated, and produces reliable and interpretable risk assessments from satellite imagery alone. Conclusion: The method provides trustworthy AI support for safe autonomous navigation and gives urban planners and policymakers a scalable, equitable, and low-cost tool for improving roadway safety. Abstract: Roadway traffic accidents represent a global health crisis, responsible for over a million deaths annually and costing many countries up to 3% of their GDP. Traditional traffic safety studies often examine risk factors in isolation, overlooking the spatial complexity and contextual interactions inherent in the built environment. Furthermore, conventional Neural Network-based risk estimators typically generate point estimates without conveying model uncertainty, limiting their utility in critical decision-making. To address these shortcomings, we introduce a novel geospatial deep learning framework that leverages satellite imagery as a comprehensive spatial input. This approach enables the model to capture the nuanced spatial patterns and embedded environmental risk factors that contribute to fatal crash risks. Rather than producing a single deterministic output, our model estimates a full Beta probability distribution over fatal crash risk, yielding accurate and uncertainty-aware predictions--a critical feature for trustworthy AI in safety-critical applications. Our model outperforms baselines by achieving a 17-23% improvement in recall, a key metric for flagging potential dangers, while delivering superior calibration. By providing reliable and interpretable risk assessments from satellite imagery alone, our method enables safer autonomous navigation and offers a highly scalable tool for urban planners and policymakers to enhance roadway safety equitably and cost-effectively.
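
To show why a Beta distribution carries more information than a point estimate, the sketch below computes the mean and variance from the two shape parameters using the standard closed-form expressions. How the paper's network produces these parameters is not described in the abstract; the function name `beta_mean_variance` and the example values are hypothetical.

```python
def beta_mean_variance(alpha: float, beta: float):
    """Return (mean, variance) of a Beta(alpha, beta) distribution.

    mean     = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("Beta shape parameters must be positive")
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


if __name__ == "__main__":
    # Two hypothetical predictions with the same mean risk but
    # different confidence: larger parameters give a tighter spread.
    print(beta_mean_variance(2.0, 2.0))
    print(beta_mean_variance(20.0, 20.0))
```

Both example predictions have the same expected risk of 0.5, but the second has a much smaller variance, which is the kind of distinction a single deterministic output cannot express.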

[59] Learning to Restore Multi-Degraded Images via Ingredient Decoupling and Task-Aware Path Adaptation

Hu Gao,Xiaoning Lei,Ying Zhang,Xichen Xu,Guannan Jiang,Lizhuang Ma

Main category: cs.CV

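TL;DR: This paper proposes IMDNet, an adaptive multi-degradation image restoration network that uses decoupled representations of degradation ingredients to guide path selection and handles multiple coexisting degradations effectively.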

Details

Motivation: Most existing image restoration methods target a single degradation type and cope poorly with the multiple coexisting degradations found in real scenes, which limits their practical effectiveness. Method: Designs a degradation ingredient decoupling block (DIDBlock) to separate the features of different degradations, a fusion block (FBlock) that uses learnable matrices, and a task adaptation block (TABlock) in the decoder that dynamically selects restoration paths. Result: IMDNet shows superior performance on multi-degradation restoration while remaining competitive on single-degradation tasks. Conclusion: Through decoupling and adaptive mechanisms, the method restores complex multi-degraded images effectively and improves the model's generalization and practicality. Abstract: Image restoration (IR) aims to recover clean images from degraded observations. Despite remarkable progress, most existing methods focus on a single degradation type, whereas real-world images often suffer from multiple coexisting degradations, such as rain, noise, and haze in a single image, which limits their practical effectiveness. In this paper, we propose an adaptive multi-degradation image restoration network that reconstructs images by leveraging decoupled representations of degradation ingredients to guide path selection. Specifically, we design a degradation ingredient decoupling block (DIDBlock) in the encoder to separate degradation ingredients statistically by integrating spatial and frequency domain information, enhancing the recognition of multiple degradation types and making their feature representations independent. In addition, we present a fusion block (FBlock) to integrate degradation information across all levels using learnable matrices. In the decoder, we further introduce a task adaptation block (TABlock) that dynamically activates or fuses functional branches based on the multi-degradation representation, flexibly selecting optimal restoration paths under diverse degradation conditions. The resulting tightly integrated architecture, termed IMDNet, is extensively validated through experiments, showing superior performance on multi-degradation restoration while maintaining strong competitiveness on single-degradation tasks.

[60] A benchmark multimodal oro-dental dataset for large vision-language models

Haoxin Lv,Ijazul Haq,Jin Du,Jiaxin Ma,Binnian Zhu,Xiaobing Dang,Chaoan Liang,Ruxu Du,Yingjie Zhang,Muhammad Saqib

Main category: cs.CV

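TL;DR: This paper presents a multimodal dataset of 8775 dental checkups, covering 50000 intraoral images, 8056 radiographs, and detailed textual records, and validates it by fine-tuning vision-language models, advancing the use of AI in oral healthcare.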

Details

Motivation: Advancing AI in oral healthcare requires large-scale multimodal datasets that reflect clinical practice, in order to support more accurate diagnosis and treatment recommendations. Method: Collects multimodal data from 4800 patients (2018-2025), including images, radiographs, and textual records, fine-tunes Qwen-VL 3B and 7B, and evaluates them on anomaly classification and diagnostic report generation against their base counterparts and GPT-4o. Result: The fine-tuned Qwen-VL models substantially outperform the base models and GPT-4o on classification of six oro-dental anomalies and on diagnostic report generation, demonstrating the dataset's validity and usefulness. Conclusion: The multimodal dataset is an important resource for AI-driven oral health research and supports the development of intelligent dental diagnosis and treatment systems. Abstract: The advancement of artificial intelligence in oral healthcare relies on the availability of large-scale multimodal datasets that capture the complexity of clinical practice. In this paper, we present a comprehensive multimodal dataset, comprising 8775 dental checkups from 4800 patients collected over eight years (2018-2025), with patients ranging from 10 to 90 years of age. The dataset includes 50000 intraoral images, 8056 radiographs, and detailed textual records, including diagnoses, treatment plans, and follow-up notes. The data were collected under standard ethical guidelines and annotated for benchmarking. To demonstrate its utility, we fine-tuned state-of-the-art large vision-language models, Qwen-VL 3B and 7B, and evaluated them on two tasks: classification of six oro-dental anomalies and generation of complete diagnostic reports from multimodal inputs. We compared the fine-tuned models with their base counterparts and GPT-4o. The fine-tuned models achieved substantial gains over these baselines, validating the dataset and underscoring its effectiveness in advancing AI-driven oro-dental healthcare solutions. The dataset is publicly available, providing an essential resource for future research in AI dentistry.

[61] DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

Tharindu Fernando,Clinton Fookes,Sridha Sridharan

Main category: cs.CV

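TL;DR: This paper proposes a new deep learning framework that uses high-dimensional latent space representations and Multi-Agent Adversarial Reinforcement Learning (MAARL) to build robust, adaptive watermarking for proactive deepfake detection.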

Details

Motivation: Existing passive deepfake detectors generalize poorly to new deepfake types, and existing watermarking methods struggle to balance robustness with sensitivity. Method: Designs a learnable watermark embedder that operates in the latent space, and uses the MAARL paradigm to have the watermarking agent interact with an attacker agent that simulates benign and malicious image manipulations, dynamically optimizing the balance between robustness and fragility. Result: Under challenging manipulation scenarios, the method improves on the state of the art by over 4.5% on CelebA and more than 5.3% on CelebA-HQ. Conclusion: The framework improves both the watermark's robustness to a range of image manipulations and its sensitivity to malicious tampering, clearly outperforming existing methods. Abstract: Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high-quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high-dimensional latent space representations and the Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high-level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA-HQ benchmarks reveal that our method consistently outperforms state-of-the-art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ under challenging manipulation scenarios.

[62] CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting

Hexu Zhao,Xiwen Min,Xiaoteng Liu,Moonjun Gong,Yiming Li,Ang Li,Saining Xie,Jinyang Li,Aurojit Panda

Main category: cs.CV

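TL;DR: Proposes CLM, a system that offloads Gaussians to CPU memory and loads them onto the GPU only when needed, making it possible to render large 3D scenes efficiently on a single consumer-grade GPU.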

Details

Motivation: 3D Gaussian Splatting (3DGS) excels at high-quality novel view synthesis, but its large memory requirement limits scaling to large or intricate scenes and makes it hard to run on ordinary GPUs. Method: Designs CLM, which exploits 3DGS's memory access pattern to keep Gaussians in CPU memory and load them onto the GPU on demand; a new offloading strategy pipelines the work to overlap CPU-GPU communication, GPU computation, and CPU computation, and reduces communication volume to lower overhead. Result: Renders a large scene of 100 million Gaussians on a single RTX4090 GPU and achieves state-of-the-art reconstruction quality. Conclusion: CLM addresses the memory bottleneck of 3DGS on large scenes, letting consumer-grade GPUs handle complex 3D scenes and broadening the practical use of 3DGS. Abstract: 3D Gaussian Splatting (3DGS) is an increasingly popular novel view synthesis approach due to its fast rendering time and high-quality output. However, scaling 3DGS to large (or intricate) scenes is challenging due to its large memory requirement, which exceeds most GPUs' memory capacity. In this paper, we describe CLM, a system that allows 3DGS to render large scenes using a single consumer-grade GPU, e.g., RTX4090. It does so by offloading Gaussians to CPU memory, and loading them into GPU memory only when necessary. To reduce performance and communication overheads, CLM uses a novel offloading strategy that exploits observations about 3DGS's memory access pattern for pipelining, and thus overlaps GPU-to-CPU communication, GPU computation and CPU computation. Furthermore, we also exploit observations about the access pattern to reduce communication volume. Our evaluation shows that the resulting implementation can render a large scene that requires 100 million Gaussians on a single RTX4090 and achieve state-of-the-art reconstruction quality.

Xiongri Shen,Jiaqi Wang,Yi Zhong,Zhenxi Song,Leilei Zhao,Yichen Wei,Lingyan Liang,Shuqiang Wang,Baiying Lei,Demao Deng,Zhiguo Zhang

Main category: cs.CV

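TL;DR: Proposes PDS, which uses a pattern-aware dual-modal 3D diffusion framework and a tissue refinement network to achieve high-quality cross-modality synthesis between fMRI and dMRI, reaching state-of-the-art performance on several datasets.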

Details

Motivation: To address problems caused by missing modalities in the clinical use of fMRI and dMRI, in particular the limitations of existing GAN and diffusion models for fMRI-dMRI synthesis, which stem from large signal differences and insufficient integration of disease-related neuroanatomical patterns. Method: Proposes PDS with two innovations: (1) a pattern-aware dual-modal 3D diffusion framework for cross-modality learning, and (2) a tissue refinement network combined with an efficient microstructure refinement to maintain structural fidelity and fine detail. Result: On OASIS-3, ADNI, and in-house datasets, fMRI synthesis reaches PSNR/SSIM of 29.83 dB/90.84% (+1.54 dB/+4.12% over baselines) and dMRI synthesis reaches 30.00 dB/77.55% (+1.02 dB/+2.2%). In clinical validation with hybrid real-synthetic data, accuracy is 67.92%/66.02%/64.15% (NC vs. MCI vs. AD). Conclusion: PDS performs strongly in fMRI and dMRI cross-modality synthesis, clearly improving image quality and diagnostic usability, with good potential for clinical application. Abstract: Magnetic resonance imaging (MRI), especially functional MRI (fMRI) and diffusion MRI (dMRI), is essential for studying neurodegenerative diseases. However, missing modalities pose a major barrier to their clinical use. Although GAN- and diffusion model-based approaches have shown some promise in modality completion, they remain limited in fMRI-dMRI synthesis due to (1) significant BOLD vs. diffusion-weighted signal differences between fMRI and dMRI in time/gradient axis, and (2) inadequate integration of disease-related neuroanatomical patterns during generation. To address these challenges, we propose PDS, introducing two key innovations: (1) a pattern-aware dual-modal 3D diffusion framework for cross-modality learning, and (2) a tissue refinement network integrated with an efficient microstructure refinement to maintain structural fidelity and fine details. Evaluated on OASIS-3, ADNI, and in-house datasets, our method achieves state-of-the-art results, with PSNR/SSIM scores of 29.83 dB/90.84% for fMRI synthesis (+1.54 dB/+4.12% over baselines) and 30.00 dB/77.55% for dMRI synthesis (+1.02 dB/+2.2%). In clinical validation, the synthesized data show strong diagnostic performance, achieving 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD) in hybrid real-synthetic experiments. Code is available at https://github.com/SXR3015/PDS.

[64] Learning Fourier shapes to probe the geometric world of deep neural networks

Jian Wang,Yixing Yong,Haixia Bi,Lijun He,Fan Li

Main category: cs.CV

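TL;DR: Proposes an end-to-end differentiable framework that uses optimized shapes as semantic carriers to probe the geometric understanding of deep neural networks, and demonstrates applications in model interpretability and adversarial attacks.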

Details

Motivation: Research on deep neural networks in visual recognition has focused mainly on texture at the expense of shape, leaving their geometric understanding poorly explored. Method: Parameterizes arbitrary shapes with a Fourier series, maps them to the pixel grid with a winding number-based mapping, and adds signal energy constraints to improve optimization efficiency and physical plausibility. Result: The optimized shapes trigger high-confidence classifications on their own, precisely isolate a model's salient regions, and form a new, transferable adversarial paradigm. Conclusion: The framework offers a new tool for studying the geometric perception of DNNs and opens new directions for understanding machine vision. Abstract: While both shape and texture are fundamental to visual recognition, research on deep neural networks (DNNs) has predominantly focused on the latter, leaving their geometric understanding poorly probed. Here, we show: first, that optimized shapes can act as potent semantic carriers, generating high-confidence classifications from inputs defined purely by their geometry; second, that they are high-fidelity interpretability tools that precisely isolate a model's salient regions; and third, that they constitute a new, generalizable adversarial paradigm capable of deceiving downstream visual tasks. This is achieved through an end-to-end differentiable framework that unifies a powerful Fourier series to parameterize arbitrary shapes, a winding number-based mapping to translate them into the pixel grid required by DNNs, and signal energy constraints that enhance optimization efficiency while ensuring physically plausible shapes. Our work provides a versatile framework for probing the geometric world of DNNs and opens new frontiers for challenging and understanding machine perception.
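
The two geometric building blocks named in the abstract, a Fourier-series contour and a winding number test, can be sketched as follows. This is a plain, non-differentiable illustration under assumptions: the coefficient layout `(a_k, b_k, c_k, d_k)`, the hard inside/outside mask, and the names `fourier_contour`, `winding_number`, and `rasterize` are hypothetical, and the paper's actual mapping is differentiable and may be defined differently.

```python
import math


def fourier_contour(coeffs, n_points=200):
    """Sample a closed curve from Fourier coefficients.

    coeffs[k-1] = (a_k, b_k, c_k, d_k) for harmonic k (assumed layout):
      x(t) = sum_k a_k cos(kt) + b_k sin(kt)
      y(t) = sum_k c_k cos(kt) + d_k sin(kt)
    """
    pts = []
    for i in range(n_points):
        t = 2.0 * math.pi * i / n_points
        x = sum(a * math.cos(k * t) + b * math.sin(k * t)
                for k, (a, b, _, _) in enumerate(coeffs, start=1))
        y = sum(c * math.cos(k * t) + d * math.sin(k * t)
                for k, (_, _, c, d) in enumerate(coeffs, start=1))
        pts.append((x, y))
    return pts


def winding_number(px, py, contour):
    """Count how many times the closed polygon winds around (px, py)."""
    total = 0.0
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i][0] - px, contour[i][1] - py
        x1, y1 = contour[(i + 1) % n][0] - px, contour[(i + 1) % n][1] - py
        # Signed angle between consecutive vectors from the query point.
        total += math.atan2(x0 * y1 - x1 * y0, x0 * x1 + y0 * y1)
    return round(total / (2.0 * math.pi))


def rasterize(contour, size=9, extent=1.5):
    """Binary mask on a size x size grid over [-extent, extent]^2."""
    step = 2.0 * extent / (size - 1)
    return [[1 if winding_number(-extent + c * step, -extent + r * step,
                                 contour) != 0 else 0
             for c in range(size)] for r in range(size)]


if __name__ == "__main__":
    circle = fourier_contour([(1.0, 0.0, 0.0, 1.0)])  # unit circle
    for row in rasterize(circle):
        print("".join("#" if v else "." for v in row))
```

Adding higher harmonics to the coefficient list deforms the circle into more complex outlines; an optimizer in the spirit of the abstract would adjust those coefficients, which requires a smooth version of the mask that this sketch does not provide.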

[65] Challenges in 3D Data Synthesis for Training Neural Networks on Topological Features

Dylan Peek,Matthew P. Skerritt,Siddharth Pritam,Stephan Chalup

Main category: cs.CV

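TL;DR: Proposes a method for generating 3D datasets using the Repulsive Surface algorithm, for training and evaluating neural network models in topological data analysis.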

Details

Motivation: Traditional topological data analysis methods are computationally expensive, and labeled 3D datasets suited to supervised learning are lacking. Method: Uses the Repulsive Surface algorithm to generate labeled 3D data with controllable topological invariants (such as hole count), and trains a genus estimator network built on a 3D convolutional transformer architecture. Result: The generated dataset offers varied geometry with topological labels; experiments show that network accuracy drops as deformations increase, indicating the effect of geometric complexity on generalized estimators. Conclusion: The dataset fills a gap in labeled 3D data for training and evaluating models in the TDA field. Abstract: Topological Data Analysis (TDA) involves techniques of analyzing the underlying structure and connectivity of data. However, traditional methods like persistent homology can be computationally demanding, motivating the development of neural network-based estimators capable of reducing computational overhead and inference time. A key barrier to advancing these methods is the lack of labeled 3D data with class distributions and diversity tailored specifically for supervised learning in TDA tasks. To address this, we introduce a novel approach for systematically generating labeled 3D datasets using the Repulsive Surface algorithm, allowing control over topological invariants, such as hole count. The resulting dataset offers varied geometry with topological labeling, making it suitable for training and benchmarking neural network estimators. This paper uses a synthetic 3D dataset to train a genus estimator network, created using a 3D convolutional transformer architecture. An observed decrease in accuracy as deformations increase highlights the role of not just topological complexity, but also geometric complexity, when training generalized estimators. This dataset fills a gap in labeled 3D datasets and generation for training and evaluating models and techniques for TDA.

[66] GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

Heng Er Metilda Chee,Jiayin Wang,Zhiqiang Guo,Weizhi Ma,Min Zhang

Main category: cs.CV

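TL;DR: This paper defines the sticker semantic similarity task, builds Triple-S, the first benchmark dataset for it, and proposes GSE, a lightweight and general model that learns robust sticker embeddings and performs well on several downstream tasks.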

Details

Motivation: Sticker content is highly diverse and symbolic, so existing models struggle to understand its semantic relationships; a dedicated task definition, evaluation benchmark, and model are needed to advance sticker semantic understanding. Method: Introduces the Triple-S benchmark (905 human-annotated positive and negative sticker pairs) and designs the General Sticker Encoder (GSE), trained on Triple-S and additional datasets to learn robust sticker embeddings. Result: Experiments show that existing pretrained vision and multimodal models perform poorly at sticker semantic understanding; GSE performs well on unseen stickers and achieves good results on downstream tasks such as emotion classification and sticker retrieval. Conclusion: By releasing Triple-S and GSE, the work provides standardized evaluation tools and an effective embedding model for research on sticker understanding, retrieval, and multimodal generation. Abstract: Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive evaluation, we show that existing pretrained vision and multimodal models struggle to capture nuanced sticker semantics. To address this, we propose the General Sticker Encoder (GSE), a lightweight and versatile model that learns robust sticker embeddings using both Triple-S and additional datasets. GSE achieves superior performance on unseen stickers, and demonstrates strong results on downstream tasks such as emotion classification and sticker-to-sticker retrieval. By releasing both Triple-S and GSE, we provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation. The Triple-S benchmark and GSE have been publicly released.

[67] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Aakriti Agrawal,Gouthaman KV,Rohith Aralikatti,Gauri Jagatap,Jiaxin Yuan,Vijay Kamarshi,Andrea Fanelli,Furong Huang

Main category: cs.CV

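TL;DR: This paper identifies an inherent bias toward the language modality in existing large vision-language model (LVLM) architectures and proposes mitigating it by integrating average-pooled visual features into the text embeddings, which improves visual grounding and reduces hallucinations.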

Details

Motivation: Current LVLM architectures simply append visual embeddings to the text sequence, which biases them toward the language modality and worsens hallucination. Method: Average-pooled visual features are integrated into the text embeddings to refine the textual representation and strengthen cross-modal alignment. Result: The method markedly improves visual grounding on standard benchmarks and clearly reduces hallucinated content. Conclusion: Refining text embeddings with visual information effectively mitigates the modality imbalance; more sophisticated fusion strategies could be explored in future work to improve performance further. Abstract: In this work, we identify an inherent bias in prevailing LVLM architectures toward the language modality, largely resulting from the common practice of simply appending visual embeddings to the input text sequence. To address this, we propose a simple yet effective method that refines textual embeddings by integrating average-pooled visual features. Our approach demonstrably improves visual grounding and significantly reduces hallucinations on established benchmarks. While average pooling offers a straightforward, robust, and efficient means of incorporating visual information, we believe that more sophisticated fusion methods could further enhance visual grounding and cross-modal alignment. Given that the primary focus of this work is to highlight the modality imbalance and its impact on hallucinations, and to show that refining textual embeddings with visual information mitigates this issue, we leave exploration of advanced fusion strategies for future work.
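As a rough illustration of the idea in this entry, the sketch below average-pools a set of visual token embeddings and mixes the pooled vector into every text token embedding. The abstract does not specify how the pooled features are integrated, so the additive fusion, the projection matrix `proj`, and the scale `alpha` are all assumptions made here, not the paper's design.

```python
import numpy as np

def refine_text_embeddings(text_emb, visual_emb, proj, alpha=1.0):
    """Add an average-pooled visual summary to each text token embedding.

    text_emb:   (n_text_tokens, d_text)
    visual_emb: (n_visual_tokens, d_visual)
    proj:       (d_visual, d_text) hypothetical projection into text space
    alpha:      hypothetical mixing weight
    """
    pooled = visual_emb.mean(axis=0)           # average pooling over visual tokens
    visual_summary = pooled @ proj             # map to the text embedding width
    return text_emb + alpha * visual_summary   # broadcast over all text tokens

text = np.zeros((3, 4))
visual = np.ones((5, 2))
refined = refine_text_embeddings(text, visual, proj=np.ones((2, 4)))
```

Because the pooled vector is a single summary of the image, every text token receives the same visual offset; a per-token fusion (for example cross-attention) would be one of the "more sophisticated" options the abstract leaves to future work.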

[68] Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation

Jing Jin,Xu Liu,Te Gao,Zhihong Shi,Yixiong Liang,Ruiqing Zheng,Hulin Kuang,Min Zeng,Shichao Kan

Main category: cs.CV

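TL;DR: Proposes DRE-SLCL, a whole slide image (WSI) representation method based on dynamic residual encoding and slide-level contrastive learning; a memory bank and residual encoding enable end-to-end training, with strong performance on cancer subtyping, cancer recognition, and mutation prediction.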

Details

Motivation: A whole slide image can contain tens of thousands of tiles, and GPU memory limits make it impractical to compute gradients for all tiles in a single mini-batch, so an efficient end-to-end WSI representation learning method is needed. Method: A memory bank stores tile features for all WSIs; during training, a subset of tiles is randomly sampled for each WSI in the mini-batch, additional features are retrieved from the memory bank, the two sets are combined by residual encoding into a WSI representation, and a slide-level contrastive loss is optimized. Result: Experiments on cancer subtyping, cancer recognition, and mutation prediction show that DRE-SLCL outperforms existing methods and improves WSI representation. Conclusion: By dynamically fusing online-sampled and memory-bank features and adding slide-level contrastive learning, DRE-SLCL achieves efficient and strong end-to-end WSI representation learning. Abstract: Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction. Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports of the WSIs within the mini-batch. Experiments on cancer subtyping, cancer recognition, and mutation prediction tasks demonstrate the effectiveness of the proposed DRE-SLCL method.

[69] Pressure2Motion: Hierarchical Motion Synthesis from Ground Pressure with Text Guidance

Zhengxuan Li,Qinhui Yang,Yiyu Zhuang,Chuan Guo,Xinxin Zuo,Xiaoxiao Long,Yao Yao,Xun Cao,Qiu Shen,Hao Zhu

Main category: cs.CV

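TL;DR: Proposes Pressure2Motion, a new method that synthesizes human motion from a ground pressure sequence and a text prompt without specialized equipment, suited to privacy-preserving, low-light, and low-cost scenarios.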

Details

Motivation: Conventional motion capture requires dedicated equipment, which limits its use in privacy-sensitive or low-cost settings; the goal is more convenient and widely applicable motion capture from pressure signals and text prompts. Method: A generative model uses a dual-level feature extractor to interpret pressure data and a hierarchical diffusion model to generate coarse trajectories and fine posture adjustments, with pressure signals and text semantics jointly guiding motion generation. Result: Experiments show that the method generates high-fidelity, physically plausible motion, significantly outperforms existing methods, and reaches state-of-the-art performance on the newly established MPL benchmark. Conclusion: Pressure2Motion is the first work to combine pressure data with linguistic priors for motion generation and offers a new approach to motion capture without cameras or wearable devices. Abstract: We present Pressure2Motion, a novel motion capture algorithm that synthesizes human motion from a ground pressure sequence and text prompt. It eliminates the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed due to the indeterminate mapping from pressure signals to full-body motion. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint. Specifically, our model utilizes a dual-level feature extractor that accurately interprets pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion generation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion generation, and the established MPL benchmark is the first benchmark for this task. Experiments show our method generates high-fidelity, physically plausible motions, establishing a new state-of-the-art for this task. The codes and benchmarks will be publicly released upon publication.

[70] Medical Referring Image Segmentation via Next-Token Mask Prediction

Xinyu Chen,Yiran Wang,Gaoyang Pang,Jiafu Hao,Chentao Yue,Luping Zhou,Yonghui Li

Main category: cs.CV

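TL;DR: This paper proposes NTP-MRISeg, which reformulates medical referring image segmentation (MRIS) as autoregressive next-token prediction over a unified multimodal sequence, simplifying model design and improving performance.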

Details

Motivation: Existing MRIS methods usually rely on complex multimodal fusion or multi-stage decoder designs, which limits simplicity and scalability. Method: Image, text, and mask representations are unified into a token sequence under an autoregressive next-token prediction framework; NkTP reduces cumulative prediction errors, TCL enhances boundary sensitivity, and HET emphasizes hard tokens during training. Result: The method reaches state-of-the-art performance on the QaTa-COV19 and MosMedData+ datasets, supporting its effectiveness and generalization. Conclusion: With a unified token prediction framework, NTP-MRISeg simplifies the MRIS pipeline without modality-specific fusion or external segmentation models, offering an efficient and scalable paradigm for MRIS. Abstract: Medical Referring Image Segmentation (MRIS) involves segmenting target regions in medical images based on natural language descriptions. While achieving promising results, recent approaches usually involve complex design of multimodal fusion or multi-stage decoders. In this work, we propose NTP-MRISeg, a novel framework that reformulates MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This formulation streamlines model design by eliminating the need for modality-specific fusion and external segmentation models, and supports a unified architecture for end-to-end training. It also enables the use of pretrained tokenizers from emerging large-scale multimodal models, enhancing generalization and adaptability. More importantly, to address challenges under this formulation (such as exposure bias, long-tail token distributions, and fine-grained lesion edges), we propose three novel strategies: (1) a Next-k Token Prediction (NkTP) scheme to reduce cumulative prediction errors, (2) Token-level Contrastive Learning (TCL) to enhance boundary sensitivity and mitigate long-tail distribution effects, and (3) a memory-based Hard Error Token (HET) optimization strategy that emphasizes difficult tokens during training. Extensive experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate that NTP-MRISeg achieves new state-of-the-art performance, offering a streamlined and effective alternative to traditional MRIS pipelines.

[71] No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation

Mingyu Sung,Hyeonmin Choe,Il-Min Kim,Sangseok Yun,Jae Mo Kang

Main category: cs.CV

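TL;DR: This paper proposes PITTA, a new test-time adaptation framework for monocular depth estimation that needs no camera pose information and handles domain shift in dynamic environments.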

Details

Motivation: Existing test-time adaptation methods perform poorly in diverse and dynamic environments, so a more robust solution is needed. Method: Proposes a pose-agnostic TTA paradigm and an instance-aware image masking strategy, combined with an edge extraction method to improve performance. Result: Significantly outperforms existing state-of-the-art methods on the DrivingStereo and Waymo datasets. Conclusion: The PITTA framework delivers strong monocular depth estimation under varying environmental conditions and is practical and extensible. Abstract: Monocular depth estimation (MDE), inferring pixel-level depths in single RGB images from a monocular camera, plays a crucial role in a variety of AI applications demanding a three-dimensional (3D) topographical scene. In real-world scenarios, MDE models often need to be deployed in environments with different conditions from those for training. Test-time (domain) adaptation (TTA) is one of the compelling and practical approaches to address the issue. Although there have been notable advancements in TTA for MDE, particularly in a self-supervised manner, existing methods are still ineffective and problematic when applied to diverse and dynamic environments. To break through this challenge, we propose a novel and high-performing TTA framework for MDE, named PITTA. Our approach incorporates two key innovative strategies: (i) pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA enables highly effective TTA on a pretrained MDE network in a pose-agnostic manner without resorting to any camera pose information. Besides, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles, pedestrians, etc.) from a segmentation mask produced by a pretrained panoptic segmentation network, by removing static objects including background components. To further boost performance, we also present a simple yet effective edge extraction methodology for the input image (i.e., a single monocular image) and depth map. Extensive experimental evaluations on DrivingStereo and Waymo datasets with varying environmental conditions demonstrate that our proposed framework, PITTA, surpasses the existing state-of-the-art techniques with remarkable performance improvements in MDE during TTA.

[72] Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

Yuanxiang Huangfu,Chaochao Wang,Weilei Wang

Main category: cs.CV

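TL;DR: Proposes Role-SynthCLIP, a synthetic data framework that uses multi-perspective role-playing prompts to generate semantically diverse image-text pairs, significantly improving CLIP model performance.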

Details

Motivation: Existing synthetic data methods focus on increasing data volume but lack semantic diversity, producing redundant or shallow captions. Method: Multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) guide multimodal large language models to generate captions with greater semantic diversity and finer-grained alignment from different viewpoints. Result: A CLIP-B/16 model trained on only 1 million synthetic pairs reaches 64.1% Recall@1 on MS COCO, exceeding the best baseline trained on 5 million pairs by 2.8 percentage points. Conclusion: By increasing the semantic diversity of synthetic data, Role-SynthCLIP significantly improves CLIP performance without increasing data scale. Abstract: The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
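For readers unfamiliar with the Recall@1 metric quoted in this entry, the following is a minimal sketch of how it is typically computed from a query-candidate similarity matrix. It assumes one correct candidate per query, located on the diagonal; MS COCO retrieval actually pairs each image with several captions, so this is a simplification, and the function name is chosen here for illustration.

```python
import numpy as np

def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i, j] is the score between query i and candidate j;
    the ground-truth match for query i is assumed to be candidate i.
    """
    top1 = similarity.argmax(axis=1)
    return float((top1 == np.arange(similarity.shape[0])).mean())

scores = np.array([
    [0.9, 0.1, 0.0],   # query 0 retrieves candidate 0 (correct)
    [0.2, 0.8, 0.0],   # query 1 retrieves candidate 1 (correct)
    [0.7, 0.1, 0.2],   # query 2 retrieves candidate 0 (wrong)
])
r1 = recall_at_1(scores)
```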

[73] SurgiATM: A Physics-Guided Plug-and-Play Model for Deep Learning-Based Smoke Removal in Laparoscopic Surgery

Mingyu Sheng,Jianan Fan,Dongnan Liu,Guoyan Zheng,Ron Kikinis,Weidong Cai

Main category: cs.CV

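TL;DR: Proposes SurgiATM, a lightweight plug-and-play module for intraoperative smoke removal that combines a physics-based atmospheric model with deep learning, improving the accuracy and generalization of desmoking models without adding trainable parameters.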

Details

Motivation: Intraoperative smoke degrades endoscopic image quality and affects surgical safety and visual analysis, so an efficient, general, and low-cost desmoking method is needed. Method: SurgiATM combines a physics-based atmospheric model with data-driven deep learning and is designed as a lightweight module with no additional trainable weights that can be integrated into various desmoking networks, introducing only two hyperparameters. Result: Validated on three public surgical datasets with ten desmoking methods and multiple network architectures, SurgiATM commonly reduces restoration error and improves generalization. Conclusion: SurgiATM is an efficient, low-cost, easily integrated desmoking module that improves the performance and stability of existing desmoking models and is well suited to clinical use. Abstract: During laparoscopic surgery, smoke generated by tissue cauterization can significantly degrade the visual quality of endoscopic frames, increasing the risk of surgical errors and hindering both clinical decision-making and computer-assisted visual analysis. Consequently, removing surgical smoke is critical to ensuring patient safety and maintaining operative efficiency. In this study, we propose the Surgical Atmospheric Model (SurgiATM) for surgical smoke removal. SurgiATM statistically bridges a physics-based atmospheric model and data-driven deep learning models, combining the superior generalizability of the former with the high accuracy of the latter. Furthermore, SurgiATM is designed as a lightweight, plug-and-play module that can be seamlessly integrated into diverse surgical desmoking architectures to enhance their accuracy and stability, better meeting clinical requirements. It introduces only two hyperparameters and no additional trainable weights, preserving the original network architecture with minimal computational and modification overhead. We conduct extensive experiments on three public surgical datasets with ten desmoking methods, involving multiple network architectures and covering diverse procedures, including cholecystectomy, partial nephrectomy, and diaphragm dissection. The results demonstrate that incorporating SurgiATM commonly reduces the restoration errors of existing models and relatively enhances their generalizability, without adding any trainable layers or weights. This highlights the convenience, low cost, effectiveness, and generalizability of the proposed method.
The code for SurgiATM is released at https://github.com/MingyuShengSMY/SurgiATM.
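The abstract refers to a physics-based atmospheric model without stating its equation. A common choice in dehazing work is the atmospheric scattering model, I = J * t + A * (1 - t), where I is the observed image, J the clean scene, t the transmission, and A the airlight. The sketch below uses that standard model as an assumption; it is not taken from the SurgiATM paper, and it does not show how SurgiATM couples the model with a network or what its two hyperparameters are.

```python
import numpy as np

def add_smoke(clean, transmission, airlight):
    """Forward model (assumed): I = J * t + A * (1 - t)."""
    return clean * transmission + airlight * (1.0 - transmission)

def remove_smoke(smoky, transmission, airlight, t_min=0.1):
    """Invert the assumed model; t_min avoids dividing by tiny transmission."""
    t = np.maximum(transmission, t_min)
    restored = (smoky - airlight * (1.0 - t)) / t
    return np.clip(restored, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.random((4, 4, 3))             # toy RGB frame in [0, 1]
transmission = np.full((4, 4, 1), 0.6)    # hypothetical per-pixel transmission
smoky = add_smoke(clean, transmission, airlight=0.9)
restored = remove_smoke(smoky, transmission, airlight=0.9)
```

In practice the transmission and airlight are unknown and must be estimated, which is presumably where a learned desmoking network would enter.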

[74] Deep learning models are vulnerable, but adversarial examples are even more vulnerable

Jun Li,Yanwei Xu,Keran Li,Xiaoli Zhang

Main category: cs.CV

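TL;DR: The study finds that image adversarial examples are notably sensitive to occlusion, introduces a new metric, SMCE, to quantify fluctuations in model confidence, and designs SWM-AED, a sliding-window-mask adversarial example detection method that reaches up to 96.5% detection accuracy on CIFAR-10.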

Details

Motivation: Understanding the intrinsic differences between adversarial examples and clean samples is key to improving the robustness of deep neural networks and the detection of adversarial attacks. Method: Adversarial examples are generated on CIFAR-10 with nine canonical attacks; Sliding Mask Confidence Entropy (SMCE) is introduced to quantify model confidence fluctuation under occlusion, and a sliding-window-mask detection method (SWM-AED) is proposed. Result: Experiments show that adversarial examples have higher confidence volatility under occlusion; SWM-AED is robust across classifiers and attacks, with detection accuracy above 62% in most cases and up to 96.5%. Conclusion: Occlusion-induced confidence fluctuation can effectively separate adversarial examples from clean samples, and SWM-AED offers a robust detection approach that does not require adversarial training. Abstract: Understanding intrinsic differences between adversarial examples and clean samples is key to enhancing DNN robustness and detection against adversarial attacks. This study first empirically finds that image-based adversarial examples are notably sensitive to occlusion. Controlled experiments on CIFAR-10 used nine canonical attacks (e.g., FGSM, PGD) to generate adversarial examples, paired with original samples for evaluation. We introduce Sliding Mask Confidence Entropy (SMCE) to quantify model confidence fluctuation under occlusion. Using 1800+ test images, SMCE calculations supported by Mask Entropy Field Maps and statistical distributions show adversarial examples have significantly higher confidence volatility under occlusion than originals. Based on this, we propose Sliding Window Mask-based Adversarial Example Detection (SWM-AED), which avoids catastrophic overfitting of conventional adversarial training. Evaluations across classifiers and attacks on CIFAR-10 demonstrate robust performance, with accuracy over 62% in most cases and up to 96.5%.

[75] A Dual-stage Prompt-driven Privacy-preserving Paradigm for Person Re-Identification

Ruolin Li,Min Liu,Yuan Bian,Zhaoyang Li,Yuzhen Li,Xueping Wang,Yaonan Wang

Main category: cs.CV

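TL;DR: Proposes DPPP, a dual-stage prompt-driven privacy-preserving paradigm that uses a diffusion model to generate the large-scale virtual dataset GenePerson and a prompt-driven disentanglement mechanism (PDM) to improve domain-invariant feature learning, significantly improving the generalization of person re-identification models.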

Details

Motivation: Existing virtual datasets generated with game engines are complex to build and generalize poorly across domains, making them hard to apply in real scenarios; a more efficient and more generalizable approach to virtual data generation and learning is needed to address data privacy and model transfer. Method: In the first stage, rich prompts with multi-dimensional attributes drive a diffusion model to synthesize diverse pedestrian images end-to-end, building the large-scale virtual dataset GenePerson; in the second stage, a Prompt-driven Disentanglement Mechanism (PDM) uses contrastive learning and textual inversion networks to map images into pseudo-words for style and content, constructing style-disentangled content prompts that guide the model to learn domain-invariant content features. Result: Models trained on GenePerson with PDM achieve state-of-the-art generalization, surpassing models trained on popular real and virtual Re-ID datasets. Conclusion: The DPPP framework addresses the construction complexity and poor domain generalization of virtual data and offers an efficient, practical paradigm for privacy-preserving person re-identification. Abstract: With growing concerns over data privacy, researchers have started using virtual data as an alternative to sensitive real-world images for training person re-identification (Re-ID) models. However, existing virtual datasets produced by game engines still face challenges such as complex construction and poor domain generalization, making them difficult to apply in real scenarios. To address these challenges, we propose a Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP). In the first stage, we generate rich prompts incorporating multi-dimensional attributes such as pedestrian appearance, illumination, and viewpoint that drive the diffusion model to synthesize diverse data end-to-end, building a large-scale virtual dataset named GenePerson with 130,519 images of 6,641 identities. In the second stage, we propose a Prompt-driven Disentanglement Mechanism (PDM) to learn domain-invariant generalization features. With the aid of contrastive learning, we employ two textual inversion networks to map images into pseudo-words representing style and content, respectively, thereby constructing style-disentangled content prompts to guide the model in learning domain-invariant content features at the image level. Experiments demonstrate that models trained on GenePerson with PDM achieve state-of-the-art generalization performance, surpassing those on popular real and virtual Re-ID datasets.

[76] Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start

Fuyang Liu,Jiaqi Xu,Xiaowei Hu

Main category: cs.CV

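TL;DR: Proposes HFLS-Weather, a physics-driven high-fidelity dataset, and a dual-level reinforcement learning framework to improve the adaptive restoration ability of vision models under complex adverse weather.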

Details

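Motivation: Existing vision models trained on synthetic data generalize poorly to complex and varied adverse-weather degradations and lack an effective unsupervised adaptation mechanism. Method: Builds the HFLS-Weather dataset to simulate diverse weather phenomena and designs a dual-level reinforcement learning framework: at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization without paired supervision; at the global level, a meta-controller dynamically coordinates model selection and execution order. Result: The framework adapts continuously to real-world conditions and reaches state-of-the-art performance across a range of adverse weather scenarios. Conclusion: Combined with high-fidelity simulated data, the dual-level reinforcement learning framework improves generalization and adaptation to unseen weather degradations and advances unsupervised image restoration for real-world use. Abstract: Adverse weather severely impairs real-world visual perception, while existing vision models trained on synthetic data with fixed parameters struggle to generalize to complex degradations. To address this, we first construct HFLS-Weather, a physics-driven, high-fidelity dataset that simulates diverse weather phenomena, and then design a dual-level reinforcement learning framework initialized with HFLS-Weather for cold-start training. Within this framework, at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization, enabling reward-based learning without paired supervision; at the global level, a meta-controller dynamically orchestrates model selection and execution order according to scene degradation. This framework enables continuous adaptation to real-world conditions and achieves state-of-the-art performance across a wide range of adverse weather scenarios. Code is available at https://github.com/xxclfy/AgentRL-Real-Weather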

[77] Early Alzheimer's Disease Detection from Retinal OCT Images: A UK Biobank Study

Yasemin Turkan,F. Boray Tek,M. Serdar Nazlı,Öykü Eren

Main category: cs.CV

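TL;DR: This study is the first to apply deep learning to raw OCT B-scan images for early prediction of Alzheimer's disease (AD). Using pretrained models such as ResNet-34 on the UK Biobank dataset, it reaches an AUC of 0.62; although this is below clinical standards, explainability analyses confirm structural differences in the central macular subfield between AD and control groups, providing a baseline for OCT-based AD prediction.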

Details

Motivation: Early detection of Alzheimer's disease is very challenging because imaging changes precede clinical diagnosis by years. Earlier studies relied mainly on layer thickness measurements, whereas classifying raw OCT images directly may capture subtler neurodegenerative changes, so deep learning on raw OCT images is worth exploring. Method: Multiple pretrained deep learning models (including ImageNet-based networks and the OCT-specific RETFound transformer) are fine-tuned and evaluated on OCT B-scan images from the UK Biobank cohort, using subject-level cross-validation matched for age, sex, and imaging instance; standard and OCT-specific augmentations are applied to reduce overfitting on this small, high-dimensional dataset; a year-weighted loss function prioritizes cases diagnosed within four years of imaging. Result: ResNet-34 was the most stable model in the 4-year cohort, with an AUC of 0.62; although performance is below clinical level, explainability analyses show localized structural differences in the central macular subfield between AD and control groups. Conclusion: This is the first deep learning study to use raw OCT B-scans for early AD prediction; the results show that direct image classification is difficult, especially for detecting subtle biomarkers years before diagnosis, and the limited performance points to the need for larger datasets and multimodal fusion. Abstract: Alterations in retinal layer thickness, measurable using Optical Coherence Tomography (OCT), have been associated with neurodegenerative diseases such as Alzheimer's disease (AD). While previous studies have mainly focused on segmented layer thickness measurements, this study explored the direct classification of OCT B-scan images for the early detection of AD. To our knowledge, this is the first application of deep learning to raw OCT B-scans for AD prediction in the literature. Unlike conventional medical image classification tasks, early detection is more challenging than diagnosis because imaging precedes clinical diagnosis by several years. We fine-tuned and evaluated multiple pretrained models, including ImageNet-based networks and the OCT-specific RETFound transformer, using subject-level cross-validation datasets matched for age, sex, and imaging instances from the UK Biobank cohort. To reduce overfitting in this small, high-dimensional dataset, both standard and OCT-specific augmentation techniques were applied, along with a year-weighted loss function that prioritized cases diagnosed within four years of imaging. ResNet-34 produced the most stable results, achieving an AUC of 0.62 in the 4-year cohort. Although below the threshold for clinical application, our explainability analyses confirmed localized structural differences in the central macular subfield between the AD and control groups.
These findings provide a baseline for OCT-based AD prediction, highlight the challenges of detecting subtle retinal biomarkers years before AD diagnosis, and point to the need for larger datasets and multimodal approaches.

[78] SnowyLane: Robust Lane Detection on Snow-covered Rural Roads Using Infrastructural Elements

Jörg Gamerdinger,Benedict Wetzel,Patrick Schulz,Sven Teufel,Oliver Bringmann

Main category: cs.CV

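TL;DR: Proposes a robust, real-time lane detection method that does not rely on conventional lane markings and instead detects roadside delineator posts as indirect lane indicators, and releases SnowyLane, a synthetic snow dataset with 80,000 annotated frames.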

Details

Motivation: In snow-covered environments, conventional lane markings are often occluded or missing, causing existing lane detection methods to fail, so a robust approach that does not depend on lane markings is needed. Method: Roadside delineator posts are detected, and a smooth lane trajectory is fitted with a parameterized Bezier curve model, using spatial consistency and road geometry to infer the lane. Result: The method is significantly more robust than existing state-of-the-art methods in snow and under heavy occlusion; the SnowyLane dataset is released for training and evaluation. Conclusion: The method offers an effective solution for reliable lane detection in winter scenes, and SnowyLane is a valuable resource for all-weather autonomous driving research. Abstract: Lane detection for autonomous driving in snow-covered environments remains a major challenge due to the frequent absence or occlusion of lane markings. In this paper, we present a novel, robust and real-time-capable approach that bypasses the reliance on traditional lane markings by detecting roadside features, specifically vertical roadside posts called delineators, as indirect lane indicators. Our method first perceives these posts, then fits a smooth lane trajectory using a parameterized Bezier curve model, leveraging spatial consistency and road geometry. To support training and evaluation in these challenging scenarios, we introduce SnowyLane, a new synthetic dataset containing 80,000 annotated frames capturing winter driving conditions with varying snow coverage and lighting conditions. Compared to state-of-the-art lane detection systems, our approach demonstrates significantly improved robustness in adverse weather, particularly in cases with heavy snow occlusion. This work establishes a strong foundation for reliable lane detection in winter scenarios and contributes a valuable resource for future research in all-weather autonomous driving. The dataset is available at https://ekut-es.github.io/snowy-lane
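The curve-fitting step in this entry (a parameterized Bezier curve through detected post positions) can be sketched as an ordinary least-squares fit. The abstract does not describe the authors' fitting procedure, so the cubic degree, the chord-length parameterization, and the function names below are assumptions for illustration only.

```python
import numpy as np
from math import comb

def bernstein_matrix(t, degree=3):
    """Rows are curve parameters t, columns are Bernstein basis values."""
    return np.stack(
        [comb(degree, i) * t**i * (1.0 - t) ** (degree - i) for i in range(degree + 1)],
        axis=1,
    )

def fit_bezier(points, degree=3):
    """Least-squares Bezier control points for ordered 2D points."""
    points = np.asarray(points, dtype=float)
    segment = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(segment)]) / segment.sum()  # chord-length parameters
    control, *_ = np.linalg.lstsq(bernstein_matrix(t, degree), points, rcond=None)
    return control  # (degree + 1, 2)

def eval_bezier(control, n=50):
    """Sample n points along the fitted curve."""
    t = np.linspace(0.0, 1.0, n)
    return bernstein_matrix(t, len(control) - 1) @ control

# Hypothetical post positions along one road side, ordered near to far
posts = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
control = fit_bezier(posts)
lane = eval_bezier(control, n=5)
```

A low-degree curve keeps the trajectory smooth even when only a handful of posts are visible, which is one plausible reason for choosing a Bezier model over connecting the detections directly.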

[79] From Linear Probing to Joint-Weighted Token Hierarchy: A Foundation Model Bridging Global and Cellular Representations in Biomarker Detection

Jingsong Liu,Han Li,Nassir Navab,Peter J. Schüffler

Main category: cs.CV

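TL;DR: JWTH is a new pathology foundation model that combines cell-centric post-tuning with attention pooling to fuse local and global information, outperforming existing models on several biomarker detection tasks.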

Details

Motivation: Most existing pathology foundation models rely on global patch-level embeddings and overlook cell-level morphology, which limits the interpretability and accuracy of AI-based biomarkers. Method: JWTH combines large-scale self-supervised pretraining with cell-centric post-tuning and attention pooling to integrate cell-level and tissue-level features. Result: Across four tasks involving four biomarkers and eight cohorts, JWTH improves balanced accuracy by up to 8.3% and by 1.2% on average over prior models. Conclusion: By incorporating cell-level morphology, JWTH improves the performance and interpretability of AI-based biomarker detection and advances foundation models in digital pathology. Abstract: AI-based biomarkers can infer molecular features directly from hematoxylin & eosin (H&E) slides, yet most pathology foundation models (PFMs) rely on global patch-level embeddings and overlook cell-level morphology. We present a PFM model, JWTH (Joint-Weighted Token Hierarchy), which integrates large-scale self-supervised pretraining with cell-centric post-tuning and attention pooling to fuse local and global tokens. Across four tasks involving four biomarkers and eight cohorts, JWTH achieves up to 8.3% higher balanced accuracy and 1.2% average improvement over prior PFMs, advancing interpretable and robust AI-based biomarker detection in digital pathology.

[80] Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges

Adrian Azzarelli,Nantheera Anantrasirichai,David R Bull

Main category: cs.CV

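TL;DR: Proposes a deformable Gaussian Splatting method that splits the Gaussians and the deformation field into foreground and background and uses sparse masks to achieve high-quality dynamic 3D reconstruction, outperforming existing methods.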

Details

Motivation: Existing methods struggle to capture complex dynamic features under sparse camera configurations, which limits their use in filmmaking. Method: The canonical Gaussians and deformation field are separated into foreground and background, pre-trained with sparse masks, and given different modeled parameters to suit their dynamics. Result: The method achieves state-of-the-art qualitative and quantitative results on 3D and 2.5D datasets, with up to 3 higher PSNR at half the model size, and produces segmented reconstructions that include transparent and dynamic textures. Conclusion: The method significantly improves dynamic 3D reconstruction quality under sparse inputs without dense mask supervision and is suited to real filmmaking scenarios. Abstract: Deformable Gaussian Splatting (GS) accomplishes photorealistic dynamic 3-D reconstruction from dense multi-view video (MVV) by learning to deform a canonical GS representation. However, in filmmaking, tight budgets can result in sparse camera configurations, which limits state-of-the-art (SotA) methods when capturing complex dynamic features. To address this issue, we introduce an approach that splits the canonical Gaussians and deformation field into foreground and background components using a sparse set of masks for frames at t=0. Each representation is separately trained on different loss functions during canonical pre-training. Then, during dynamic training, different parameters are modeled for each deformation field following common filmmaking practices. The foreground stage contains diverse dynamic features, so changes in color, position and rotation are learned. The background, which contains film crew and equipment, is typically dimmer and less dynamic, so only changes in point position are learned. Experiments on 3-D and 2.5-D entertainment datasets show that our method produces SotA qualitative and quantitative results; up to 3 PSNR higher with half the model size on 3-D scenes. Unlike the SotA and without the need for dense mask supervision, our method also produces segmented dynamic reconstructions including transparent and dynamic textures. Code and video comparisons are available online: https://interims-git.github.io/
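To give a sense of scale for the PSNR gain reported in this entry, here is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). The abstract gives no unit for the gain; if it is in dB (an assumption), a 3 dB improvement corresponds to roughly halving the mean squared error.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    diff = np.asarray(reference, dtype=float) - np.asarray(reconstruction, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((8, 8))
baseline = psnr(reference, np.full((8, 8), 0.1))                  # MSE = 0.01
improved = psnr(reference, np.full((8, 8), 0.1 / np.sqrt(2.0)))   # MSE = 0.005
gain = improved - baseline
```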

[81] Another BRIXEL in the Wall: Towards Cheaper Dense Features

Alexander Lappe,Martin A. Giese

Main category: cs.CV

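TL;DR: Proposes BRIXEL, a simple knowledge distillation method in which the student learns to produce higher-resolution feature maps; at fixed resolution it significantly outperforms DINOv3 and produces teacher-like feature maps at lower computational cost.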

Details

Motivation: DINOv3 models need high-resolution inputs and large amounts of compute, which limits their efficiency and scalability in practice. Method: Knowledge distillation is used so that the student learns to reproduce its own feature maps at higher resolution, improving feature quality without increasing input resolution. Result: BRIXEL significantly outperforms the DINOv3 baselines on downstream tasks and produces feature maps similar to the teacher's at much lower computational cost. Conclusion: BRIXEL is an efficient, simple method that produces high-quality dense feature maps at low computational cost and is suited to practical applications. Abstract: Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the quadratic complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. Moreover, it is able to produce feature maps that are very similar to those of the teacher at a fraction of the computational cost. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.

[82] MUSE: Multi-Scale Dense Self-Distillation for Nucleus Detection and Classification

Zijiang Yang,Hanqing Chao,Bokai Zhao,Yelin Yang,Yunshuo Zhang,Dongmei Fu,Junping Zhang,Le Lu,Ke Yan,Dakai Jin,Minfeng Xu,Yun Bian,Hui Jiang

Main category: cs.CV

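TL;DR: Proposes MUSE, a self-supervised learning method for nucleus detection and classification in histopathology; its NuLo mechanism performs coordinate-guided local self-distillation without strict spatial alignment, improving fine-grained nucleus-level representations.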

Details

Motivation: Existing methods depend on large numbers of manually annotated nucleus-level labels and struggle to exploit large-scale unlabeled data for learning discriminative nucleus representations. Method: Proposes the MUSE framework, centered on NuLo (Nucleus-based Local self-distillation), which uses predicted nucleus positions for flexible local self-distillation; an encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy are designed to make the most of unlabeled pathology images. Result: Experiments on three widely used benchmarks show that MUSE surpasses existing supervised baselines and also outperforms generic pathology foundation models. Conclusion: MUSE addresses the core challenges of nucleus detection and classification in histopathology, reducing annotation dependence while improving nucleus-level representation. Abstract: Nucleus detection and classification (NDC) in histopathology analysis is a fundamental task that underpins a wide range of high-level pathology applications. However, existing methods heavily rely on labor-intensive nucleus-level annotations and struggle to fully exploit large-scale unlabeled data for learning discriminative nucleus representations. In this work, we propose MUSE (MUlti-scale denSE self-distillation), a novel self-supervised learning method tailored for NDC. At its core is NuLo (Nucleus-based Local self-distillation), a coordinate-guided mechanism that enables flexible local self-distillation based on predicted nucleus positions. By removing the need for strict spatial alignment between augmented views, NuLo allows critical cross-scale alignment, thus unlocking the capacity of models for fine-grained nucleus-level representation. To support MUSE, we design a simple yet effective encoder-decoder architecture and a large field-of-view semi-supervised fine-tuning strategy that together maximize the value of unlabeled pathology images. Extensive experiments on three widely used benchmarks demonstrate that MUSE effectively addresses the core challenges of histopathological NDC. The resulting models not only surpass state-of-the-art supervised baselines but also outperform generic pathology foundation models.

[83] Walk the Lines 2: Contour Tracking for Detailed Segmentation

André Peter Kelm,Max Braeschke,Emre Gülsoylu,Simone Frintrop

Main category: cs.CV

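TL;DR: This paper presents Walk the Lines 2 (WtL2), a contour tracking algorithm for detailed segmentation of ships in infrared images and of multiple object classes in RGB images. It extends the original WtL method, which handled only ship segmentation in color images. WtL2 replaces conventional non-maximum suppression (NMS) by refining the detected contour into a 1-pixel-wide closed shape, which yields a segmentable area in foreground-background scenarios. It outperforms recent contour-based methods at producing closed contours and preserving detail, reaches high peak IoU, and suits specialized applications that need high-quality segmentation.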

Details

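Motivation: The original Walk the Lines (WtL) method applies only to ship segmentation in color images, which limits its scope. Segmenting ships in infrared images and handling diverse objects in RGB images calls for a more general contour tracking algorithm with stronger detail preservation. In addition, conventional non-maximum suppression (NMS) falls short at producing precise closed contours, so a better post-processing step is needed. Method: WtL2 builds on the WtL framework, adapting its input, the object contour detector, to ships in infrared images and extending the algorithm to a wide range of RGB objects. Contour tracking progressively refines the detected object contour until it forms a 1-pixel-wide closed boundary, which is then binarized into a segmentable area. This procedure replaces conventional NMS and improves contour precision and detail preservation. Result: Experiments show that WtL2 performs well on both infrared ship segmentation and contour extraction for diverse RGB objects. When a closed contour is achieved, it outperforms the latest contour-based methods, with higher peak Intersection over Union (IoU) and rich detail. Conclusion: WtL2 extends WtL to infrared images and multiple RGB object classes, replaces conventional NMS with contour tracking, and improves segmentation accuracy and detail. It is an effective method for applications that require high-precision segmentation and may advance image segmentation in several niche areas. Abstract: This paper presents Walk the Lines 2 (WtL2), a unique contour tracking algorithm specifically adapted for detailed segmentation of infrared (IR) ships and various objects in RGB. This extends the original Walk the Lines (WtL) [12], which focused solely on detailed ship segmentation in color. These innovative WtLs can replace the standard non-maximum suppression (NMS) by using contour tracking to refine the object contour until a 1-pixel-wide closed shape can be binarized, forming a segmentable area in foreground-background scenarios. WtL2 broadens the application range of WtL beyond its original scope, adapting to IR and expanding to diverse objects within the RGB context. To achieve IR segmentation, we adapt its input, the object contour detector, to IR ships. In addition, the algorithm is enhanced to process a wide range of RGB objects, outperforming the latest generation of contour-based methods when achieving a closed object contour, offering high peak Intersection over Union (IoU) with impressive details. This positions WtL2 as a compelling method for specialized applications that require detailed segmentation or high-quality samples, potentially accelerating progress in several niche areas of image segmentation.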

[84] FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction

Jiang Lin,Xinyu Chen,Song Wu,Zhiqiu Zhang,Jizhi Zhang,Ye Wang,Qiang Tang,Qian Wang,Jian Yang,Zili Yi

Main category: cs.CV

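TL;DR: Proposes FreeControl, a training-free framework for semantic structural control in diffusion models that achieves efficient, high-quality controllable image generation through one-step attention extraction and Latent-Condition Decoupling.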

Details

Motivation: Existing methods such as ControlNet rely on handcrafted condition maps and retraining, which limits flexibility; inversion-based approaches align better but incur high inference cost. A training-free, efficient method for structural control is needed. Method: FreeControl performs one-step attention extraction at a single, optimally chosen key timestep and reuses that attention throughout denoising. It introduces Latent-Condition Decoupling (LCD), which separates the key timestep from the noised latent used in attention extraction, improving attention quality and eliminating structural artifacts. It also supports compositional control via reference images assembled from multiple sources. Result: Training-free, inference-efficient generation (approximately 5% additional cost) that is structurally and semantically aligned; the method outperforms existing approaches in visual coherence, prompt alignment, and compositional control. Conclusion: FreeControl offers a new paradigm for test-time control that is compatible with modern diffusion models, generates structurally clear and semantically consistent images directly from raw images, and supports intuitive scene layout design. Abstract: Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present FreeControl, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs one-step attention extraction from a single, optimally chosen key timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce Latent-Condition Decoupling (LCD): a principled separation of the key timestep and the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources, enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control, enabling structurally and semantically aligned, visually coherent generation directly from raw images, with the flexibility for intuitive compositional design and compatibility with modern diffusion models at approximately 5 percent additional cost.

[85] 4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos

Mengqi Guo,Bo Xu,Yanyan Li,Gim Hee Lee

Main category: cs.CV

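TL;DR: This paper proposes 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. On real-world dynamic scenes it achieves up to 1.8 dB higher PSNR than existing methods while reducing computational cost by 5x.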

Details

Motivation: Novel view synthesis of dynamic scenes from monocular videos with unknown camera poses remains a hard problem in computer vision and graphics; existing NeRF and 3DGS methods struggle with dynamic content and rely on pre-computed poses. Method: 3D foundation models provide initial pose and geometry estimates, followed by motion-aware refinement. A motion-aware bundle adjustment (MA-BA) module combines transformer-based learned priors with SAM2 for dynamic object segmentation, and a Motion-Aware Gaussian Splatting (MA-GS) representation models dynamic motion with control points, a deformation field MLP, and linear blend skinning. Result: Experiments on real-world dynamic datasets show up to 1.8 dB PSNR improvement over state-of-the-art methods, particularly in scenes with large dynamic objects, together with a 5x reduction in computational requirements. Conclusion: 4D3R addresses novel view synthesis for monocular dynamic scenes with unknown poses, improving efficiency and robustness while maintaining high-quality reconstruction. Abstract: Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.
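
The abstract above names linear blend skinning as the way control-point motion is propagated to the scene. As a rough illustration of that general technique only (not the authors' implementation), the sketch below moves one point by a weighted sum of per-control-point rigid transforms. The function name, the hard-coded transforms, and the weights are all assumptions; in 4D3R the transforms would presumably come from the deformation field MLP.

```python
def blend_skinning(point, transforms, weights):
    """Linear blend skinning for a single 3D point.

    point:      (x, y, z)
    transforms: list of (R, t) pairs, R a 3x3 nested list, t a 3-vector
    weights:    one weight per transform, assumed to sum to 1
    """
    out = [0.0, 0.0, 0.0]
    for (R, t), w in zip(transforms, weights):
        for i in range(3):
            # Apply the rigid transform of this control point, then blend.
            moved = sum(R[i][j] * point[j] for j in range(3)) + t[i]
            out[i] += w * moved
    return out


# Hypothetical example: two control points, one static, one shifted along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
transforms = [(identity, [0.0, 0.0, 0.0]), (identity, [2.0, 0.0, 0.0])]
print(blend_skinning((1.0, 1.0, 1.0), transforms, [0.5, 0.5]))  # [2.0, 1.0, 1.0]
```

A point influenced equally by both control points moves by half of the second control point's translation, which is the smooth interpolation that makes a sparse set of control points sufficient.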

[86] ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining

Xincheng Yao,Yan Luo,Zefeng Qian,Chongyang Zhang

Main category: cs.CV

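TL;DR: Proposes a pretraining framework for representation learning designed specifically for industrial anomaly detection. Using angle- and norm-oriented contrastive losses on the large-scale anomaly detection dataset RealIAD, it learns robust and discriminative features, mitigating the objective mismatch and distribution shift between ImageNet pretraining and anomaly detection.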

Details

Motivation: Features pretrained on ImageNet suffer from an objective mismatch and a distribution shift when used for anomaly detection, which limits performance; pretrained representations designed for anomaly detection are needed. Method: Angle- and norm-oriented contrastive losses are designed around the goal of anomaly detection, pretraining is performed on the large-scale anomaly detection dataset RealIAD, and residual features are used to improve generalization across datasets. Result: Across five anomaly detection datasets and five backbones, five embedding-based methods all improve when their original features are replaced with the proposed pretrained representations, supporting the effectiveness and generality of the approach. Conclusion: A pretraining framework designed for anomaly detection yields better feature representations and improves downstream anomaly detection performance. Abstract: The current mainstream and state-of-the-art anomaly detection (AD) methods are substantially established on pretrained feature networks yielded by ImageNet pretraining. However, regardless of supervised or self-supervised pretraining, the pretraining process on ImageNet does not match the goal of anomaly detection (i.e., pretraining on natural images doesn't aim to distinguish between normal and abnormal). Moreover, natural images and industrial image data in AD scenarios typically exhibit a distribution shift. The two issues can cause ImageNet-pretrained features to be suboptimal for AD tasks. To further promote the development of the AD field, pretrained representations designed specifically for AD tasks are much needed and very valuable. To this end, we propose a novel AD representation learning framework specially designed for learning robust and discriminative pretrained representations for industrial anomaly detection. Specifically, closely surrounding the goal of anomaly detection (i.e., focus on discrepancies between normals and anomalies), we propose angle- and norm-oriented contrastive losses to maximize the angle size and norm difference between normal and abnormal features simultaneously. To avoid the distribution shift from natural images to AD images, our pretraining is performed on a large-scale AD dataset, RealIAD. To further alleviate the potential shift between pretraining data and downstream AD datasets, we learn the pretrained AD representations based on the class-generalizable representation, residual features. For evaluation, based on five embedding-based AD methods, we simply replace their original features with our pretrained representations. Extensive experiments on five AD datasets and five backbones consistently show the superiority of our pretrained features. The code is available at https://github.com/xcyao00/ADPretrain.

[87] Accurate online action and gesture recognition system using detectors and Deep SPD Siamese Networks

Mohamed Sanim Akremi,Rim Slama,Hedi Tabia

Main category: cs.CV

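TL;DR: This paper proposes an online recognition system for skeleton sequences based on SPD matrix representations and a Siamese network. It consists of a detector and a classifier, continuously identifies motion intervals in unsegmented sequences and classifies them, and outperforms existing methods on hand gesture and body action recognition.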

Details

Motivation: Most existing skeleton-based action recognition work focuses on segment-based recognition and does not suit practical online scenarios, so a method that handles continuous streaming skeleton data is needed. Method: An online recognition system composed of a detector and a classifier. Skeleton data are modeled statistically with SPD matrices, and a Siamese network learns their semantic similarity, enabling both detection and classification of motion intervals. Result: Extensive experiments on several hand gesture and body action recognition benchmarks show that the method outperforms state-of-the-art methods in most cases. Conclusion: The proposed framework based on SPD matrices and a Siamese network is flexible and accurate, and detects and recognizes motions in continuous streams online. Abstract: Online continuous motion recognition is a hot topic of research since it is more practical in real life application cases. Recently, skeleton-based approaches have become increasingly popular, demonstrating the power of using such 3D temporal data. However, most of these works have focused on segment-based recognition and are not suitable for online scenarios. In this paper, we propose an online recognition system for skeleton sequence streaming composed of two main components: a detector and a classifier, which use a Semi-Positive Definite (SPD) matrix representation and a Siamese network. The powerful statistical representations for the skeletal data given by the SPD matrices and the learning of their semantic similarity by the Siamese network enable the detector to predict time intervals of the motions throughout an unsegmented sequence. In addition, they ensure the classifier's capability to recognize the motion in each predicted interval. The proposed detector is flexible and able to identify the kinetic state continuously. We conduct extensive experiments on both hand gesture and body action recognition benchmarks to prove the accuracy of our online recognition system, which in most cases outperforms state-of-the-art performances.

[88] Automatic segmentation of colorectal liver metastases for ultrasound-based navigated resection

Tiziano Natali,Karin A. Olthof,Niels F. M. Kok,Koert F. D. Kuhlmann,Theo J. M. Ruers,Matteo Fusaglia

Main category: cs.CV

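TL;DR: This study proposes an automatic segmentation method based on a cropped 3D U-Net for colorectal liver metastases (CRLM) in intraoperative 3D ultrasound (iUS), reaching near expert-level accuracy at near real-time speed.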

Details

Motivation: Accurate intraoperative delineation of CRLM boundaries is crucial for achieving negative resection margins, but conventional iUS is hard to segment precisely because of low contrast, noise, and operator dependency; automation is needed to improve navigation precision and efficiency. Method: A 3D U-Net built with the nnU-Net framework, comparing full-volume input against input cropped around the tumor region. The model was trained and evaluated on data from 85 patients, and the best variant was integrated into 3D Slicer for real-time intraoperative navigation. Result: The cropped-volume model significantly outperformed the full-volume model (AUC-ROC 0.898 vs 0.718), with median DSC 0.74, HDist. 17.1 mm, and recall 0.79. Execution takes about 1 minute, roughly 4x faster than semi-automatic segmentation. Prospective intraoperative testing confirmed robust performance and clinical usability. Conclusion: A cropped 3D U-Net provides reliable, near real-time automatic segmentation of CRLM in iUS, substantially reducing manual workload and procedure time and supporting efficient, registration-free ultrasound-navigated liver resection. Abstract: Introduction: Accurate intraoperative delineation of colorectal liver metastases (CRLM) is crucial for achieving negative resection margins but remains challenging using intraoperative ultrasound (iUS) due to low contrast, noise, and operator dependency. Automated segmentation could enhance precision and efficiency in ultrasound-based navigation workflows. Methods: Eighty-five tracked 3D iUS volumes from 85 CRLM patients were used to train and evaluate a 3D U-Net implemented via the nnU-Net framework. Two variants were compared: one trained on full iUS volumes and another on cropped regions around tumors. Segmentation accuracy was assessed using Dice Similarity Coefficient (DSC), Hausdorff Distance (HDist.), and Relative Volume Difference (RVD) on retrospective and prospective datasets. The workflow was integrated into 3D Slicer for real-time intraoperative use. Results: The cropped-volume model significantly outperformed the full-volume model across all metrics (AUC-ROC = 0.898 vs 0.718). It achieved median DSC = 0.74, recall = 0.79, and HDist. = 17.1 mm, comparable to semi-automatic segmentation but with ~4x faster execution (~1 min). Prospective intraoperative testing confirmed robust and consistent performance, with clinically acceptable accuracy for real-time surgical guidance. Conclusion: Automatic 3D segmentation of CRLM in iUS using a cropped 3D U-Net provides reliable, near real-time results with minimal operator input. The method enables efficient, registration-free ultrasound-based navigation for hepatic surgery, approaching expert-level accuracy while substantially reducing manual workload and procedure time.
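
Two of the metrics named in the abstract, the Dice Similarity Coefficient and the Relative Volume Difference, are simple to state on binary masks. The sketch below is a minimal illustration on flattened 0/1 masks, not the study's evaluation code. The function names are hypothetical, and the sign convention chosen for RVD (positive means over-segmentation) is an assumption, since the abstract does not specify one.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient of two binary masks (flat 0/1 sequences)."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_volume_difference(pred, truth):
    """(|pred| - |truth|) / |truth|; assumes the reference mask is non-empty."""
    return (sum(pred) - sum(truth)) / sum(truth)


# Toy masks: same volume, shifted by one voxel.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
print(dice(pred, truth))                        # 2*2 / (3+3)
print(relative_volume_difference(pred, truth))  # 0.0
```

The toy case shows why both metrics are reported together: the volumes match exactly, so RVD is zero, while the one-voxel shift still lowers the overlap score.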

[89] OregairuChar: A Benchmark Dataset for Character Appearance Frequency Analysis in My Teen Romantic Comedy SNAFU

Qi Sun,Dingju Zhou,Lina Zhang

Main category: cs.CV

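TL;DR: OregairuChar is a benchmark dataset for analyzing character appearance frequency in the third season of the anime My Teen Romantic Comedy SNAFU. It contains 1600 frames and 2860 annotated bounding boxes covering 11 main characters, supporting fine-grained study of character prominence and narrative dynamics.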

Details

Motivation: Understanding narrative structure and character importance in anime requires a high-quality dataset focused on how often characters appear on screen, one that copes with occlusion, pose variation, and inter-character similarity. Method: The OregairuChar dataset was built from 1600 manually selected frames annotated with 2860 bounding boxes across 11 main characters. Several object detection models are benchmarked on it, and their predictions are used for episode-level analysis of character appearance frequency. Result: The analysis quantifies character appearance frequency, reveals how character prominence changes as the story progresses, and supports the dataset's usefulness for computational narrative analysis. Conclusion: OregairuChar is a valuable resource for studying character-centric storytelling and computational narrative dynamics in stylized media, and advances appearance-frequency-based anime analysis. Abstract: The analysis of character appearance frequency is essential for understanding narrative structure, character prominence, and story progression in anime. In this work, we introduce OregairuChar, a benchmark dataset designed for appearance frequency analysis in the anime series My Teen Romantic Comedy SNAFU. The dataset comprises 1600 manually selected frames from the third season, annotated with 2860 bounding boxes across 11 main characters. OregairuChar captures diverse visual challenges, including occlusion, pose variation, and inter-character similarity, providing a realistic basis for appearance-based studies. To enable quantitative research, we benchmark several object detection models on the dataset and leverage their predictions for fine-grained, episode-level analysis of character presence over time. This approach reveals patterns of character prominence and their evolution within the narrative. By emphasizing appearance frequency, OregairuChar serves as a valuable resource for exploring computational narrative dynamics and character-centric storytelling in stylized media.
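
The episode-level analysis described above can be pictured as an aggregation over detector outputs. The sketch below is one plausible way to compute it, under stated assumptions: detections arrive as hypothetical `(episode, frame_id, character)` tuples, frequency is defined as the share of an episode's sampled frames in which a character is detected at least once, and frames without any detection are not counted. None of these choices is confirmed by the abstract.

```python
from collections import defaultdict


def appearance_frequency(detections):
    """Return {episode: {character: share of that episode's frames}}.

    detections: iterable of (episode, frame_id, character) tuples.
    Only frames with at least one detection are known to this function.
    """
    frames = defaultdict(set)  # episode -> frame ids seen
    seen = defaultdict(set)    # (episode, character) -> frame ids with that character
    for episode, frame_id, character in detections:
        frames[episode].add(frame_id)
        seen[(episode, character)].add(frame_id)

    result = defaultdict(dict)
    for (episode, character), frame_ids in seen.items():
        result[episode][character] = len(frame_ids) / len(frames[episode])
    return dict(result)


# Placeholder labels, not actual dataset annotations.
detections = [
    (1, 0, "char_a"), (1, 0, "char_b"),
    (1, 1, "char_a"),
    (2, 0, "char_b"),
]
freq = appearance_frequency(detections)
print(freq[1])  # {'char_a': 1.0, 'char_b': 0.5}
```

Using sets of frame ids means a character detected twice in one frame (for example, a duplicate box) still counts once for that frame.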

[90] DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong,Chenxiao Zhao,ChengLin Zhu,Weiheng Lu,Guohai Xu,Xing Yu

Main category: cs.CV

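TL;DR: This paper introduces DeepEyesV2 and examines how to build an agentic multimodal model that actively invokes external tools, from the perspectives of data construction, training methods, and model evaluation.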

Details

Motivation: Direct reinforcement learning alone fails to induce robust tool-use behavior, so new training strategies and evaluation benchmarks are needed to advance agentic multimodal models. Method: A two-stage training pipeline: a cold-start stage establishes tool-use patterns, and a reinforcement learning stage further refines tool invocation. A diverse training dataset is curated that includes examples where tool use is beneficial, and RealX-Bench is introduced as a comprehensive evaluation benchmark. Result: DeepEyesV2 performs well on RealX-Bench and other benchmarks. It shows task-adaptive tool invocation, choosing image operations or numerical computations depending on the task type, and reinforcement learning enables complex tool combinations and context-aware tool selection. Conclusion: The two-stage training strategy improves the tool-use ability of multimodal models, RealX-Bench provides a way to evaluate real-world multimodal reasoning, and the study offers practical guidance for developing agentic multimodal models. Abstract: Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.

[91] What's on Your Plate? Inferring Chinese Cuisine Intake from Wearable IMUs

Jiaxi Yin,Pengcheng Wang,Han Ding,Fei Wang

Main category: cs.CV

TL;DR: Proposes CuisineSense, a system that combines motion data from a smartwatch and smart glasses to classify Chinese food types and monitor dietary intake.

Details

Motivation: Existing dietary monitoring methods suffer from recall bias, privacy concerns, or limited coverage of food types, and in particular do not handle the diversity of Chinese cuisine. Method: Using hand motion from a smartwatch and head dynamics from smart glasses, a two-stage detection pipeline is designed: the first stage identifies eating states, and the second performs fine-grained food type classification. Result: A dataset of 27.5 hours of IMU recordings covering 11 food categories and 10 participants was constructed; experiments show high accuracy in both eating state detection and food classification. Conclusion: CuisineSense offers a practical solution for unobtrusive, wearable-based dietary monitoring that is well suited to the complexity of Chinese cuisine. Abstract: Accurate food intake detection is vital for dietary monitoring and chronic disease prevention. Traditional self-report methods are prone to recall bias, while camera-based approaches raise concerns about privacy. Furthermore, existing wearable-based methods primarily focus on a limited number of food types, such as hamburgers and pizza, failing to address the vast diversity of Chinese cuisine. To bridge this gap, we propose CuisineSense, a system that classifies Chinese food types by integrating hand motion cues from a smartwatch with head dynamics from smart glasses. To filter out irrelevant daily activities, we design a two-stage detection pipeline. The first stage identifies eating states by distinguishing characteristic temporal patterns from non-eating behaviors. The second stage then conducts fine-grained food type recognition based on the motions captured during food intake. To evaluate CuisineSense, we construct a dataset comprising 27.5 hours of IMU recordings across 11 food categories and 10 participants. Experiments demonstrate that CuisineSense achieves high accuracy in both eating state detection and food classification, offering a practical solution for unobtrusive, wearable-based dietary monitoring. The system code is publicly available at https://github.com/joeeeeyin/CuisineSense.git.

[92] Cross-domain EEG-based Emotion Recognition with Contrastive Learning

Rui Yan,Yibo Li,Han Ding,Fei Wang

Main category: cs.CV

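TL;DR: This paper proposes EmotionCLIP, which reformulates EEG-based emotion recognition as an EEG-text matching task and uses the SST-LegoViT backbone to extract multi-scale spatial, spectral, and temporal features. On the SEED and SEED-IV datasets it achieves higher cross-subject and cross-time accuracy than existing models.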

Details

Motivation: EEG-based emotion recognition is vital for affective computing but suffers from insufficient feature utilization and poor cross-domain generalization, so a more robust multimodal approach is needed. Method: The EmotionCLIP framework adopts the contrastive learning idea of CLIP to pair EEG signals with text; SST-LegoViT serves as the backbone, combining multi-scale convolution and Transformer modules to capture spatial, spectral, and temporal features. Result: On SEED and SEED-IV, EmotionCLIP reaches cross-subject accuracies of 88.69% and 73.50% and cross-time accuracies of 88.46% and 77.54%, outperforming existing models. Conclusion: EEG-text matching based on multimodal contrastive learning improves the robustness and cross-domain generalization of EEG emotion recognition. Abstract: Electroencephalogram (EEG)-based emotion recognition is vital for affective computing but faces challenges in feature utilization and cross-domain generalization. This work introduces EmotionCLIP, which reformulates recognition as an EEG-text matching task within the CLIP framework. A tailored backbone, SST-LegoViT, captures spatial, spectral, and temporal features using multi-scale convolution and Transformer modules. Experiments on SEED and SEED-IV datasets show superior cross-subject accuracies of 88.69% and 73.50%, and cross-time accuracies of 88.46% and 77.54%, outperforming existing models. Results demonstrate the effectiveness of multimodal contrastive learning for robust EEG emotion recognition.
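
Recasting recognition as EEG-text matching means that, at inference time, the predicted emotion is the label whose text embedding lies closest to the EEG embedding. The sketch below illustrates only that matching step, with cosine similarity as used in CLIP-style models. The embeddings are made-up 2D vectors and the label set is illustrative; the real encoders (SST-LegoViT and a text encoder) are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_emotion(eeg_embedding, text_embeddings):
    """Pick the label whose text embedding is most similar to the EEG embedding."""
    scores = {label: cosine(eeg_embedding, vec) for label, vec in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings standing in for encoder outputs.
text_embeddings = {
    "positive": [1.0, 0.0],
    "neutral": [0.0, 1.0],
    "negative": [-1.0, 0.0],
}
best, scores = match_emotion([0.9, 0.1], text_embeddings)
print(best)  # positive
```

One consequence of this formulation is that the label set lives in the text prompts rather than in a fixed classification head, so changing it does not require a new output layer.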

[93] LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

Zhenyu Yang,Kairui Zhang,Yuhang Hu,Bing Wang,Shengsheng Qian,Bin Wen,Fan Yang,Tingting Gao,Weiming Dong,Changsheng Xu

Main category: cs.CV

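TL;DR: This paper proposes LiveStar, an online Video-LLM for real-time video understanding that delivers always-on proactive responses through adaptive streaming decoding, addressing the shortcomings of earlier models in real-time responsiveness and narrative coherence.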

Details

Motivation: Existing online Video-LLMs struggle to process continuous frame-by-frame inputs while also determining the best time to respond, which hurts real-time responsiveness and narrative coherence. Method: LiveStar combines an incremental video-language alignment training strategy, a response-silence decoding framework that determines response timing with a single forward pass, and memory-aware acceleration through peak-end memory compression and a streaming key-value cache. Result: Across three benchmarks, LiveStar improves semantic correctness by 19.5% on average and reduces timing difference by 18.1% relative to existing online Video-LLMs, and improves FPS by 12.0% across the five OmniStar tasks. Conclusion: LiveStar achieves a better balance of responsiveness, accuracy, and efficiency for long online video understanding and, together with the new OmniStar dataset, advances the field. Abstract: Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct an OmniStar dataset, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.

[94] Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation

Matteo Bastico,David Ryckelynck,Laurent Corté,Yannick Tillier,Etienne Decencière

Main category: cs.CV

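TL;DR: This paper proposes SNC, a new evaluation metric for point cloud generative models, and improves existing Chamfer Distance-based metrics by adding sample alignment and the density-aware DCD. It also proposes the transformer-based Diffusion Point Transformer architecture, which sets a new state of the art in generation quality on ShapeNet.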

Details

Motivation: Existing evaluation metrics for point cloud generation, such as Chamfer Distance, are not robust to defects and fail to reflect geometric fidelity and local shape consistency, so more reliable evaluation is needed. Method: Sample alignment is introduced before distance computation and CD is replaced with the Density-Aware Chamfer Distance (DCD); a new metric, Surface Normal Concordance (SNC), is proposed based on the similarity of estimated surface normals; and the Diffusion Point Transformer generative model is designed around serialized patch attention. Result: Experiments on ShapeNet show that the proposed metrics are more robust and consistent, and that the proposed model generates higher-quality point clouds than previous methods, reaching the state of the art. Conclusion: Improving both the evaluation metrics and the generative model raises the quality of point cloud generation and the reliability of its evaluation, giving future work better evaluation standards and a stronger architecture. Abstract: As 3D point clouds become a cornerstone of modern technology, the need for sophisticated generative models and reliable evaluation metrics has grown exponentially. In this work, we first show that some commonly used metrics for evaluating generated point clouds, particularly those based on Chamfer Distance (CD), lack robustness against defects and fail to capture geometric fidelity and local shape consistency when used as quality indicators. We further show that introducing sample alignment prior to distance calculation and replacing CD with Density-Aware Chamfer Distance (DCD) are simple yet essential steps to ensure the consistency and robustness of point cloud generative model evaluation metrics. While existing metrics primarily focus on directly comparing 3D Euclidean coordinates, we present a novel metric, named Surface Normal Concordance (SNC), which approximates surface similarity by comparing estimated point normals. This new metric, when combined with traditional ones, provides a more comprehensive evaluation of the quality of generated samples. Finally, leveraging recent advancements in transformer-based models for point cloud analysis, such as serialized patch attention, we propose a new architecture for generating high-fidelity 3D structures, the Diffusion Point Transformer. We perform extensive experiments and comparisons on the ShapeNet dataset, showing that our model outperforms previous solutions, particularly in terms of quality of generated point clouds, achieving a new state of the art. Code available at https://github.com/matteo-bastico/DiffusionPointTransformer.
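
For readers unfamiliar with the baseline metric discussed above, the sketch below computes a symmetric Chamfer Distance between two small point sets by brute force. It uses one common variant (squared Euclidean distances, averaged over each set); other variants exist, and the abstract does not say which one the paper evaluates. DCD and SNC are not implemented here, since their exact definitions are not given in this summary.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty point sets.

    Assumed variant: mean squared nearest-neighbor distance, summed over
    both directions. Brute force, O(len(a) * len(b)).
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
# Same shape plus one stray point far from the surface.
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(chamfer_distance(reference, reference))  # 0.0
print(chamfer_distance(reference, generated))  # 16/3
```

Note that only the generated-to-reference direction registers the stray point; the reference-to-generated term stays at zero, which shows how each direction of the metric sees a different kind of error.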

[95] $\mathbf{S^2LM}$: Towards Semantic Steganography via Large Language Models

Huanqi Wu,Huangbiao Xu,Runfeng Xie,Jiaxin Cai,Kaixin Zhang,Xiao Ke

Main category: cs.CV

TL;DR: This paper proposes Sentence-to-Image Steganography, which uses large language models (LLMs) to embed semantically rich text into images, and establishes the Invisible Text (IVT) benchmark for evaluation.

Details

Motivation: Traditional steganography struggles to embed semantically rich, sentence-level information into carriers, while the AIGC era places higher demands on steganographic capacity, so new techniques that handle high-level semantics are needed. Method: Proposes S²LM (Semantic Steganographic Language Model), which embeds high-level text such as sentences or paragraphs into images through a new pipeline in which the LLM is involved throughout; also builds the Invisible Text (IVT) benchmark, a diverse set of sentence-level texts for evaluation. Result: Quantitative and qualitative experiments show that S²LM achieves semantically rich steganography, goes beyond traditional bit-level methods, and unlocks new semantic steganographic capabilities for LLMs. Conclusion: S²LM embeds natural-language-level semantic information into images, a key step for semantic steganography that points to a new direction for combining LLMs with steganography. Abstract: Although steganography has made significant advancements in recent years, it still struggles to embed semantically rich, sentence-level information into carriers. However, in the era of AIGC, the capacity of steganography is more critical than ever. In this work, we present Sentence-to-Image Steganography, an instance of Semantic Steganography, a novel task that enables the hiding of arbitrary sentence-level messages within a cover image. Furthermore, we establish a benchmark named Invisible Text (IVT), comprising a diverse set of sentence-level texts as secret messages for evaluation. Finally, we present $\mathbf{S^2LM}$: Semantic Steganographic Language Model, which utilizes large language models (LLMs) to embed high-level textual information, such as sentences or even paragraphs, into images. Unlike traditional bit-level counterparts, $\mathrm{S^2LM}$ enables the integration of semantically rich content through a newly designed pipeline in which the LLM is involved throughout the entire process. Both quantitative and qualitative experiments demonstrate that our method effectively unlocks new semantic steganographic capabilities for LLMs. The source code will be released soon.

[96] Canonical Space Representation for 4D Panoptic Segmentation of Articulated Objects

Manuel Gomes,Bogdan Raducanu,Miguel Oliveira

Main category: cs.CV

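TL;DR: This paper introduces the Artic4D dataset and the CanonSeg4D framework to improve 4D panoptic segmentation of dynamic articulated objects. By modeling temporal dynamics and aligning parts in a canonical space, it achieves higher segmentation accuracy than existing methods in complex scenarios.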

Details

Motivation: Most existing methods ignore the temporal dynamics of articulated objects, and there is no benchmark dataset suited to 4D panoptic segmentation, which holds the field back. Method: The Artic4D dataset is built from PartNet Mobility with 4D panoptic annotations and articulation parameters, and the CanonSeg4D framework estimates per-frame offsets to a canonical space to achieve part-level segmentation and alignment that stay consistent across frames. Result: Experiments on Artic4D show that CanonSeg4D achieves higher panoptic segmentation accuracy than state-of-the-art methods in complex scenarios, supporting the value of temporal modeling and canonical alignment. Conclusion: Temporal information and canonical space alignment improve the perception of articulated objects, providing new data and a new method for 4D articulated object understanding. Abstract: Articulated object perception presents significant challenges in computer vision, particularly because most existing methods ignore temporal dynamics despite the inherently dynamic nature of such objects. The use of 4D temporal data has not been thoroughly explored in articulated object perception and remains unexamined for panoptic segmentation. The lack of a benchmark dataset further hinders this field. To this end, we introduce Artic4D as a new dataset derived from PartNet Mobility and augmented with synthetic sensor data, featuring 4D panoptic annotations and articulation parameters. Building on this dataset, we propose CanonSeg4D, a novel 4D panoptic segmentation framework. This approach explicitly estimates per-frame offsets mapping observed object parts to a learned canonical space, thereby enhancing part-level segmentation. The framework employs this canonical representation to achieve consistent alignment of object parts across sequential frames. Comprehensive experiments on Artic4D demonstrate that the proposed CanonSeg4D outperforms state-of-the-art approaches in panoptic segmentation accuracy in more complex scenarios. These findings highlight the effectiveness of temporal modeling and canonical alignment in dynamic object understanding, and pave the way for future advances in 4D articulated object perception.

[97] Dense Motion Captioning

Shiyao Xu,Benedetta Liberatori,Gül Varol,Paolo Rota

Main category: cs.CV

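TL;DR: This paper introduces Dense Motion Captioning, a task that temporally localizes and captions actions within 3D human motion sequences, and releases CompMo, the first large-scale complex motion dataset, together with the DEMO model.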

Details

Motivation: Existing work focuses mostly on text-to-motion generation, leaving motion understanding behind; current datasets lack detailed temporal annotations and contain short sequences, which makes complex motion understanding hard to support. Method: The CompMo dataset is built with 60,000 multi-action motion sequences, each annotated with precise temporal boundaries; the DEMO model combines a large language model with a lightweight motion adapter to generate dense, temporally grounded captions for 3D motion sequences. Result: Experiments show that DEMO substantially outperforms existing methods on CompMo and on adapted benchmarks, localizing and describing actions in complex 3D motion sequences. Conclusion: DEMO establishes a strong baseline for 3D motion understanding and captioning, and CompMo fills the gap in complex, finely annotated motion data. Abstract: Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of multiple actions ranging from at least two to ten, accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.
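
Temporal localization of captions is usually judged by how well a predicted interval overlaps the annotated one. The sketch below computes temporal Intersection over Union, a measure widely used in dense captioning of video. Whether the paper evaluates with this measure is an assumption; the abstract does not name its metrics, and the interval values are invented.

```python
def temporal_iou(pred, truth):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predicted and annotated extents of one action.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # 2 s overlap over a 6 s union
print(temporal_iou((0.0, 1.0), (2.0, 3.0)))  # 0.0, disjoint intervals
```

A caption would then typically count as correctly localized only when its interval exceeds some IoU threshold with the annotation, which is why precise temporal boundaries in the dataset matter.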

[98] PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

Zehui Feng,Tian Qiu,Tong Wu,Junxuan Li,Huayuan Xu,Ting Han

Main category: cs.CV

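TL;DR: Proposes PreResQ-R1, a preference-response disentangled reinforcement learning framework that unifies absolute score regression with relative ranking consistency and achieves state-of-the-art performance in image and video quality assessment.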

Details

Motivation: Existing visual quality assessment methods rely on supervised fine-tuning or rank-only objectives, which leads to shallow reasoning, poor score calibration, and weak cross-domain generalization. Method: PreResQ-R1 uses a dual-branch reward that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO); a global-temporal and local-spatial data flow strategy extends it to video quality assessment. Result: After fine-tuning on 6K images and 28K videos, PreResQ-R1 reaches the state of the art on 10 IQA and 5 VQA benchmarks, improving SRCC and PLCC on the IQA task by 5.30% and 2.15% respectively, and produces interpretable, human-aligned reasoning traces. Conclusion: By disentangling preference and response within a reinforcement learning framework, PreResQ-R1 achieves finer-grained, more stable, and more interpretable quality reasoning, with strong performance and generalization in image and video quality assessment. Abstract: Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, with margins of 5.30% and 2.15% on the IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.
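
The two reported metrics capture the same split the method is built around: SRCC (Spearman rank correlation) measures whether predictions order the samples correctly, while PLCC (Pearson linear correlation) measures how well the scores track the human ratings linearly. The sketch below is a minimal pure-Python version for illustration, with made-up scores; it is not the paper's evaluation code, and published protocols may add steps not shown here.

```python
def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predictions with the right order but a badly scaled top score.
predicted = [1.0, 2.0, 3.0, 10.0]
human = [1.0, 2.0, 3.0, 4.0]
print(spearman(predicted, human))  # perfect ranking
print(pearson(predicted, human))   # lower: scores are not linearly calibrated
```

In the toy case the ranking is perfect while the linear fit is not, which is the gap between relative ranking consistency and absolute score calibration that a rank-only objective leaves open.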

[99] AI Assisted AR Assembly: Object Recognition and Computer Vision for Augmented Reality Assisted Assembly

Alexander Htet Kyaw,Haotian Ma,Sasa Zivkovic,Jenny Sabin

Main category: cs.CV

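TL;DR: Proposes a deep-learning-based augmented reality (AR) assisted assembly workflow that uses object recognition to identify components automatically and provide real-time guidance.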

Details

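Motivation: Reduce the need to manually search for, sort, or label components, improving assembly efficiency and accuracy. Method: Deep-learning-driven object recognition is used to display, for each assembly step, a bounding box around the corresponding components in physical space together with their placement location. Result: The system links assembly instructions to the real-time location of the relevant components and was applied successfully in a case study on assembling LEGO sculptures. Conclusion: The work demonstrates the feasibility of using object recognition for AR-assisted assembly, with potential for application in real manufacturing and assembly settings. Abstract: We present an AI-assisted Augmented Reality assembly workflow that uses deep learning-based object recognition to identify different assembly components and display step-by-step instructions. For each assembly step, the system displays a bounding box around the corresponding components in the physical space, and where the component should be placed. By connecting assembly instructions with the real-time location of relevant components, the system eliminates the need for manual searching, sorting, or labeling of different components before each assembly. To demonstrate the feasibility of using object recognition for AR-assisted assembly, we highlight a case study involving the assembly of LEGO sculptures.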

[100] PALM: A Dataset and Baseline for Learning Multi-subject Hand Prior

Zicong Fan,Edoardo Remelli,David Dimond,Fadime Sener,Liuhao Ge,Bugra Tekin,Cem Keskin,Shreyas Hampali

Main category: cs.CV

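TL;DR: Introduces the PALM dataset, comprising 13k high-quality hand scans from 263 subjects and 90k multi-view images, to advance research on personalized hand avatar modeling from a single image.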

Details

Motivation: Creating high-quality personalized hand avatars from images is difficult because of complex geometry, appearance variation, uncertain lighting, and limited views, and no existing dataset jointly provides accurate 3D geometry, high-resolution multi-view images, and a diverse population. Method: The authors build the large-scale PALM dataset and propose PALM-Net, a baseline that learns a multi-subject prior over hand geometry and material via physically based inverse rendering, enabling realistic, relightable personalized hand avatars from a single image. Result: PALM covers rich variation in skin tone, age, and geometry; PALM-Net achieves high-quality, relightable hand avatar personalization from a single image. Conclusion: The PALM dataset and its baseline provide a valuable real-world resource for hand modeling and related research and advance personalized hand avatar generation. Abstract: The ability to grasp objects, signal with gestures, and share emotion through touch all stem from the unique capabilities of human hands. Yet creating high-quality personalized hand avatars from images remains challenging due to complex geometry, appearance, and articulation, particularly under unconstrained lighting and limited views. Progress has also been limited by the lack of datasets that jointly provide accurate 3D geometry, high-resolution multiview imagery, and a diverse population of subjects. To address this, we present PALM, a large-scale dataset comprising 13k high-quality hand scans from 263 subjects and 90k multi-view images, capturing rich variation in skin tone, age, and geometry. To show its utility, we present a baseline PALM-Net, a multi-subject prior over hand geometry and material properties learned via physically based inverse rendering, enabling realistic, relightable single-image hand avatar personalization. PALM's scale and diversity make it a valuable real-world resource for hand modeling and related research.

Laura Alejandra Encinar Gonzalez,John Folkesson,Rudolph Triebel,Riccardo Giubilato

Main category: cs.CV

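TL;DR: Proposes MPRF, a loop closure detection method built on multimodal foundation models that combines vision and LiDAR to achieve robust pose estimation in GNSS-denied environments, outperforming existing retrieval methods and showing greater robustness in low-texture regions.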

Details

Motivation: In GNSS-denied environments such as planetary exploration, conventional visual and LiDAR methods struggle with loop closure detection because of missing texture, sparsity, and ambiguity, so a more robust solution is needed. Method: MPRF uses a two-stage visual retrieval strategy that combines DINOv2 features with SALAD aggregation for candidate screening, applies SONATA-based LiDAR descriptors for geometric verification, and adds 6-DoF pose estimation to improve localization accuracy and interpretability. Result: On the S3LI and S3LI Vulcano datasets, MPRF outperforms existing retrieval methods in precision, markedly improves pose estimation robustness in low-texture regions, and provides interpretable correspondences suitable for SLAM back-ends. Conclusion: By fusing vision and LiDAR foundation models, MPRF unifies place recognition and pose estimation with a good balance of accuracy, efficiency, and reliability, showing the potential of foundation models for SLAM in extreme unstructured environments. Abstract: Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF.
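
The screen-then-verify pattern described in this entry can be sketched generically: shortlist candidates by visual descriptor similarity, then accept one only if a second, geometric descriptor also agrees. The sketch below is an illustration under stated assumptions, not the MPRF pipeline; the descriptors are toy vectors standing in for learned features, and `retrieve_then_verify`, `k`, and `geo_threshold` are hypothetical names and parameters.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_then_verify(query_vis, query_geo, database, k=2, geo_threshold=0.9):
    """Return the id of the best verified loop-closure candidate, or None.

    database: list of (place_id, visual_descriptor, geometric_descriptor).
    Stage 1 keeps the k entries most similar to the visual query.
    Stage 2 keeps only shortlisted entries whose geometric descriptor
    matches the geometric query at or above geo_threshold.
    """
    shortlist = sorted(
        database, key=lambda entry: cosine(query_vis, entry[1]), reverse=True
    )[:k]
    verified = []
    for place_id, _, geo in shortlist:
        score = cosine(query_geo, geo)
        if score >= geo_threshold:
            verified.append((place_id, score))
    if not verified:
        return None
    return max(verified, key=lambda item: item[1])[0]


# Toy database: "A" looks most like the query but fails geometric verification.
database = [
    ("A", [1.0, 0.0], [1.0, 0.0]),
    ("B", [0.9, 0.1], [0.0, 1.0]),
    ("C", [0.0, 1.0], [1.0, 0.0]),
]
match = retrieve_then_verify([1.0, 0.0], [0.0, 1.0], database)
```

Returning `None` when no candidate passes verification reflects the usual preference in SLAM for missing a loop closure over accepting a false one.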

Aupendu Kar,Krishnendu Ghosh,Prabir Kumar Biswas

Main category: cs.CV

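TL;DR: Proposes a continual image restoration method based on a simple modification of the convolution layer; it leaves the backbone unchanged, integrates into any deep architecture, mitigates forgetting in continual learning without significant extra computation, and improves new-task performance through knowledge transfer.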

Details

Motivation: Existing continual learning methods for image restoration face large image sizes and diverse degradation types, and typically require complex architectural modifications with high computational overhead; regularization-based methods do not suit restoration tasks that need different feature processing. Method: The convolution layer is modified in a simple way by introducing learnable adaptation modules for new tasks while the backbone stays unchanged, giving parameter-efficient expansion and knowledge retention that can be integrated into any deep network. Result: Experiments show that adding new restoration tasks does not degrade performance on existing tasks, that new tasks benefit from the knowledge base built by previous tasks, and that parameters grow without a significant increase in computational overhead or inference time. Conclusion: The method enables efficient, flexible continual image restoration that adapts to new tasks without structural modifications, balancing model performance and computational efficiency. Abstract: Continual learning is an emerging topic in the field of deep learning, where a model is expected to learn continuously for new upcoming tasks without forgetting previous experiences. This field has witnessed numerous advancements, but few works have been attempted in the direction of image restoration. Handling large image sizes and the divergent nature of various degradation poses a unique challenge in the restoration domain. However, existing works require heavily engineered architectural modifications for new task adaptation, resulting in significant computational overhead. Regularization-based methods are unsuitable for restoration, as different restoration challenges require different kinds of feature processing. In this direction, we propose a simple modification of the convolution layer to adapt the knowledge from previous restoration tasks without touching the main backbone architecture. Therefore, it can be seamlessly applied to any deep architecture without any structural modifications. Unlike other approaches, we demonstrate that our model can increase the number of trainable parameters without significantly increasing computational overhead or inference time. Experimental validation demonstrates that new restoration tasks can be introduced without compromising the performance of existing tasks. We also show that performance on new restoration tasks improves by adapting the knowledge from the knowledge base created by previous restoration tasks. The code is available at https://github.com/aupendu/continual-restore.

[103] Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

Dogucan Yaman,Seymanur Akti,Fevziye Irem Eyiokur,Alexander Waibel

Main category: cs.CV

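TL;DR: Proposes a text-to-talking-face synthesis framework built on latent speech representations from HierSpeech++, using two-stage training to achieve high-quality audio-visual alignment.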

Details

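Motivation: Existing cascaded approaches suffer from inconsistency between speech and face generation and depend on ground-truth audio, which limits synthesis quality and flexibility. Method: A Text-to-Vec module converts text into Wav2Vec2 embeddings that jointly condition speech and face generation; a two-stage pretraining and fine-tuning strategy handles the distribution shift of TTS features. Result: At inference the system produces natural, expressive speech and synchronized facial motion without ground-truth audio, improving lip-sync and visual realism. Conclusion: The method outperforms conventional cascaded pipelines, achieves tight audio-visual alignment while preserving speaker identity, and suits high-quality virtual character synthesis. Abstract: We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.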

[104] How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

Tuan Anh Tran,Duy M. H. Nguyen,Hoai-Chau Tran,Michael Barz,Khoa D. Doan,Roger Wattenhofer,Ngo Anh Vien,Mathias Niepert,Daniel Sonntag,Paul Swoboda

Main category: cs.CV

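TL;DR: Proposes gitmerge3D, a globally informed graph token merging method that reduces the token count in 3D point cloud transformers by 90-95% while maintaining competitive performance, revealing that current models are over-tokenized.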

Details

Motivation: Existing 3D point cloud transformers rely on dense token representations, which incur high computational and memory costs during training and inference, so efficiency needs to improve. Method: gitmerge3D, a globally informed graph token merging method, reduces the token count by identifying and merging redundant tokens. Result: Validated on multiple 3D vision tasks, the method achieves token reductions of up to 90-95% with consistent gains in computational efficiency and stable performance. Conclusion: Current 3D transformer models contain substantial token redundancy, and gitmerge3D offers a new direction for building efficient 3D foundation architectures. Abstract: Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io

[105] The Potential of Copernicus Satellites for Disaster Response: Retrieving Building Damage from Sentinel-1 and Sentinel-2

Olivier Dietrich,Merlin Alfredsson,Emilia Arens,Nando Metzger,Torben Peters,Linus Scheibenreif,Jan Dirk Wegner,Konrad Schindler

Main category: cs.CV

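TL;DR: Investigates the potential of medium-resolution Copernicus imagery (Sentinel-1 and Sentinel-2) for building damage assessment, introducing the xBD-S12 dataset and validating it experimentally. Despite the 10 m spatial resolution, damage can be detected and mapped effectively in many disaster scenarios, complex models show no clear advantage, and Copernicus imagery is a viable data source for rapid, wide-area damage assessment.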

Details

Motivation: Natural disasters require rapid damage assessment to support humanitarian response, but very-high-resolution imagery is often of limited availability, so it matters whether medium-resolution imagery such as Copernicus data can support broad, timely damage assessment. Method: The authors build xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs from Sentinel-1 and Sentinel-2 that is spatially and temporally aligned with the existing xBD benchmark, and evaluate the damage detection performance of different model architectures in a series of experiments. Result: Despite the 10 m ground sampling distance, medium-resolution imagery detects and maps building damage rather well in many disaster scenarios; more complex models generalize poorly to unseen disasters, and geospatial foundation models bring little practical benefit. Conclusion: Medium-resolution Copernicus imagery is a viable data source for rapid, wide-area damage assessment and can be used alongside very-high-resolution imagery; the authors release the xBD-S12 dataset, code, and trained models to support further research. Abstract: Natural disasters demand rapid damage assessment to guide humanitarian response. Here, we investigate whether medium-resolution Earth observation images from the Copernicus program can support building damage assessment, complementing very-high resolution imagery with often limited availability. We introduce xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs from both Sentinel-1 and Sentinel-2, spatially and temporally aligned with the established xBD benchmark. In a series of experiments, we demonstrate that building damage can be detected and mapped rather well in many disaster scenarios, despite the moderate 10 m ground sampling distance. We also find that, for damage mapping at that resolution, architectural sophistication does not seem to bring much advantage: more complex model architectures tend to struggle with generalization to unseen disasters, and geospatial foundation models bring little practical benefit. Our results suggest that Copernicus images are a viable data source for rapid, wide-area damage assessment and could play an important role alongside VHR imagery. We release the xBD-S12 dataset, code, and trained models to support further research.

[106] Photo Dating by Facial Age Aggregation

Jakub Paplham,Vojtech Franc

Main category: cs.CV

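TL;DR: Proposes a new method that estimates when a photo was taken from the faces in the image and releases CSFD-1.6M, a dataset of over 1.6 million annotated faces; by combining multi-face evidence from face recognition, age estimation, and career-based temporal priors, it significantly outperforms scene-based baselines.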

Details

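Motivation: Conventional photo dating relies mostly on scene information, which works poorly for images dominated by people; using multiple faces together with identity and birth-year information can improve accuracy. Method: A probabilistic framework fuses visual evidence from modern face recognition and age estimation models with temporal priors based on people's careers, aggregating information from multiple faces in a single image to infer the capture year. Result: Experiments on CSFD-1.6M show that aggregating evidence from multiple faces consistently improves performance and significantly outperforms strong scene-based baselines, especially for images with several identifiable individuals. Conclusion: Combining identity and age information from multiple faces with temporal priors is an effective route to photo dating; the method performs well in multi-face settings and confirms the value of facial information for temporal localization. Abstract: We introduce a novel method for Photo Dating which estimates the year a photograph was taken by leveraging information from the faces of people present in the image. To facilitate this research, we publicly release CSFD-1.6M, a new dataset containing over 1.6 million annotated faces, primarily from movie stills, with identity and birth year annotations. Uniquely, our dataset provides annotations for multiple individuals within a single image, enabling the study of multi-face information aggregation. We propose a probabilistic framework that formally combines visual evidence from modern face recognition and age estimation models, and career-based temporal priors to infer the photo capture year. Our experiments demonstrate that aggregating evidence from multiple faces consistently improves the performance and the approach significantly outperforms strong, scene-based baselines, particularly for images containing several identifiable individuals.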
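
The core idea of aggregating evidence across faces can be illustrated with a simplified model: if a person's birth year is known and an age estimator returns an apparent age, each face implies a capture year of roughly birth year plus age. The sketch below multiplies per-face likelihoods over candidate years; it is an assumption-laden illustration, not the authors' framework. In particular it assumes identities are known with certainty, age errors are Gaussian and independent across faces, and the optional `prior` dictionary merely stands in for the career-based prior.

```python
from math import exp


def age_likelihood(implied_age, estimated_age, sigma):
    """Unnormalized Gaussian likelihood of an implied age given the estimate."""
    return exp(-0.5 * ((implied_age - estimated_age) / sigma) ** 2)


def year_posterior(faces, years, prior=None):
    """Posterior over capture years from several faces in one photo.

    faces: list of (birth_year, estimated_age, age_sigma) tuples.
    years: iterable of candidate capture years.
    prior: optional dict mapping year -> prior weight (uniform if omitted).
    """
    scores = {}
    for year in years:
        score = prior.get(year, 0.0) if prior is not None else 1.0
        for birth_year, estimated_age, sigma in faces:
            score *= age_likelihood(year - birth_year, estimated_age, sigma)
        scores[year] = score
    total = sum(scores.values())
    return {year: score / total for year, score in scores.items()}


# Two hypothetical faces: one implies 1980 (1950 + 30), the other 1982 (1960 + 22).
faces = [(1950, 30, 5.0), (1960, 22, 5.0)]
posterior = year_posterior(faces, range(1970, 1991))
best_year = max(posterior, key=posterior.get)
```

With equal uncertainties the combined estimate falls between the two single-face estimates, and each additional face narrows the posterior, which matches the reported benefit of multi-face aggregation.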

[107] EventFlow: Real-Time Neuromorphic Event-Driven Classification of Two-Phase Boiling Flow Regimes

Sanghyeon Chang,Srikar Arani,Nishant Sai Nuthalapati,Youngjoon Suh,Nicholas Choi,Siavash Khodakarami,Md Rakibul Hasan Roni,Nenad Miljkovic,Aparna Chandramowlishwaran,Yoonjin Won

Main category: cs.CV

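TL;DR: Proposes a real-time flow regime classification framework based on event data from neuromorphic sensors, which is more accurate and faster than conventional image-based methods and suits real-time monitoring of efficient thermal management systems.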

Details

Motivation: Existing optical imaging methods are computationally heavy and lack temporal resolution, so they cannot capture transient changes in flow state in real time, which affects the performance and reliability of thermal management systems. Method: Event data are acquired with neuromorphic sensors and used to build five classification models (including an LSTM), which are compared with frame-based methods; an asynchronous processing pipeline with majority voting delivers low-latency continuous predictions. Result: The event-based LSTM model reaches 97.6% classification accuracy with a processing time of only 0.28 ms, clearly outperforming conventional frame-based methods. Conclusion: The event-driven framework is both accurate and responsive and can provide reliable real-time feedback for experimental control and intelligent thermal management. Abstract: Flow boiling is an efficient heat transfer mechanism capable of dissipating high heat loads with minimal temperature variation, making it an ideal thermal management method. However, sudden shifts between flow regimes can disrupt thermal performance and system reliability, highlighting the need for accurate and low-latency real-time monitoring. Conventional optical imaging methods are limited by high computational demands and insufficient temporal resolution, making them inadequate for capturing transient flow behavior. To address this, we propose a real-time framework based on signals from neuromorphic sensors for flow regime classification. Neuromorphic sensors detect changes in brightness at individual pixels, which typically correspond to motion at edges, enabling fast and efficient detection without full-frame reconstruction, providing event-based information. We develop five classification models using both traditional image data and event-based data, demonstrating that models leveraging event data outperform frame-based approaches due to their sensitivity to dynamic flow features. Among these models, the event-based long short-term memory model provides the best balance between accuracy and speed, achieving 97.6% classification accuracy with a processing time of 0.28 ms. Our asynchronous processing pipeline supports continuous, low-latency predictions and delivers stable output through a majority voting mechanism, enabling reliable real-time feedback for experimental control and intelligent thermal management.
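
Majority voting over recent predictions is a common way to stabilize a stream of per-window classifier outputs. The abstract does not specify the window length or tie-breaking rule, so the sketch below is one plausible reading rather than the authors' pipeline: the class name `MajorityVoter`, the window size, and the regime labels are all hypothetical.

```python
from collections import Counter, deque


class MajorityVoter:
    """Smooths a stream of class labels with a sliding-window majority vote."""

    def __init__(self, window=5):
        # Only the most recent `window` raw predictions are kept.
        self.history = deque(maxlen=window)

    def update(self, label):
        """Add a new raw prediction and return the smoothed label."""
        self.history.append(label)
        counts = Counter(self.history)
        top = max(counts.values())
        # Ties are broken in favor of the most recent tied label.
        for recent in reversed(self.history):
            if counts[recent] == top:
                return recent


# Hypothetical raw classifier outputs with one spurious "slug" at step 3.
voter = MajorityVoter(window=3)
raw = ["bubbly", "bubbly", "slug", "bubbly", "slug", "slug"]
smoothed = [voter.update(label) for label in raw]
```

The smoothed stream ignores the isolated misclassification and switches regime only once the new label dominates the window, at the cost of a delay of roughly half the window length.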

Xian-Hong Huang,Hui-Kai Su,Chi-Chia Sun,Jun-Wei Hsieh

Main category: cs.CV

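TL;DR: Proposes a cross-modal tiny object detection method that combines semantic-guided natural language processing with advanced visual recognition backbones; integrating BERT with PRB-FPN-Net and using architectures such as ELAN, MSP, and CSP, it substantially improves detection accuracy and efficiency.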

Details

Motivation: Tiny object detection suffers from indistinct features and insufficient contextual understanding, which motivates exploring effective ways to fuse natural language and visual information. Method: The BERT language model is combined with the CNN-based PRB-FPN-Net, backbone architectures such as ELAN, MSP, and CSP are used for multi-scale feature extraction and fusion, and lemmatization and fine-tuning align textual semantics with visual features. Result: The model reaches 52.6% average precision (AP) on the COCO2017 validation set, significantly outperforming YOLO-World with half the parameters of Transformer-based models such as GLIP, and performs strongly on both COCO and Objects365. Conclusion: By fusing natural language understanding with advanced backbone networks, the method sets new benchmarks in accuracy, efficiency, and scalability and suits object detection in complex scenes under resource constraints. Abstract: This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones. The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), incorporating innovative backbone architectures such as ELAN, MSP, and CSP to optimize feature extraction and fusion. By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects. Experimental validation using the COCO and Objects365 datasets demonstrates that the model achieves superior performance. On the COCO2017 validation set, it attains a 52.6% average precision (AP), outperforming YOLO-World significantly while maintaining half the parameter consumption of Transformer-based models like GLIP. Tests with different backbones such as ELAN, MSP, and CSP further demonstrate efficient handling of multi-scale objects, ensuring scalability and robustness in resource-constrained environments. This study underscores the potential of integrating natural language understanding with advanced backbone architectures, setting new benchmarks in object detection accuracy, efficiency, and adaptability to real-world challenges.

[109] GroupKAN: Rethinking Nonlinearity with Grouped Spline-based KAN Modeling for Efficient Medical Image Segmentation

Guojie Li,Anwar P. P. Abdul Majeed,Muhammad Ateeq,Anh Nguyen,Fan Zhang

Main category: cs.CV

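TL;DR: GroupKAN is a lightweight medical image segmentation network whose Grouped KAN Transform and Grouped KAN Activation modules lower computational complexity while improving accuracy and interpretability; it outperforms U-KAN on several medical datasets with fewer parameters.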

Details

Motivation: Convolutional networks lack adaptive nonlinearity and transparent decision-making, Transformers scale poorly because of quadratic complexity and opaque attention, and U-KAN, although an improvement, is still limited by the O(C^2) complexity of full-channel transformations, so a more efficient and scalable segmentation model is needed. Method: GroupKAN introduces two new modules: (1) Grouped KAN Transform, which partitions channels into G groups for multivariate spline mappings and reduces complexity to O(C^2/G); (2) Grouped KAN Activation, which shares spline mappings within each group for efficient token-wise nonlinearity. Result: On three medical image benchmarks (BUSI, GlaS, and CVC), GroupKAN achieves an average IoU of 79.80%, 1.11% higher than U-KAN, with only 47.6% of its parameters (3.02M vs 6.35M) and better interpretability. Conclusion: Through a structured grouping strategy, GroupKAN balances accuracy, efficiency, and interpretability in medical image segmentation, outperforms U-KAN and other conventional methods, and shows good application potential. Abstract: Medical image segmentation requires models that are accurate, lightweight, and interpretable. Convolutional architectures lack adaptive nonlinearity and transparent decision-making, whereas Transformer architectures are hindered by quadratic complexity and opaque attention mechanisms. U-KAN addresses these challenges using Kolmogorov-Arnold Networks, achieving higher accuracy than both convolutional and attention-based methods, fewer parameters than Transformer variants, and improved interpretability compared to conventional approaches. However, its O(C^2) complexity due to full-channel transformations limits its scalability as the number of channels increases. To overcome this, we introduce GroupKAN, a lightweight segmentation network that incorporates two novel, structured functional modules: (1) Grouped KAN Transform, which partitions channels into G groups for multivariate spline mappings, reducing complexity to O(C^2/G), and (2) Grouped KAN Activation, which applies shared spline-based mappings within each channel group for efficient, token-wise nonlinearity. Evaluated on three medical benchmarks (BUSI, GlaS, and CVC), GroupKAN achieves an average IoU of 79.80 percent, surpassing U-KAN by +1.11 percent while requiring only 47.6 percent of the parameters (3.02M vs 6.35M), and shows improved interpretability.
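
The reduction from O(C^2) to O(C^2/G) follows from counting input-output channel pairs: a full transform learns one mapping per pair of C channels, whereas a grouped transform only connects channels inside the same group. The short sketch below makes that arithmetic concrete; it counts channel pairs only, assumes C is divisible by G, and says nothing about the spline parameters per mapping, which the abstract does not give. The function names are hypothetical.

```python
def full_channel_pairs(channels):
    """Number of input-output channel mappings in a full-channel transform."""
    return channels * channels


def grouped_channel_pairs(channels, groups):
    """Number of mappings when channels are split into equal groups.

    Each of the `groups` groups connects (channels / groups) inputs to
    (channels / groups) outputs, giving channels**2 / groups in total.
    """
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    per_group = channels // groups
    return groups * per_group * per_group


# Example sizes chosen for illustration only.
full = full_channel_pairs(64)          # 64 * 64
grouped = grouped_channel_pairs(64, 8)  # 8 groups of 8 x 8
reduction = full // grouped             # equals the number of groups
```

The saving grows linearly with G, while each group sees fewer channels, so the choice of G trades cross-channel mixing against cost.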

[110] TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

Junwen Pan,Qizhe Zhang,Rui Zhang,Ming Lu,Xin Wan,Yuan Zhang,Chang Liu,Qi She

Main category: cs.CV

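TL;DR: Proposes TimeSearch-R, which reformulates temporal search as an interleaved text-video reasoning process optimized end to end with reinforcement learning, and introduces GRPO-CSV, which uses self-verification to improve the completeness of video reasoning, yielding significant gains on several long-form video understanding benchmarks.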

Details

Motivation: Existing temporal search methods rely on hand-crafted search processes without end-to-end optimization, making it hard to learn optimal search strategies, and reinforcement learning for this task suffers from unsupervised intermediate decisions, insufficient exploration, and inconsistent reasoning. Method: Temporal search is modeled as an interleaved text-video thinking process trained with reinforcement learning, specifically GRPO-CSV, in which a completeness self-verification step has the same policy model check whether the searched frames are sufficient; datasets are built for the SFT cold start and RL training, filtering out samples with weak temporal dependencies to raise task difficulty. Result: TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D and on long-form video understanding benchmarks such as VideoMME, MLVU, and LongVideoBench; on LongVideoBench it improves on the Qwen2.5-VL base model by 4.1% and on Video-R1 by 2.0%. Conclusion: By integrating temporal search into the reasoning process and adding GRPO-CSV self-verification, TimeSearch-R explores video content more completely and consistently, advances temporal search for long-form video understanding, and offers a new paradigm for end-to-end trainable temporal search. Abstract: Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
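
Group Relative Policy Optimization, which this entry builds on, scores each sampled response relative to the other responses drawn for the same prompt instead of using a learned value function. The sketch below shows only that group-normalized advantage step as it is commonly described; it is not the GRPO-CSV reward, whose details the abstract does not give. The use of the population standard deviation and the `eps` constant are assumptions, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded only for doing better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one query.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward design for intermediate steps matters in this setting.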

[111] Visual Spatial Tuning

Rui Yang,Ziyu Zhu,Yanwei Li,Jingjia Huang,Shen Yan,Siyuan Zhou,Zhe Liu,Xiangtai Li,Shuangye Li,Wenqian Wang,Yi Lin,Hengshuang Zhao

Main category: cs.CV

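TL;DR: Proposes Visual Spatial Tuning (VST), a framework for improving the spatial perception and reasoning of vision-language models (VLMs); by building the large-scale datasets VST-P and VST-R and using a progressive training strategy, it significantly improves results on several spatial understanding benchmarks without harming general capabilities.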

Details

Motivation: Existing methods enhance the spatial awareness of VLMs by adding extra expert encoders, which adds computational overhead and harms general capabilities, so a compatible and efficient way to systematically improve human-like spatial understanding is needed. Method: The VST framework includes two datasets, VST-P (4.1 million samples for spatial perception) and VST-R (135K samples for spatial reasoning), and trains in stages: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to improve reasoning. Result: VST reaches 34.8% on MMSI-Bench and 61.2% on VSIBench, both state-of-the-art, without affecting the model's general capabilities. Conclusion: VST effectively strengthens the spatial perception and reasoning of VLMs and offers a viable path toward more physically grounded vision-language-action models. Abstract: Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that the Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.

Table of Contents

cs.CL [Back]

[1] Evaluating LLMs' Reasoning Over Ordered Procedural Steps

[2] Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

[3] SARC: Sentiment-Augmented Deep Role Clustering for Fake News Detection

[4] Reasoning Up the Instruction Ladder for Controllable Language Models

[5] EncouRAGe: Evaluating RAG Local, Fast, and Reliable

[6] multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

[7] Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

[8] Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

[9] Measuring what Matters: Construct Validity in Large Language Model Benchmarks

[10] POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

[11] GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

[12] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

[13] Learning to reason about rare diseases through retrieval-augmented agents

[14] Surprisal reveals diversity gaps in image captioning and different scorers change the story

[15] Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

[16] Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

[17] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

[18] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

[19] BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models

[20] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

[21] Diagnosing and Mitigating Semantic Inconsistencies in Wikidata's Classification Hierarchy

[22] LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

[23] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

[24] Acquiring Common Chinese Emotional Events Using Large Language Model

[25] Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies

[26] UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

[27] Order-Level Attention Similarity Across Language Models: A Latent Commonality

[28] Reasoning-Guided Claim Normalization for Noisy Multilingual Social Media Posts

[29] On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class

[30] Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

[31] A Toolbox for Improving Evolutionary Prompt Search

[32] ManufactuBERT: Efficient Continual Pretraining for Manufacturing

[33] Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

[34] Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

[35] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

[36] Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

[37] Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

[38] What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

[39] Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

[40] A multimodal multiplex of the mental lexicon for multilingual individuals

[41] Large Language Models for Explainable Threat Intelligence

[42] Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning

[43] Steering Language Models with Weight Arithmetic

[44] MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis

cs.CV [Back]

[45] IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

[46] Knowledge-based anomaly detection for identifying network-induced shape artifacts

[47] CPO: Condition Preference Optimization for Controllable Image Generation

[48] DARN: Dynamic Adaptive Regularization Networks for Efficient and Robust Foundation Model Adaptation

[49] Global 3D Reconstruction of Clouds & Tropical Cyclones

[50] EETnet: a CNN for Gaze Detection and Tracking for Smart-Eyewear

[51] 3D Gaussian Point Encoders

[52] Data Efficiency and Transfer Robustness in Biomedical Image Segmentation: A Study of Redundancy and Forgetting with Cellpose

[53] An Active Learning Pipeline for Biomedical Image Instance Segmentation with Minimal Human Intervention

[54] Geometry Denoising with Preferred Normal Vectors

[55] Self-Supervised Implicit Attention Priors for Point Cloud Reconstruction

[56] Clinical-ComBAT: a diffusion-weighted MRI harmonization method for clinical applications

[57] Validating Vision Transformers for Otoscopy: Performance and Data-Leakage Effects

[58] Beta Distribution Learning for Reliable Roadway Crash Risk Assessment

[59] Learning to Restore Multi-Degraded Images via Ingredient Decoupling and Task-Aware Path Adaptation

[60] A benchmark multimodal oro-dental dataset for large vision-language models

[61] DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

[62] CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting

[63] Pattern-Aware Diffusion Synthesis of fMRI/dMRI with Tissue and Microstructural Refinement

[64] Learning Fourier shapes to probe the geometric world of deep neural networks

[65] Challenges in 3D Data Synthesis for Training Neural Networks on Topological Features

[66] GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

[67] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

[68] Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation

[69] Pressure2Motion: Hierarchical Motion Synthesis from Ground Pressure with Text Guidance

[70] Medical Referring Image Segmentation via Next-Token Mask Prediction

[71] No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation

[72] Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

[73] SurgiATM: A Physics-Guided Plug-and-Play Model for Deep Learning-Based Smoke Removal in Laparoscopic Surgery

[74] Deep learning models are vulnerable, but adversarial examples are even more vulnerable

[75] A Dual-stage Prompt-driven Privacy-preserving Paradigm for Person Re-Identification

[76] Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start

[77] Early Alzheimer's Disease Detection from Retinal OCT Images: A UK Biobank Study