cs.CL

[1] Disaster Question Answering with LoRA Efficiency and Accurate End Position

Takato Yasuno

Main category: cs.CL

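TL;DR: This paper presents a question answering system for Japanese natural-disaster scenarios. It pairs an optimized BERT-BiLSTM architecture with LoRA to extract disaster information accurately with far fewer trainable parameters (Span F1 of 0.885), aiming at reliable, low-hallucination knowledge services during disasters.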

Details Motivation: Natural disasters are infrequent and highly localized, so the public has little experience in responding to them. Existing RAG-plus-LLM approaches are prone to hallucination, which can spread misleading information and endanger decisions made during a disaster. Method: A question answering system built on Japanese disaster scenarios, using the cl-tohoku/bert-base-japanese-v3 + Bi-LSTM + Enhanced Position Heads architecture and fine-tuned efficiently with LoRA. Result: End Position accuracy of 70.4% with only 5.7% of the parameters (6.7M/117M) and a Span F1 score of 0.885, a level the authors consider suitable for practical disaster response. Conclusion: The lightweight model shows the value of domain-specific optimization for reliable, practical disaster question answering. Follow-up work includes benchmark datasets, knowledge-enhanced fine-tuning, edge deployment, and continual learning. Abstract: Natural disasters such as earthquakes, torrential rainfall, floods, and volcanic eruptions occur with extremely low frequency and affect limited geographic areas. When individuals face disaster situations, they often experience confusion and lack the domain-specific knowledge and experience necessary to determine appropriate responses and actions. While disaster information is continuously updated, even when utilizing RAG search and large language models for inquiries, obtaining relevant domain knowledge about natural disasters and experiences similar to one's specific situation is not guaranteed. When hallucinations are included in disaster question answering, artificial misinformation may spread and exacerbate confusion. This work introduces a disaster-focused question answering system based on Japanese disaster situations and response experiences. Utilizing the cl-tohoku/bert-base-japanese-v3 + Bi-LSTM + Enhanced Position Heads architecture with LoRA efficiency optimization, we achieved 70.4% End Position accuracy with only 5.7% of the total parameters (6.7M/117M). Experimental results demonstrate that the combination of Japanese BERT-base optimization and Bi-LSTM contextual understanding achieves accuracy levels suitable for real disaster response scenarios, attaining a 0.885 Span F1 score. Future challenges include: establishing natural disaster Q&A benchmark datasets, fine-tuning foundation models with disaster knowledge, developing lightweight and power-efficient edge AI Disaster Q&A applications for situations with insufficient power and communication during disasters, and addressing disaster knowledge base updates and continual learning capabilities.
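The two headline numbers in entry [1] can be made concrete with a short sketch. The abstract does not define "Span F1", so the function below assumes the common token-overlap definition used for extractive QA; `span_f1` and the inclusive `(start, end)` span format are illustrative choices, not the paper's code.

```python
def span_f1(pred, gold):
    """Token-overlap F1 between two inclusive (start, end) token spans.

    Assumption: this mirrors the usual extractive-QA overlap metric;
    the paper may define Span F1 differently.
    """
    pred_tokens = set(range(pred[0], pred[1] + 1))
    gold_tokens = set(range(gold[0], gold[1] + 1))
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Trainable fraction quoted in the abstract: 6.7M of 117M parameters.
trainable_fraction = 6.7e6 / 117e6

# A prediction whose end position is one token short of the gold span
# still overlaps heavily, which is why span F1 can sit well above
# exact end-position accuracy.
example_f1 = span_f1((3, 7), (3, 8))
```

Under this definition a prediction that misses the end position by a single token keeps a high F1, which may help explain how a 0.885 Span F1 can coexist with 70.4% End Position accuracy.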

[2] Inference-time Alignment via Sparse Junction Steering

Runyi Hu,Jie Zhang,Shiqian Zhao,Jiale Meng,Jiwei Li,Jason Zeng,Ming Wu,Michael Heinrich,Yonggang Wen,Tianwei Zhang

Main category: cs.CL

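TL;DR: This paper proposes Sparse Inference-time Alignment (SIA), which intervenes only at high-entropy key decision points along the generation path. This markedly improves the trade-off between alignment efficiency and generation quality and reduces computational overhead.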

Details Motivation: Existing token-level steering methods intervene densely at every decoding step, which is computationally expensive and can degrade generation quality. The authors argue that dense intervention is unnecessary and look for a more efficient, more natural intervention strategy. Method: Sparse Inference-time Alignment (SIA) identifies high-entropy junctions in the generation trajectory as key decision points and introduces alignment-related reward signals only at those points, yielding sparse intervention. It can be combined with search methods such as Best-of-N. Result: Across multiple model families and alignment tasks, intervening on only 20% to 80% of tokens gives a better alignment-efficiency trade-off. For strong base models such as Qwen3, intervening on as few as 20% of tokens matches or even surpasses heavily post-trained instruct models. Computational cost drops by up to 6x. Conclusion: Sparse intervention is more effective than dense intervention: it allows stronger guidance while better preserving the model's native distribution, making it a better paradigm for inference-time alignment. Abstract: Token-level steering has emerged as a pivotal approach for inference-time alignment, enabling fine-grained control over large language models by modulating their output distributions without parameter updates. While effective, existing methods rely on dense intervention at every decoding step. This persistent manipulation not only incurs substantial computational overhead but also risks compromising generation quality by excessively drifting from the model's intrinsic distribution. In this work, we show that dense intervention is unnecessary and propose Sparse Inference-time Alignment (SIA), which performs sparse junction steering by intervening only at critical decision points along the generation trajectory. Our key insight is that high-entropy junctions mark pivotal decision points in the generation trajectory and are particularly susceptible to misalignment, indicating the need to introduce alignment-related reward signals at these points. Extensive experiments across different model families and alignment objectives show that steering only 20% to 80% of tokens achieves superior alignment-efficiency trade-offs. For strong base models such as Qwen3, intervening on as few as 20% of tokens matches or even surpasses heavily post-trained instruct models. This sparsity enables stronger guidance while better preserving the model's native distribution, integrates seamlessly with search-based methods such as Best-of-N, and reduces computational cost by up to 6x.
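The idea of picking out high-entropy junctions in entry [2] can be illustrated with a minimal sketch. It assumes each decoding step exposes a next-token probability distribution and that a fixed entropy threshold decides where to intervene; the function names and the threshold rule are hypothetical, and the paper's actual selection criterion may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_junctions(step_distributions, threshold):
    """Return indices of decoding steps whose entropy exceeds the threshold.

    Only these steps would receive a steering intervention; all other
    steps would be decoded from the unmodified model distribution.
    """
    return [
        i for i, probs in enumerate(step_distributions)
        if entropy(probs) > threshold
    ]


steps = [
    [1.0, 0.0, 0.0],  # fully confident step: entropy 0
    [0.5, 0.5],       # maximally uncertain over two tokens: ln 2
    [0.9, 0.1],       # fairly confident step
]
junctions = select_junctions(steps, threshold=0.5)
```

With this toy input only the second step is flagged, so a steering signal would be applied at one of three positions while the other two keep the model's native distribution.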

[3] EQ-5D Classification Using Biomedical Entity-Enriched Pre-trained Language Models and Multiple Instance Learning

Zhyar Rzgar K Rostam,Gábor Kertész

Main category: cs.CL

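TL;DR: This paper presents a new method for detecting use of the EQ-5D instrument. It combines pre-trained language models (BERT, SciBERT, BioBERT) with biomedical entity information extracted by scispaCy and applies Multiple Instance Learning (MIL), reaching a study-level F1 of 0.82 with nearly perfect recall and clearly outperforming traditional baselines.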

Details Motivation: Manually screening large volumes of literature to identify studies that use the EQ-5D is time-consuming, error-prone, and inconsistent, so systematic reviews need an efficient and accurate automated method. Method: General-purpose and domain-specific pre-trained language models (BERT, SciBERT, BioBERT) are fine-tuned and enriched with biomedical entity information extracted by scispaCy. Nine experimental setups combine the PLMs with scispaCy models. A Multiple Instance Learning (MIL) framework with attention pooling treats each abstract as a bag of enriched sentences and aggregates sentence-level predictions into study-level predictions. Result: Study-level F1 reaches 0.82 with recall close to 1.0, significantly exceeding bag-of-words baselines and recently reported PLM baselines. Entity enrichment improves domain adaptation and generalization. Conclusion: Combining entity enrichment with the MIL framework substantially improves automatic EQ-5D identification and offers an automated, high-recall screening approach for systematic reviews. Abstract: The EQ-5D (EuroQol 5-Dimensions) is a standardized instrument for the evaluation of health-related quality of life. In health economics, systematic literature reviews (SLRs) depend on the correct identification of publications that use the EQ-5D, but manual screening of large volumes of scientific literature is time-consuming, error-prone, and inconsistent. In this study, we investigate fine-tuning of general-purpose (BERT) and domain-specific (SciBERT, BioBERT) pre-trained language models (PLMs), enriched with biomedical entity information extracted through scispaCy models for each statement, to improve EQ-5D detection from abstracts. We conduct nine experimental setups, including combining three scispaCy models with three PLMs, and evaluate their performance at both the sentence and study levels. Furthermore, we explore a Multiple Instance Learning (MIL) approach with attention pooling to aggregate sentence-level information into study-level predictions, where each abstract is represented as a bag of enriched sentences (by scispaCy). The findings indicate consistent improvements in F1-scores (reaching 0.82) and nearly perfect recall at the study-level, significantly exceeding classical bag-of-words baselines and recently reported PLM baselines. These results show that entity enrichment significantly improves domain adaptation and model generalization, enabling more accurate automated screening in systematic reviews.
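The attention-pooling step of the MIL approach in entry [3] can be sketched as follows. This is a simplified illustration: it pools scalar sentence-level scores rather than sentence embeddings, and the attention logits are passed in directly instead of being produced by a learned network. `attention_pool` and its inputs are hypothetical names.

```python
import math


def attention_pool(sentence_scores, attention_logits):
    """Aggregate sentence-level scores into one bag-level (abstract-level) score.

    The logits are turned into weights with a softmax, so sentences with
    larger logits contribute more to the study-level prediction.
    """
    max_logit = max(attention_logits)  # subtract the max for numerical stability
    exps = [math.exp(a - max_logit) for a in attention_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    bag_score = sum(w * s for w, s in zip(weights, sentence_scores))
    return bag_score, weights


# Three sentences in one abstract; the second has a high EQ-5D score
# and also the largest attention logit.
score, weights = attention_pool([0.1, 0.9, 0.2], [0.0, 2.0, 0.0])
```

Because one strongly attended sentence can dominate the bag score, this kind of pooling tends to favour recall at the study level, which is in line with the near-perfect recall reported above.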

[4] Applied Sociolinguistic AI for Community Development (ASA-CD): A New Scientific Paradigm for Linguistically-Grounded Social Intervention

S M Ruhul Alam,Rifa Ferzana

Main category: cs.CL

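TL;DR: This paper proposes Applied Sociolinguistic AI for Community Development (ASA-CD), a new paradigm that addresses community problems through linguistically driven AI intervention. It comprises linguistic biomarkers, development-aligned NLP, and a five-phase discursive intervention protocol, and reports a proof-of-concept empirical study.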

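Details Motivation: Addressing community challenges calls for a new intervention paradigm that combines a linguistic foundation with AI, in order to tackle real-world problems such as discursive fragmentation and exclusionary language. Method: The ASA-CD paradigm comprises (1) linguistic biomarkers as computational indicators of discursive fragmentation; (2) development-aligned natural language processing (NLP) that prioritises collective outcomes; and (3) a standardised five-phase protocol for discursive intervention. A proof-of-concept study uses real-world and synthetic corpora. Result: The study finds systematic associations between exclusionary language and negative sentiment and simulates the improvements an intervention would bring. Conclusion: ASA-CD provides a unified methodological, ethical, and empirical framework for scalable, value-aligned AI in the service of community empowerment. Abstract: This paper establishes Applied Sociolinguistic AI for Community Development (ASA-CD) as a novel scientific paradigm for addressing community challenges through linguistically grounded, AI-enabled intervention. ASA-CD introduces three key contributions: (1) linguistic biomarkers as computational indicators of discursive fragmentation; (2) development-aligned natural language processing (NLP), an AI optimisation paradigm prioritising collective outcomes; and (3) a standardised five-phase protocol for discursive intervention. A proof-of-concept study, incorporating real-world and synthetic corpora, demonstrates systematic associations between exclusionary language and negative sentiment and simulates intervention-based improvements. ASA-CD provides a unified methodological, ethical and empirical framework for scalable, value-aligned AI in the service of community empowerment.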

[5] EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors

Amin Banayeeanzade,Qingchuan Yang,Deqing Fu,Spencer Hong,Erin Babinsky,Alfy Samuel,Anoop Kumar,Robin Jia,Sai Praneeth Karimireddy

Main category: cs.CL

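TL;DR: This paper proposes EPSVec, a lightweight differentially private text generation method that steers a large language model with dataset vectors in activation space to produce high-quality synthetic data. It decouples the privacy budget from generation, so arbitrarily many samples can be generated at low computational cost.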

Details Motivation: Existing private text generation methods are inefficient: they are data-intensive, computationally slow, and need large private corpora or batch sizes to reach usable quality, while high-quality sensitive data is hard to share. Method: EPSVec steers LLM generation with "dataset vectors", directions in activation space that capture the distributional gap between private data and public priors. The vectors are extracted and sanitized with differential privacy once, after which standard decoding is used. Pretrained base models and fixed-shot prompting improve diversity and fidelity. Result: EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead. Conclusion: EPSVec is an efficient, lightweight, differentially private framework for synthetic text generation. It eases the tension between privacy, utility, and efficiency found in earlier methods and offers a practical path for ML development on sensitive data. Abstract: High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a differentially-private lightweight alternative that steers LLM generation using *dataset vectors*--directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes. Furthermore, we enhance our method by utilizing pretrained (base) models and introducing fixed-shot prompting to boost generation diversity and fidelity. Our experiments demonstrate that EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.
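A rough sketch of the "extract once, sanitize once" idea in entry [5] follows. The abstract does not give the exact mechanism, so everything here is an assumption: the dataset vector is taken to be the mean clipped difference between private activations and the public mean, and sanitization is modelled as added Gaussian noise with a caller-supplied standard deviation. No specific privacy guarantee is claimed for these parameter values.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def noisy_dataset_vector(private_acts, public_acts, max_norm, noise_std, rng):
    """Hypothetical one-shot extraction of a noised steering direction.

    Each private example contributes a clipped difference from the public
    mean, which bounds its influence; Gaussian noise is then added once.
    """
    public_mean = mean(public_acts)
    diffs = [
        clip([a - b for a, b in zip(act, public_mean)], max_norm)
        for act in private_acts
    ]
    averaged = mean(diffs)
    return [x + rng.gauss(0.0, noise_std) for x in averaged]


rng = random.Random(0)
vector = noisy_dataset_vector(
    private_acts=[[1.0, 0.0], [3.0, 0.0]],
    public_acts=[[0.0, 0.0], [0.0, 0.0]],
    max_norm=2.0,
    noise_std=0.0,  # zero noise here only to make the arithmetic visible
    rng=rng,
)
```

Because the noised vector is computed a single time, it can be reused for any number of generations, which is the sense in which the privacy budget is decoupled from the number of synthetic samples.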

[6] Reasoning-Based Personalized Generation for Users with Sparse Data

Bo Ni,Branislav Kveton,Samyadeep Basu,Subhojyoti Mukherjee,Leyao Wang,Franck Dernoncourt,Sungchul Kim,Seunghyun Yoon,Zichao Wang,Ruiyi Zhang,Puneet Mathur,Jihyung Kil,Jiuxiang Gu,Nedim Lipka,Yu Wang,Ryan A. Rossi,Tyler Derr

Main category: cs.CL

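TL;DR: This paper proposes GraSPer, a framework that uses graph-based prediction and reasoning alignment to strengthen LLM personalized text generation for users with sparse interaction histories.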

Details Motivation: Many real-world users, such as cold-start users on social platforms and newly registered e-commerce customers, have sparse interaction histories, which limits LLM-based personalized generation. Method: GraSPer first uses a graph to predict items the user is likely to interact with in the future, augmenting the context. With reasoning alignment, it then generates texts for these predicted interactions to enrich the context further. Finally, it generates personalized output conditioned on both the real and synthetic histories. Result: Experiments on three personalized generation benchmarks show that GraSPer significantly improves personalization for sparse-history users. Conclusion: GraSPer eases the LLM personalization bottleneck under sparse context and offers a workable approach to personalized generation for cold-start users. Abstract: Large Language Model (LLM) personalization holds great promise for tailoring responses by leveraging personal context and history. However, real-world users usually possess sparse interaction histories with limited personal context, such as cold-start users in social platforms and newly registered customers in online E-commerce platforms, compromising the LLM-based personalized generation. To address this challenge, we introduce GraSPer (Graph-based Sparse Personalized Reasoning), a novel framework for enhancing personalized text generation under sparse context. GraSPer first augments user context by predicting items that the user would likely interact with in the future. With reasoning alignment, it then generates texts for these interactions to enrich the augmented context. In the end, it generates personalized outputs conditioned on both the real and synthetic histories, ensuring alignment with user style and preferences. Extensive experiments on three benchmark personalized generation datasets show that GraSPer achieves significant performance gain, substantially improving personalization in sparse user context settings.

[7] Field-Theoretic Memory for AI Agents: Continuous Dynamics for Context Preservation

Subhadip Mitra

Main category: cs.CL

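TL;DR: This paper proposes a memory system for AI agents based on continuous field theory, modelling memory as a semantic field governed by partial differential equations rather than as discrete database entries. It markedly improves multi-session reasoning, temporal reasoning, and knowledge updating on long-context benchmarks and reaches close to 100% collective intelligence in multi-agent collaboration.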

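Details Motivation: Conventional discrete memory systems struggle to model how memories evolve over time, how they decay according to importance, and how they interact across agents. This calls for a continuous, physics-inspired memory representation that is closer to cognitive mechanisms. Method: Memory is modelled as a continuous field over semantic space with diffusion, importance-modulated thermodynamic decay, and field coupling. Its spatiotemporal evolution is described by partial differential equations and applied to multi-agent collaboration. Result: On LongMemEval, multi-session reasoning F1 improves by 116% (p<0.01), temporal reasoning by 43.8% (p<0.001), and retrieval recall on knowledge updates by 27.8% (p<0.001). Multi-agent field coupling achieves >99.8% collective intelligence. Conclusion: Continuous field theory offers a new paradigm for AI memory modelling that captures the dynamic, selective, and social character of memory more naturally and shows clear advantages on long-context and multi-agent tasks. Abstract: We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi-agent scenarios. We evaluate the system on two established long-context benchmarks: LoCoMo (ACL 2024) with 300-turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi-session reasoning over 500+ turns. On LongMemEval, the field-theoretic approach achieves significant improvements: +116% F1 on multi-session reasoning (p<0.01, d=3.06), +43.8% on temporal reasoning (p<0.001, d=9.21), and +27.8% retrieval recall on knowledge updates (p<0.001, d=5.00). Multi-agent experiments show near-perfect collective intelligence (>99.8%) through field coupling. Code is available at github.com/rotalabs/rotalabs-fieldmem.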
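The "diffuse and decay" dynamics described in entry [7] can be illustrated with one explicit Euler step of a 1D diffusion equation with a decay term. The abstract does not state the governing equations, so the form used here (diffusion plus a decay rate divided by importance) and all parameter names are assumptions for illustration, not the rotalabs-fieldmem implementation.

```python
def field_step(field, importance, diffusion=0.1, decay=0.05, dt=1.0):
    """Advance a 1D memory field by one explicit Euler step.

    Assumed dynamics: du/dt = diffusion * laplacian(u) - (decay / importance) * u,
    with reflecting boundaries. Higher importance means slower decay.
    """
    n = len(field)
    updated = []
    for i in range(n):
        left = field[i - 1] if i > 0 else field[i]
        right = field[i + 1] if i < n - 1 else field[i]
        laplacian = left - 2 * field[i] + right
        rate = decay / importance[i]
        updated.append(field[i] + dt * (diffusion * laplacian - rate * field[i]))
    return updated


# A single stored memory spreads to neighbouring cells (no decay here).
spread = field_step([0.0, 1.0, 0.0], [1.0, 1.0, 1.0], diffusion=0.1, decay=0.0)

# With diffusion switched off, more important cells fade more slowly.
faded = field_step([1.0, 1.0, 1.0], [1.0, 2.0, 4.0], diffusion=0.0, decay=0.4)
```

In the first call the total field mass is preserved while activation leaks to the neighbours; in the second the cell with importance 4 retains the most activation after one step.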

[8] Task-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases

Riya Adsul,Balachandra Devarangadi Sunil,Isha Nalawade,Sudharshan Govindan

Main category: cs.CL

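TL;DR: This paper proposes a dynamic LoRA adapter composition framework based on vector retrieval. It builds a task-aware vector database from 22 NLP datasets, retrieves similar examples at inference time, and merges multiple LoRA adapters with retrieval-derived weights, enabling zero-shot cross-task generalization and markedly better multi-task performance without extra training or model retraining.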

Details Motivation: Parameter-efficient fine-tuning methods such as LoRA support task adaptation of large models, but composing multiple specialized adapters efficiently for unseen tasks remains difficult. Method: A task-aware vector database is built from 22 NLP datasets. At inference time, similar training examples are retrieved, a task similarity distribution is computed via nucleus sampling, and LoRA adapters are merged dynamically with retrieval-weighted fusion strategies (Linear, Concatenation, TIES, Magnitude Prune). Result: Linear merging reaches 70.95% on PIQA and 77.62% on RTE, well above the single-task baselines (46% and 52%). Overall performance often matches or exceeds that of individually fine-tuned task-specific adapters. Conclusion: Retrieval-based dynamic adapter merging is a scalable, parameter-efficient approach to multi-task learning that needs no extra training and supports efficient, interpretable adapter composition on a frozen model. Abstract: Parameter-efficient fine-tuning methods like LoRA have enabled task-specific adaptation of large language models, but efficiently composing multiple specialized adapters for unseen tasks remains challenging. We present a novel framework for dynamic LoRA adapter composition that leverages similarity retrieval in vector databases to enable zero-shot generalization across diverse NLP tasks. Our approach constructs a task-aware vector database by embedding training examples from 22 datasets spanning commonsense reasoning, question answering, natural language inference, and sentiment analysis. At inference time, we retrieve the most similar training examples, compute task similarity distributions via nucleus sampling, and dynamically merge relevant LoRA adapters using retrieval-weighted fusion strategies. We evaluated four merging methods (Linear, Concatenation, TIES, and Magnitude Prune), demonstrating that our dataset-centric retrieval approach often matches or exceeds the performance of individually fine-tuned task-specific adapters. Notably, Linear merging achieves 70.95% on PIQA and 77.62% on RTE, substantially outperforming single-task baselines (46% and 52%, respectively). Our framework requires no additional retriever training, operates with frozen embeddings, and enables efficient, interpretable adapter composition. These results suggest that retrieval-based dynamic merging offers a promising direction for scalable, parameter-efficient multitask learning without requiring full model retraining for each new task.
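The retrieve, weight, and merge pipeline in entry [8] can be sketched in a few lines. Two assumptions are made: "nucleus sampling" is read as a top-p truncation of the task distribution induced by the retrieved examples, and each adapter is represented as a flat list of numbers standing in for its weight delta. The function names and the toy adapters are hypothetical.

```python
from collections import Counter


def task_weights(retrieved_tasks, top_p=0.9):
    """Turn the source-task labels of retrieved examples into merge weights.

    Tasks are ranked by how often they were retrieved, the smallest prefix
    covering at least top_p of the retrievals is kept, and the kept
    counts are renormalized.
    """
    counts = Counter(retrieved_tasks)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept, cumulative = [], 0
    for task, count in ranked:
        kept.append((task, count))
        cumulative += count
        if cumulative >= top_p * total:
            break
    kept_total = sum(count for _, count in kept)
    return {task: count / kept_total for task, count in kept}


def linear_merge(adapters, weights):
    """Weighted sum of adapter deltas (the 'Linear' merging strategy)."""
    dim = len(next(iter(adapters.values())))
    merged = [0.0] * dim
    for task, weight in weights.items():
        for i, value in enumerate(adapters[task]):
            merged[i] += weight * value
    return merged


retrieved = ["piqa"] * 6 + ["rte"] * 3 + ["sst2"] * 1
weights = task_weights(retrieved, top_p=0.8)
adapters = {"piqa": [3.0, 0.0], "rte": [0.0, 3.0], "sst2": [9.0, 9.0]}
merged = linear_merge(adapters, weights)
```

Here the rarely retrieved `sst2` adapter falls outside the nucleus and is dropped, so the merged delta is a 2:1 blend of the `piqa` and `rte` adapters.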

[9] Make Every Draft Count: Hidden State based Speculative Decoding

Yuetao Chen,Xuliang Wang,Xinzhou Zheng,Ming Li,Peng Wang,Hong Xu

Main category: cs.CL

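TL;DR: This paper proposes a new speculative decoding system that predicts auto-regressively at the hidden-state level and injects token information afterwards, so that draft hidden states from failed verifications can be reused. This markedly improves compute efficiency, with up to a 3.3x speedup.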

Details Motivation: In existing speculative decoding, many draft tokens fail verification and their computation is wasted. This work aims to reclaim that discarded computation. Method: (1) A draft model architecture based on auto-regressive hidden states; (2) an efficient token information injection mechanism that builds high-quality draft token trees and supports resampling tokens from verification failures; (3) removal of the extra overhead introduced by the design to maximize hardware utilization. Result: Evaluated against multiple baselines, the system achieves up to a 3.3x inference speedup over standard speculative decoding. Conclusion: Reusing computation at the hidden-state level instead of discarding it at the token level substantially reduces waste in speculative decoding and offers a new approach to efficient LLM inference. Abstract: Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: the majority of draft tokens fail verification and are discarded, resulting in wasted computation. Motivated by the goal of recollecting this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens. Our key insight is to perform auto-regressive prediction at the hidden-state level and postpone integrating token information until after the hidden states are generated, so the draft hidden states are not contaminated by incorrect tokens, enabling hidden state reuse. To implement such a system, first we introduce a draft model architecture based on auto-regressive hidden states, which preserves richer semantics than token-based drafters to facilitate draft repurposing. Second, we design an efficient token information injection mechanism that leverages our specialized draft model to construct high-quality draft token trees and enables resampling tokens from verification failures. Third, we eliminate the overhead hidden in our design to further maximize hardware utilization. We conducted extensive evaluations against various baselines, demonstrating up to a 3.3x speedup against standard speculative decoding.

[10] Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal

Mohammed Hamdan,Vincenzo Dentamaro,Giuseppe Pirlo,Mohamed Cheriet

Main category: cs.CL

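TL;DR: This paper studies the efficiency gains of progressive data scheduling (gradually increasing the amount of training data) across different document understanding models. The schedule significantly reduces training time and benefits capacity-constrained models such as BERT, but brings no additional gain for multimodal models with strong inductive bias such as LayoutLMv3. The efficiency gain comes mainly from reduced data volume rather than data ordering, which makes the schedule a reliable compute-reduction method.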

Details Motivation: To test whether progressive data scheduling, a curriculum learning strategy, yields consistent training-efficiency gains across architecturally distinct document understanding models. Method: On the FUNSD and CORD datasets, BERT (text-only) and LayoutLMv3 (multimodal) are trained with a 33% → 67% → 100% progressive data schedule. Matched-compute baselines (Standard-7) separate curriculum effects from compute reduction, and ablations compare several schedule orderings (progressive, two-phase, reverse, random). Result: The progressive schedule cuts wall-clock training time by about 33%. On FUNSD, BERT significantly outperforms the matched-compute baseline (ΔF1 = +0.023, p = 0.022), while LayoutLMv3 shows no significant difference (p = 0.621). On CORD, all settings converge to similar F1 (≥ 0.947). Ablations confirm that the efficiency gain comes from reduced data volume rather than ordering. Conclusion: Progressive data scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits that depend on the interaction between model capacity and task complexity. Abstract: We investigate whether progressive data scheduling -- a curriculum learning strategy that incrementally increases training data exposure (33% → 67% → 100%) -- yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33%, commensurate with the reduction from 10.0 to 6.67 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT (ΔF1 = +0.023, p = 0.022, d_z = 3.83), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast, no analogous benefit is observed for LayoutLMv3 (p = 0.621), whose multimodal representations provide sufficient inductive bias. On the CORD dataset, all conditions converge to equivalent F1 scores (≥ 0.947) irrespective of scheduling, indicating a performance ceiling. Schedule ablations comparing progressive, two-phase, reverse, and random pacing confirm that the efficiency gain derives from reduced data volume rather than ordering. Taken together, these findings demonstrate that progressive scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits contingent on the interaction between model capacity and task complexity.
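The 6.67 versus 10.0 epoch-equivalents quoted in entry [10] can be reproduced with simple arithmetic under one assumption that the abstract does not state: the baseline trains for 10 full epochs and the progressive schedule splits those 10 epochs into three equal phases that see one third, two thirds, and all of the data. `effective_epochs` is an illustrative helper.

```python
def effective_epochs(total_epochs, fractions):
    """Full-dataset epoch-equivalents consumed by a phased schedule.

    Assumes the phases have equal length and each phase trains on the
    given fraction of the dataset per epoch.
    """
    phase_length = total_epochs / len(fractions)
    return sum(phase_length * fraction for fraction in fractions)


baseline = effective_epochs(10, [1.0])               # standard training
progressive = effective_epochs(10, [1 / 3, 2 / 3, 1.0])
saving = 1 - progressive / baseline
```

Under these assumptions `progressive` comes out at about 6.67 and `saving` at about one third, matching the roughly 33% wall-clock reduction; it would also be consistent with a matched-compute baseline of about 7 standard epochs, as the name Standard-7 suggests.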

Ezieddin Elmahjub,Junaid Qadir,Abdullah Mushtaq,Rafay Naeem,Ibrahim Ghaznavi,Waleed Iqbal

Main category: cs.CL

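TL;DR: This paper introduces IslamicLegalBench, the first benchmark for evaluating how well large language models reason about Islamic law. It covers 7 schools of jurisprudence and 13 tasks with 718 instances. Current mainstream models perform poorly: the best reaches only 68% correctness with 21% hallucination, and prompt engineering helps little, pointing to a fundamental knowledge gap in religious legal reasoning.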

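Details Motivation: As many Muslim users rely on large language models (such as GPT, Claude, and DeepSeek) for religious guidance, the reliability and safety of these models when reasoning about Islamic law (Sharia) need systematic evaluation. Method: The IslamicLegalBench benchmark covers the four Sunni schools of jurisprudence and three Shia schools, with 13 task types (including legal rulings, evidence analysis, and premise checking) and 718 instances in total. Nine state-of-the-art models are evaluated zero-shot and few-shot, measuring correctness, hallucination rate, and false-premise acceptance rate. Result: The best model reaches only 68% correctness with 21% hallucination. In false-premise detection, 6 of 9 models accept misleading assumptions more than 40% of the time. Moderate-complexity tasks show the highest error rates, while high-complexity tasks give an appearance of competence through semantic reasoning. Few-shot prompting improves only 2 of 9 models by more than 1%. Conclusion: Current large language models lack the foundational knowledge structure of Islamic jurisprudence, and prompt engineering alone cannot make up for it. IslamicLegalBench provides the first systematic evaluation framework for this domain and flags the high risk of using these models for spiritual guidance. Abstract: As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.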

[12] Budget-Aware Agentic Routing via Boundary-Guided Training

Caiqi Zhang,Menglin Xia,Xuchao Zhang,Daniel Madrigal,Ankur Mallick,Samuel Kessler,Victor Ruehle,Saravan Rajmohan

Main category: cs.CL

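TL;DR: This paper proposes Budget-Aware Agentic Routing for autonomous agents, which chooses between a low-cost and a high-cost LLM at each step to balance task success against a budget constraint. Boundary-Guided Training and the BoPO optimization algorithm mitigate sparse feedback and cheap-failure solutions, cutting cost substantially while maintaining performance.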

Details Motivation: As large language models act as autonomous agents on long-horizon tasks, calling a high-capability model at every step is too expensive, and existing model routing methods do not handle the path dependence, delayed feedback, and strict per-task budgets of agentic settings. Method: A budget-aware agentic routing framework comprising (1) a difficulty taxonomy and learning anchors built from two boundary policies, always-small and always-large; (2) stratified sampling of cost-efficient trajectories for a supervised fine-tuning (SFT) warm start; and (3) Boundary-Guided Policy Optimization (BoPO), which combines boundary-relative rewards with a reference-guided advantage. Result: Experiments show the method improves the efficiency frontier, matching strong routing baselines at substantially lower total cost, and generalizes to strict inference-time budget constraints. Conclusion: The work establishes a foundational framework for agentic routing, shifting model selection from a static decision to dynamic, budget-aware sequential decision-making. Abstract: As large language models (LLMs) evolve into autonomous agents that execute long-horizon workflows, invoking a high-capability model at every step becomes economically unsustainable. While model routing is effective for single-turn queries, agentic routing is a sequential, path-dependent problem: early mistakes compound, feedback often arrives only at the end of the episode, and deployments often demand strict per-task spending limits. We propose Budget-Aware Agentic Routing, which selects between a cheap and an expensive model at each step to optimize the cost-success frontier and to operate under strict per-task budgets. We propose Boundary-Guided Training, which leverages two boundary policies (always-small vs. always-large) to build a difficulty taxonomy and to anchor learning under sparse rewards. Our approach warm-starts with boundary-guided SFT data synthesis via stratified sampling of cost-efficient trajectories, then applies Boundary-Guided Policy Optimization (BoPO), combining boundary-relative rewards with a reference-guided advantage to avoid degenerate cheap-failure solutions. Experimental results show that our method improves the efficiency frontier, matching strong routing baselines at substantially lower cost while demonstrating generalization to strict inference-time budget constraints. Overall, our work establishes a foundational framework for agentic routing, shifting the paradigm from static model selection to dynamic, budget-aware sequential decision-making.
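The per-step decision problem in entry [12] can be made concrete with a toy router. This is a hand-written heuristic, not the learned BoPO policy: it assumes a per-step difficulty estimate is available, and the costs, threshold, and the rule of reserving enough budget to finish with the small model are all illustrative assumptions.

```python
def route_steps(difficulties, budget, small_cost=1.0, large_cost=10.0, threshold=0.5):
    """Pick 'small' or 'large' for each step without exceeding the budget.

    The large model is used only on steps judged difficult, and only if
    the remaining steps could still be completed with the small model.
    """
    choices, spent = [], 0.0
    for i, difficulty in enumerate(difficulties):
        steps_left = len(difficulties) - i - 1
        reserve = steps_left * small_cost
        can_afford_large = spent + large_cost + reserve <= budget
        if difficulty > threshold and can_afford_large:
            choices.append("large")
            spent += large_cost
        else:
            choices.append("small")
            spent += small_cost
    return choices, spent


difficulties = [0.2, 0.9, 0.8]
choices, spent = route_steps(difficulties, budget=15.0)

# The two boundary policies named in the abstract, for comparison.
always_small_cost = len(difficulties) * 1.0
always_large_cost = len(difficulties) * 10.0
```

With a budget of 15 the router can afford the large model on only one of the two hard steps, so its cost lands between the always-small and always-large boundaries. A learned policy would replace the fixed threshold with decisions trained against task success.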

[13] ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following

Yuancheng Yang,Lin Yang,Xu Wang,Chao Tong,Haihua Yang

Main category: cs.CL

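TL;DR: This paper proposes ImpRIF, which formalizes implicit-reasoning instructions as verifiable reasoning graphs and combines graph-driven chain-of-thought reasoning, fine-tuning on synthetic data, and reinforcement learning to markedly improve how well LLMs follow complex instructions involving implicit reasoning and multi-constraint dependencies.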

Details Motivation: Existing LLMs fall short on complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies, which requires a deeper understanding of the reasoning structure implicit in the instruction. Method: ImpRIF (1) formalizes complex instructions as verifiable reasoning graphs; (2) synthesizes large-scale single-turn and multi-turn training data; (3) fine-tunes with graph reasoning; and (4) uses reinforcement learning to train models explicitly to reason along the graph. Result: On five complex instruction-following benchmarks, the resulting models substantially outperform their base models. Conclusion: Strengthening a model's grasp of implicit reasoning effectively improves complex instruction following. The project will be open-sourced. Abstract: As applications of large language models (LLMs) become increasingly complex, the demand for robust complex instruction following capabilities is growing accordingly. We argue that a thorough understanding of the instruction itself, especially the latent reasoning structure embedded between the lines, is crucial for improving instruction following. Therefore we target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies. We propose ImpRIF, a method to enhance LLMs' understanding of implicit reasoning instructions, thereby improving their ability to follow complex instructions. We formalize such instructions as verifiable reasoning graphs, enabling programmatic verification and graph-driven chain-of-thought reasoning. Based on this formulation, we synthesize large-scale single- and multi-turn data, propose fine-tuning with graph reasoning, and apply reinforcement learning to explicitly train models to reason along the graph. On five complex instruction following benchmarks, our models substantially outperform their base models. These results demonstrate that enhancing implicit reasoning capabilities can significantly improve complex instruction following. This project will be open-sourced in the near future.

[14] TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents

Yanyu Chen,Jiyue Jiang,Jiahong Liu,Yifei Zhang,Xiao Guo,Irwin King

Main category: cs.CL

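TL;DR: This paper proposes TRACE, a trajectory-aware comprehensive evaluation framework that addresses the "high-score illusion" and the limits of static benchmarks in evaluating deep research agents. It introduces a Hierarchical Trajectory Utility Function and a Scaffolded Capability Assessment protocol to measure an agent's reasoning quality, efficiency, robustness, and latent capability.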

Details Motivation: Conventional outcome-based metrics cannot capture the nuances of a deep research agent's complex reasoning. Two challenges stand out: the "high-score illusion" and the inability of static benchmarks to quantify robustness and latent capability. Method: The TRACE framework includes a Hierarchical Trajectory Utility Function (quantifying process efficiency, cognitive quality, and evidence grounding) and a Scaffolded Capability Assessment protocol (measuring latent capability by the minimum guidance needed), together with the DeepResearch-Bench benchmark. Result: Experiments show that TRACE gives a fine-grained ranking and exposes key trade-offs between accuracy, efficiency, and robustness that single metrics miss. Conclusion: TRACE is a more comprehensive and deeper evaluation paradigm for deep research agents that goes beyond single metrics and static benchmarks. Abstract: The evaluation of Deep Research Agents is a critical challenge, as conventional outcome-based metrics fail to capture the nuances of their complex reasoning. Current evaluation faces two primary challenges: 1) a reliance on singular metrics like Pass@1, creating a "high-score illusion" that ignores the quality, efficiency, and soundness of the reasoning process; and 2) the failure of static benchmarks to quantify crucial attributes like robustness and latent capability. To address these gaps, we introduce TRACE (Trajectory-Aware Comprehensive Evaluation), a framework that holistically assesses the entire problem-solving trajectory. To counter the "high-score illusion", we propose a Hierarchical Trajectory Utility Function that quantifies process efficiency and cognitive quality, including evidence grounding, alongside accuracy. To measure deeper attributes, TRACE introduces a Scaffolded Capability Assessment protocol, quantifying an agent's latent ability by determining the minimum guidance needed for success. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity. Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.

[15] Structured Prompt Language: Declarative Context Management for LLMs

Wen G. Gong

Main category: cs.CL

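TL;DR: This paper presents SPL, an SQL-style declarative language that treats large language models as generative knowledge bases. Explicit budget control, automatic query optimization, and integrated RAG and persistent memory make LLM applications more programmable and resource-efficient. Its extension SPL-flow supports resilient agentic pipelines and multi-model cooperation. Experiments show a large reduction in prompt-engineering overhead and expose cost differences between models.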

Details Motivation: Existing LLM programming paradigms do not manage resources such as the context window explicitly. Prompt engineering is highly redundant, deployment is inconsistent across models, large documents are handled inefficiently, and there is no SQL-like predictability, explainability, or declarative abstraction. Method: SPL is an SQL-inspired declarative language with WITH BUDGET/LIMIT syntax, an automatic query optimizer, and an EXPLAIN mechanism. SPL-flow adds a three-tier provider fallback strategy. Five extensions are presented, including Text2SPL, MoM routing, Logical Chunking (a CTE-based Map-Reduce), and BENCHMARK. An EBNF grammar, Python packages, and comparisons with Prompty, DSPy, and LMQL are provided. Result: SPL cuts prompt boilerplate by 65% on average and reveals a 68x cost spread between models before execution. The same .spl script runs unchanged on OpenRouter ($0.002) or on a local Ollama instance (zero marginal cost). Logical Chunking reduces attention complexity from O(N²) to O(N²/k). All extensions are implemented within one declarative framework. Conclusion: SPL establishes a new LLM programming paradigm centred on resource awareness, declarative control, and cross-model consistency, providing a foundation for auditable, optimizable, and portable generative AI systems. Abstract: We present SPL (Structured Prompt Language), a declarative SQL-inspired language that treats large language models as generative knowledge bases and their context windows as constrained resources. SPL provides explicit WITH BUDGET/LIMIT token management, an automatic query optimizer, EXPLAIN transparency analogous to SQL's EXPLAIN ANALYZE, and native integration of retrieval-augmented generation (RAG) and persistent memory in a single declarative framework. SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script. Five extensions demonstrate the paradigm's breadth: (1) Text2SPL (multilingual NL->SPL translation); (2) Mixture-of-Models (MoM) routing that dispatches each PROMPT to a domain-specialist model at runtime; (3) Logical Chunking, an intelligent strategy for documents exceeding a single context window--expressed naturally through SPL's existing CTE syntax with no new constructs, decomposing a large query into a Map-Reduce pipeline that reduces attention cost from O(N^2) to O(N^2/k) and runs identically on cloud (parallel) or local hardware (sequential); (4) SPL-flow, a declarative agentic orchestration layer with resilient three-tier provider fallback; and (5) BENCHMARK for parallel multi-model comparison with automatic winner persistence. We provide a formal EBNF grammar, two pip-installable Python packages (spl-llm, spl-flow), and comparison against Prompty, DSPy, and LMQL. SPL reduces prompt boilerplate by 65% on average, surfaces a 68x cost spread across model tiers as a pre-execution signal, and runs the identical .spl script at $0.002 on OpenRouter or at zero marginal cost on a local Ollama instance--without modification.
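The O(N^2) to O(N^2/k) claim for Logical Chunking in entry [15] follows from splitting N tokens into k equal chunks: each chunk costs (N/k)^2 and there are k of them, so the total is N^2/k. The sketch below checks that arithmetic. It models attention cost as the square of the sequence length, ignores the cost of the final reduce step, and assumes N is divisible by k; it does not use or depict any SPL syntax.

```python
def attention_cost(num_tokens):
    """Quadratic self-attention cost model: pairwise token interactions."""
    return num_tokens * num_tokens


def chunked_cost(num_tokens, num_chunks):
    """Total map-phase cost when the input is split into equal chunks.

    Assumes num_tokens is divisible by num_chunks and that chunks are
    processed independently (the reduce step is not counted).
    """
    chunk_size = num_tokens // num_chunks
    return num_chunks * attention_cost(chunk_size)


full = attention_cost(8000)
chunked = chunked_cost(8000, 8)
reduction_factor = full / chunked
```

For 8,000 tokens split into 8 chunks the map-phase cost falls by a factor of 8, that is, by exactly k, as the N^2/k expression predicts.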

[16] Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models

Sasha Robinson,Kerem Oktar,Katherine M. Collins,Ilia Sucholutsky,Kelsey R. Allen

Main category: cs.CL

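TL;DR: Using the multi-turn puzzle game Sokoban, this paper studies how persuasiveness, rational vigilance, and task performance relate when large language models (LLMs) act as advisors, and finds that the three capacities are independent. A model that solves puzzles well does not necessarily detect malicious advice, although models do adjust the number of reasoning tokens to the intent behind the advice.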

Details Motivation: As large language models take part in more high-stakes human decisions, the risks they introduce as advisors need to be understood, in particular how their vigilance towards benevolent and malicious information relates to their ability to persuade. Method: Sokoban, a multi-turn puzzle game, serves as the testbed. LLM agents advise one another, and the study measures puzzle-solving performance, how far each agent is persuaded, and whether it detects malicious advice (rational vigilance). Result: Puzzle-solving performance, persuasive capability, and rational vigilance are dissociable. Strong puzzle-solving does not imply the ability to detect deception. Even when models are still misled into failure, they clearly modulate token use, reasoning with fewer tokens when advice is benevolent and more when it is malicious. Conclusion: Persuasion, vigilance, and task performance need to be monitored independently, which matters for future AI safety research. Abstract: With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.

[17] ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Hyeonje Choi,Jeongsoo Lee,Hyojun Lee,Jay-Yoon Lee

Main category: cs.CL

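TL;DR: This paper introduces ToolMATH, a math-grounded benchmark for evaluating tool-augmented language models in multi-tool environments, with emphasis on reliability and reasoning under large, overlapping tool catalogs and when the intended capability is missing.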

Details Motivation: Tool-augmented language models lack a systematic evaluation of reliability and failure attribution in complex, realistic multi-tool settings, especially when tools are redundant or the needed capability is absent. Method: The ToolMATH benchmark contains roughly 8k math questions and 12k tools, plus a harder ToolMATHHard subset; with schema-specified tool calls and multi-step execution under control, it diagnoses tool selection, error accumulation in intermediate results, execution drift, and misuse of substitute tools. Result: The main failure factor is insufficient reasoning ability, which lets errors in intermediate results accumulate; redundant tool lists amplify small early deviations into irreversible execution drift; when the intended capability is missing, distractor tools can partially substitute but can also lead to ungrounded tool trajectories; gains come mainly from long-range plan coherence and disciplined use of observations rather than local action selection. Conclusion: ToolMATH gives tool-augmented agents a verifiable, diagnostic evaluation framework and shows that robustness depends on global reasoning and execution control, not on tool-call accuracy alone. Abstract: We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. ToolMATH contains roughly 8k questions and 12k tools; we provide an additional hard-set ToolMATHHard with questions and tools. Our evaluation reveals that the key failure factor is the inability to reason, which leads to an accumulation of errors in intermediate results that constrains later decisions. Tool-list redundancy does not simply add noise, but amplifies small early deviations into irreversible execution drift. The benchmark highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.

[18] VecGlypher: Unified Vector Glyph Generation with Language Models

Xiaoke Huang,Bhavul Gauri,Kam Woh Ng,Tony Ng,Mengmeng Xu,Zhiheng Liu,Weiming Ren,Zhaochong An,Zijian Zhou,Haonan Qiu,Yuyin Zhou,Sen He,Ziheng Wang,Tao Xiang,Xiao Han

Main category: cs.CL

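TL;DR: VecGlypher is a multimodal language model that generates high-quality, editable SVG vector glyphs directly from text descriptions or image exemplars, with no raster intermediate step.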

Details Motivation: Existing learning-based font generation depends on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. Method: VecGlypher is trained in two stages: a large-scale continuation stage on 39K noisy Envato fonts to learn SVG syntax and long-horizon geometry, then post-training on 2.5K expert-annotated Google Fonts to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates font families, and quantizes coordinates. The model emits SVG path tokens autoregressively. Result: On cross-family OOD evaluation, VecGlypher substantially outperforms general-purpose LLMs and specialized vector-font baselines for text-only generation; image-referenced generation reaches state-of-the-art performance with marked gains over DeepVecFont-v2 and DualVector; ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization works best. Conclusion: VecGlypher lowers the barrier to font creation by supporting design with words or exemplars, and offers a scalable foundation for future multimodal design tools. Abstract: Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry.
VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.

[19] Evaluating the Usage of African-American Vernacular English in Large Language Models

Deja Dunlap,R. Thomas McCoy

Main category: cs.CL

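TL;DR: This paper examines how accurately large language models (LLMs) represent African American Vernacular English (AAVE). The models underuse and misuse AAVE grammatical features and reproduce stereotypes about African Americans, which motivates more diverse training data and the adoption of fairness methods.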

Details Motivation: Most AI evaluations of natural language understanding use Standard American English (SAE) and overlook fair modeling of non-mainstream dialects such as AAVE; the authors test whether LLMs represent AAVE, a language variety with cultural and social significance, faithfully and accurately. Method: Typical usage contexts of AAVE grammatical features (such as ain't) are extracted from two corpora (CORAAL and TwitterAAE); three LLMs are prompted to generate AAVE text; model output is compared with human AAVE usage patterns, and bias is assessed through sentiment analysis and manual inspection. Result: The LLMs underuse and misuse AAVE grammatical features; the generated text reproduces stereotypes about African Americans; model usage of AAVE differs substantially from actual human usage. Conclusion: Current LLMs model AAVE poorly and unfairly; more diverse language data and fairness methods are needed so that the technology does not reinforce linguistic discrimination and social bias. Abstract: In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE). In this work, we investigate how accurately large language models (LLMs) represent African American Vernacular English (AAVE). We analyze three LLMs to compare their usage of AAVE to the usage of humans who natively speak AAVE. We first analyzed interviews from the Corpus of Regional African American Language and TwitterAAE to identify the typical contexts where people use AAVE grammatical features such as ain't. We then prompted the LLMs to produce text in AAVE and compared the model-generated text to human usage patterns. We find that, in many cases, there are substantial differences between AAVE usage in LLMs and humans: LLMs usually underuse and misuse grammatical features characteristic of AAVE. Furthermore, through sentiment analysis and manual inspection, we found that the models replicated stereotypes about African Americans. These results highlight the need for more diversity in training data and the incorporation of fairness methods to mitigate the perpetuation of stereotypes.

[20] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment

Barah Fazili,Koustava Goswami

Main category: cs.CL

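TL;DR: This paper uses a multi-way parallel corpus with contrastive learning to strengthen cross-lingual alignment in multilingual pretrained models, substantially improving XLM-Roberta and mBERT on several NLU tasks from the MTEB benchmark; on bitext mining, semantic similarity, and classification it outperforms conventional English-centric (En-X) bilingual data.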

Details Motivation: Multilingual pretraining typically lacks explicit alignment signals, which leaves cross-lingual alignment in the representation space suboptimal. Method: A multi-way parallel corpus covering English and six target languages is built by translating English text with an off-the-shelf NMT model, and the models are trained for cross-lingual alignment with contrastive learning. Result: On MTEB tasks the approach gains 21.3% on bitext mining, 5.3% on semantic similarity, and 28.4% on classification over En-X bilingual data; finetuning mE5 with multi-way parallel data also clearly improves bitext mining. Conclusion: The cross-lingual supervision provided by a multi-way parallel corpus is important for multilingual representations, even for models already optimized for sentence embeddings. Abstract: Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, finetuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
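The abstract says only that alignment is achieved "through contrastive learning" on multi-way parallel text. As a rough illustration, the sketch below assumes an InfoNCE-style loss with in-batch negatives, averaged over every ordered language pair; the loss form, the temperature value, and the function names are assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    every other row of `positives` acts as an in-batch negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def multi_way_loss(views, temperature=0.05):
    """views maps a language code to sentence embeddings aligned by index.
    Assumption: the loss is averaged over all ordered language pairs."""
    langs = sorted(views)
    pairs = [(a, b) for a in langs for b in langs if a != b]
    return sum(info_nce(views[a], views[b], temperature) for a, b in pairs) / len(pairs)


# Toy 2-D embeddings for two sentences in three languages (hypothetical values).
aligned = {
    "en": [[1.0, 0.0], [0.0, 1.0]],
    "de": [[0.9, 0.1], [0.1, 0.9]],
    "hi": [[1.0, 0.2], [0.2, 1.0]],
}
misaligned = dict(aligned, de=[[0.1, 0.9], [0.9, 0.1]])
```

Under this sketch, an En-X setup would restrict `pairs` to those involving `"en"`, whereas the multi-way setup also pulls the non-English translations toward each other directly.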

[21] MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Kazi Samin Yasar Alam,Md Tanbir Chowdhury,Tamim Ahmed,Ajwad Abrar,Md Rafid Haque

Main category: cs.CL

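TL;DR: This paper introduces MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification (humor, sarcasm, offensiveness, vulgarity), benchmarks a range of models on it, and exposes the limits of current models on pragmatically complex tasks such as sarcasm.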

Details Motivation: Bangla-English code-mixing is widespread on South Asian social media, yet resources for identifying implicit meaning (such as sarcasm and humor) in this setting are scarce; existing models mostly target monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. Method: The MixSarc corpus contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity; it is collected from social media, systematically filtered, and validated by multiple annotators; supervised transformer-based models and zero-shot large language models (with structured prompting) are evaluated on it. Result: Supervised models do well on humor detection but degrade substantially on sarcasm, offensiveness, and vulgarity because of class imbalance and pragmatic complexity; zero-shot LLMs reach competitive micro-F1 but low exact match accuracy; over 42% of negative-sentiment instances in an external dataset show sarcastic characteristics. Conclusion: MixSarc is a foundational resource for culturally aware NLP, supports more reliable multi-label modeling in code-mixed settings, and highlights the key challenges of implicit meaning identification. Abstract: Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.
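The gap between competitive micro-F1 and low exact match accuracy follows from how the two multi-label metrics are defined: micro-F1 pools individual label decisions, while exact match requires all four labels of a sentence to be right at once. The sketch below works through a toy case; the label order and the gold/predicted vectors are made up for illustration and are not MixSarc data.

```python
LABELS = ("humor", "sarcasm", "offensive", "vulgar")  # assumed label order


def micro_f1(gold, pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for gold_row, pred_row in zip(gold, pred):
        for g, p in zip(gold_row, pred_row):
            if g and p:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def exact_match(gold, pred):
    """Fraction of sentences whose full label vector is predicted correctly."""
    hits = sum(1 for g, p in zip(gold, pred) if tuple(g) == tuple(p))
    return hits / len(gold)


# Four hypothetical sentences; each tuple follows the order in LABELS.
gold = [(1, 1, 0, 0), (0, 1, 1, 0), (1, 0, 0, 0), (0, 0, 1, 1)]
pred = [(1, 0, 0, 0), (0, 1, 1, 1), (1, 0, 0, 0), (0, 0, 1, 0)]

print(round(micro_f1(gold, pred), 3))  # 0.769 (tp=5, fp=1, fn=2)
print(exact_match(gold, pred))         # 0.25 (only the third sentence is fully correct)
```

One wrong label per sentence is enough to zero out exact match for that sentence while costing micro-F1 only a single pooled decision.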

cs.CV [Back]

[22] StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

Jinghao Hu,Yuhe Zhang,GuoHua Geng,Kang Li,Han Zhang

Main category: cs.CV

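TL;DR: StoryTailor is a zero-shot method for multi-frame visual narrative generation. Three modules (Gaussian-Centered Attention, Action-Boost Singular Value Reweighting, and Selective Forgetting Cache) let it produce action-faithful, identity-consistent image sequences with continuous backgrounds on a single RTX 4090.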

Details Motivation: Multi-frame visual narratives face a threefold tension between faithfulness to action text, subject identity fidelity, and cross-frame background continuity, and the goal is to resolve it without fine-tuning. Method: The StoryTailor zero-shot pipeline has three core modules: Gaussian-Centered Attention (GCA) dynamically focuses on each subject core and eases grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) amplifies action-related directions in the text embedding; Selective Forgetting Cache (SFC) retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Result: CLIP-T improves by up to 10-15%, DreamSim is lower than strong baselines, and CLIP-I stays in a visually acceptable, competitive range; at matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext; qualitative results show expressive interactions and stably evolving scenes. Conclusion: In a resource-constrained zero-shot setting, StoryTailor balances action, identity, and background consistency, offering an efficient and practical approach to multi-frame generation driven by long narratives. Abstract: Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.

[23] HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

Yifan Wang,Francesco Pittaluga,Zaid Tasneem,Chenyu You,Manmohan Chandraker,Ziyu Jiang

Main category: cs.CV

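TL;DR: HorizonForge is a unified framework that reconstructs driving scenes as editable Gaussian Splats and Meshes, supports fine-grained 3D manipulation and language-driven vehicle insertion, and renders edits through noise-aware video diffusion for spatially and temporally consistent, high-quality output; the paper also proposes the HorizonSuite benchmark for standardized evaluation.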

Details Motivation: Existing controllable driving scene generation methods struggle to deliver photorealism and precise control together. Method: HorizonForge reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D editing and language-driven vehicle insertion; a noise-aware video diffusion model renders the edits while enforcing spatial and temporal consistency; the HorizonSuite benchmark provides comprehensive evaluation. Result: The Gaussian-Mesh representation delivers higher fidelity than alternative 3D representations; temporal priors from video diffusion are essential for coherent synthesis; compared with the second-best state-of-the-art method, HorizonForge achieves an 83.4% user-preference gain and a 25.19% FID improvement. Conclusion: HorizonForge establishes a simple yet powerful paradigm for driving simulation that is both photorealistic and strongly controllable. Abstract: Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second best state-of-the-art method. Project page: https://horizonforge.github.io/ .

[24] Scaling View Synthesis Transformers

Evan Kim,Hyunwoo Ryu,Thomas W. Mitchel,Vincent Sitzmann

Main category: cs.CV

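TL;DR: This paper systematically studies scaling laws for view synthesis transformers, proposes the Scalable View Synthesis Model (SVSM), shows that encoder-decoder architectures can be compute-optimal, and surpasses the previous state of the art on real-world benchmarks with less training compute.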

Details Motivation: Geometry-free view synthesis transformers achieve state-of-the-art results in Novel View Synthesis (NVS), but how they scale with compute is unclear, and design principles for compute-optimal training are needed. Method: A systematic scaling-law study proposes and validates an encoder-decoder architecture, the Scalable View Synthesis Model (SVSM), with fair comparisons obtained by matching training compute budgets and correcting suboptimal architectural choices. Result: Across several compute levels, SVSM scales as effectively as decoder-only models, achieves a better performance-compute Pareto frontier, and surpasses the previous state of the art on real-world NVS benchmarks with substantially less training compute. Conclusion: Encoder-decoder architectures can be compute-optimal for NVS; earlier negative findings stem from suboptimal architectural choices and unequal training compute budgets; SVSM offers an efficient approach to NVS modeling. Abstract: Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.

[25] Towards Controllable Video Synthesis of Routine and Rare OR Events

Dominik Schneider,Lalithkumar Seenivasan,Sampath Rapuri,Vishalroshan Anil,Aiza Maksutova,Yiqing Shen,Jan Emily Mangulabnan,Hao Ding,Jose L. Porras,Masaru Ishii,Mathias Unberath

Main category: cs.CV

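TL;DR: This paper presents a diffusion-based generation framework for operating room (OR) video that combines geometric abstraction, conditioning, and a fine-tuned diffusion model to synthesize routine as well as rare and safety-critical events in a controlled way, and builds a synthetic dataset for training models that detect near-misses of sterile-field violations.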

Details Motivation: Collecting large-scale OR workflow data, especially rare, safety-critical, or atypical events, is operationally and ethically challenging, which limits the development of ambient intelligence models. Method: The OR video diffusion framework comprises a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model: OR scenes are first converted into geometric representations, synthesis is then conditioned on them, and realistic video is finally generated; the framework is also used to build a synthetic dataset for detecting near-misses of sterile-field violations. Result: For routine events, the method outperforms off-the-shelf video diffusion baselines (lower FVD/LPIPS, higher SSIM/PSNR); it supports controlled video synthesis of counterfactual events; an AI model trained on the synthetic data reaches 70.13% recall in detecting near safety-critical events; an ablation study confirms the value of the key design choices. Conclusion: The method generates routine and rare OR events in a controlled way from abstract geometric representations, supporting both the synthesis of rare and safety-critical scenarios and the development of ambient intelligence models for the OR. Abstract: Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR. Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model to first transform OR scenes into abstract geometric representations, then condition the synthesis process, and finally generate realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near-misses of sterile-field violations. Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR in both in- and out-of-domain datasets. Through qualitative results, we illustrate its ability for controlled video synthesis of counterfactual events. An AI model trained and validated on the generated synthetic data achieved a recall of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify performance gains from key design choices. Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations.
Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models.

[26] Momentum Memory for Knowledge Distillation in Computational Pathology

Yongxin Guo,Hao Lu,Onur C. Koyun,Zhengjie Zhu,Muhammet Fatih Demir,Metin Nafi Gurcan

Main category: cs.CV

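TL;DR: This paper proposes Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework in which a momentum-updated memory aggregates genomic and histopathology information across batches to offset the scarcity of paired data, and the gradients of the two modalities are decoupled to avoid modality bias, substantially improving the performance and generalization of cancer diagnosis from histopathology images alone.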

Details Motivation: Multimodal learning shows strong potential for cancer diagnosis, but clinical translation is limited by the scarcity of paired histology-genomics data; existing knowledge distillation methods rely on batch-local alignment, which causes instability and degrades performance. Method: MoMKD (1) builds a momentum-updated memory that aggregates genomic and histopathology features across batches, enlarging the supervisory context; and (2) decouples the gradients of the genomics and histology branches so that genomic signals do not dominate histology feature learning, eliminating the modality gap at inference time. Result: On TCGA-BRCA (HER2, PR, and ODX classification) and an independent in-house test set, MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, with strong performance and generalization under histology-only inference. Conclusion: MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology and supports the clinical translation of multimodal biomedical AI. Abstract: Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology-genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance. To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time. Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.
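The abstract does not spell out how the momentum-updated memory is organized. One common way to realize such a memory is an exponential moving average (EMA) per slot, sketched below; keying slots by class label, the momentum value, and the class and method names are all assumptions for illustration, not the MoMKD implementation.

```python
class MomentumMemory:
    """Hypothetical EMA memory: each slot stores a running average of features
    seen across mini-batches, so a batch can be compared against more context
    than it contains itself."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.slots = {}  # slot key (here: a class label) -> feature vector

    def update(self, key, feature):
        """EMA update: slot = momentum * slot + (1 - momentum) * feature."""
        if key not in self.slots:
            self.slots[key] = list(feature)  # first observation initializes the slot
            return
        m = self.momentum
        self.slots[key] = [m * old + (1.0 - m) * new
                           for old, new in zip(self.slots[key], feature)]

    def read(self, key):
        """Return the stored vector; in training it would be used as a constant
        target, so no gradient flows back through the memory."""
        return self.slots[key]


memory = MomentumMemory(momentum=0.5)
memory.update("HER2+", [1.0, 0.0])  # feature from one batch (made-up values)
memory.update("HER2+", [0.0, 1.0])  # feature from a later batch
print(memory.read("HER2+"))         # [0.5, 0.5]
```

A higher momentum makes the slot change more slowly, which trades responsiveness for stability of the distillation target.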

[27] MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation

Sajjad Ghiasvand,Haniyeh Ehsani Oskouie,Mahnoosh Alizadeh,Ramtin Pedarsani

Main category: cs.CV

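TL;DR: This paper proposes MMLoP, a framework that uses low-rank factorization to bring deep multi-modal prompt learning to vision-language models such as CLIP with only 11.5K trainable parameters, improving few-shot performance while staying parameter-efficient; a consistency loss, drift correction, and a shared up-projection further strengthen cross-modal alignment and generalization.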

Details Motivation: Existing deep multi-modal prompting methods perform well but require very large numbers of parameters, contrary to the parameter efficiency that motivates prompt learning; a method that combines high performance with a very small parameter count is needed. Method: MMLoP (1) parameterizes the prompts at each layer of the vision and text encoders through a low-rank factorization; (2) adds a self-regulating consistency loss that keeps feature-level and logit-level representations close to frozen zero-shot CLIP features; (3) applies a uniform drift correction to remove the global embedding shift introduced by prompts; and (4) couples vision and text prompts through a shared up-projection, a common low-rank factor. Result: Across three benchmarks and 11 datasets, MMLoP with only 11.5K parameters outperforms most existing methods, including some with orders of magnitude more parameters, and reaches a harmonic mean of 79.70% on base-to-novel generalization. Conclusion: MMLoP makes deep multi-modal prompt learning parameter-efficient and strikes a favorable balance between accuracy and efficiency for lightweight adaptation of VLMs. Abstract: Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment.
Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization.
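To make the parameter saving from low-rank factorization concrete, the sketch below builds a prompt matrix as the product of a small "down" factor and an "up" factor and compares parameter counts. The prompt length, embedding width, and rank are illustrative assumptions and are not intended to reproduce the 11.5K figure; the harmonic mean helper shows the standard formula behind base-to-novel scores.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x r) and b (r x d)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def low_rank_prompt(down, up):
    """Prompt tokens P = down @ up, with down: (n_tokens x rank), up: (rank x width)."""
    return matmul(down, up)


def param_counts(n_tokens, width, rank):
    """Trainable parameters for a full prompt versus its low-rank factorization."""
    full = n_tokens * width
    low_rank = rank * (n_tokens + width)
    return full, low_rank


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical sizes: 4 prompt tokens, width 512, rank 2.
print(param_counts(4, 512, 2))                           # (2048, 1032)
print(low_rank_prompt([[1], [2]], [[3, 4, 5]]))          # [[3, 4, 5], [6, 8, 10]]
```

In a shared up-projection design, the `up` factor would be reused by the vision and text prompts, so only the small `down` factors differ per modality; how MMLoP handles differing encoder widths is not stated in the abstract.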

[28] FlowFixer: Towards Detail-Preserving Subject-Driven Generation

Jinyoung Jun,Won-Dong Jang,Wenbin Ouyang,Raghudeep Gadde,Jungbeom Lee

Main category: cs.CV

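TL;DR: FlowFixer is a refinement framework for subject-driven generation (SDG) that restores details lost to changes in scale and perspective through direct image-to-image translation; it introduces a self-supervised training strategy and a keypoint-matching evaluation metric, and markedly improves the fidelity of generated results.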

Details Motivation: Subject-driven generation loses fine details when the scale and perspective of a subject change, and language prompts introduce ambiguity. Method: FlowFixer performs image-to-image translation; a one-step denoising scheme generates self-supervised training data that simulates real SDG errors; a keypoint-matching metric evaluates detail fidelity. Result: FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation. Conclusion: FlowFixer improves detail restoration in SDG and offers a new approach to high-fidelity generation. Abstract: We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.

[29] Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation

Asim Unmesh,Kaki Ramesh,Mayank Patel,Rahul Jain,Karthik Ramani

Main category: cs.CV

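TL;DR: This paper proposes a training-free approach to Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) that exploits the zero-shot capabilities of vision-language models (VLMs): frames are matched to action labels by embedding similarity, the similarity matrix is then segmented over time, and 14 VLMs are systematically evaluated on the task.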

Details Motivation: Existing temporal action segmentation methods are limited to closed vocabularies and fixed label sets and cannot cover the vast, varied space of real-world activities; collecting comprehensive annotations is infeasible, so an open-vocabulary, zero-shot formulation is needed. Method: A training-free pipeline with two parts: (1) Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels in the VLM embedding space; (2) Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency on the similarity matrix. The study also evaluates 14 VLMs on OVTAS. Result: On standard benchmarks, OVTAS achieves strong results without any task-specific supervision, supporting the potential of VLMs for structured temporal understanding; performance differs markedly between VLMs, and the study provides the first broad VLM evaluation for open-vocabulary action segmentation. Conclusion: Open-vocabulary zero-shot temporal action segmentation is feasible and promising; VLMs have the basic capability to support structured temporal understanding, but the properties of their embedding spaces and their fit to the task need further study. Abstract: Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
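The segmentation-by-classification design can be sketched in a few lines. The FAES step below follows the abstract (cosine similarity between frame and action-label embeddings); the temporal step is a stand-in, since the abstract does not describe how SMTS enforces consistency: here a simple sliding-window average over the similarity matrix is assumed. All embeddings and names are toy values for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(frame_embs, action_embs):
    """FAES-style step: one row per frame, one column per candidate action label."""
    return [[cosine(f, a) for a in action_embs] for f in frame_embs]


def smooth(sim, window=3):
    """Stand-in for SMTS (assumption): average each column over a sliding time window."""
    half = window // 2
    out = []
    for t in range(len(sim)):
        lo, hi = max(0, t - half), min(len(sim), t + half + 1)
        out.append([sum(sim[s][k] for s in range(lo, hi)) / (hi - lo)
                    for k in range(len(sim[0]))])
    return out


def to_segments(labels):
    """Merge consecutive identical labels into (label, first_frame, last_frame)."""
    segments = []
    for t, label in enumerate(labels):
        if segments and segments[-1][0] == label:
            segments[-1][2] = t
        else:
            segments.append([label, t, t])
    return [tuple(s) for s in segments]


def segment_video(frame_embs, action_embs, action_names, window=3):
    sim = smooth(similarity_matrix(frame_embs, action_embs), window)
    labels = [action_names[max(range(len(row)), key=row.__getitem__)] for row in sim]
    return to_segments(labels)


# Toy 2-D embeddings; frame 1 is a noisy frame that looks like "stir" in isolation.
actions = [[1.0, 0.0], [0.0, 1.0]]
names = ["pour", "stir"]
frames = [[1.0, 0.1], [0.4, 0.6], [1.0, 0.0], [1.0, 0.2],
          [0.1, 1.0], [0.0, 1.0], [0.2, 1.0]]

print(segment_video(frames, actions, names, window=3))
# [('pour', 0, 3), ('stir', 4, 6)]
```

With `window=1` (no smoothing) the noisy frame produces four fragments instead of two segments, which is the over-segmentation that a temporal-consistency step is meant to suppress.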

[30] WildSVG: Towards Reliable SVG Generation Under Real-Word Conditions

Marco Terral,Haotian Zhang,Tianyang Zhang,Meng Lin,Xiaoqing Xie,Haoran Dai,Darsh Kaushik,Pai Peng,Nicklas Scharpff,David Vazquez,Joan Rodriguez

Main category: cs.CV

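TL;DR: This paper introduces the task of SVG extraction, which aims to extract scalable vector graphics from natural images, and builds the WildSVG benchmark (with the Natural WildSVG and Synthetic WildSVG datasets) to fill the evaluation gap in this area; experiments show that existing models fall short on the task, while iterative refinement methods show promise.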

Details Motivation: Existing multimodal models generate SVGs well from clean renderings or textual descriptions, but their performance drops markedly on real images with noise, clutter, and domain shift, and no benchmark suited to real-world SVG extraction exists. Method: The paper defines the SVG extraction task; builds the WildSVG benchmark, consisting of Natural WildSVG (real images containing company logos paired with SVG annotations) and Synthetic WildSVG (complex SVG renderings blended into real scenes); systematically evaluates state-of-the-art multimodal models; and explores iterative refinement methods. Result: State-of-the-art multimodal models perform well below the level needed for reliable use on WildSVG; iterative refinement shows potential for improvement, and model capabilities are steadily improving. Conclusion: WildSVG provides the first systematic evaluation foundation for SVG extraction, exposes the limits of existing methods in real scenarios, and points to iterative refinement as a workable direction. Abstract: We introduce the task of SVG extraction, which consists of translating specific visual inputs from an image into scalable vector graphics. Existing multimodal models achieve strong results when generating SVGs from clean renderings or textual descriptions, but they fall short in real-world scenarios where natural images introduce noise, clutter, and domain shifts. A central challenge in this direction is the lack of suitable benchmarks. To address this need, we introduce the WildSVG Benchmark, formed by two complementary datasets: Natural WildSVG, built from real images containing company logos paired with their SVG annotations, and Synthetic WildSVG, which blends complex SVG renderings into real scenes to simulate difficult conditions. Together, these resources provide the first foundation for systematically benchmarking SVG extraction. We benchmark state-of-the-art multimodal models and find that current approaches perform well below what is needed for reliable SVG extraction in real scenarios. Nonetheless, iterative refinement methods point to a promising path forward, and model capabilities are steadily improving.

[31] ECHOSAT: Estimating Canopy Height Over Space And Time

Jan Pauls,Karsten Schrödter,Sven Ligensa,Martin Schwartz,Berkant Turan,Max Zimmer,Sassan Saatchi,Sebastian Pokutta,Philippe Ciais,Fabian Gieseke

Main category: cs.CV

TL;DR: ECHOSAT is a global, temporally consistent tree height mapping system at 10 m resolution, leveraging multi-sensor satellite data and a vision transformer with self-supervised growth regularization to capture both growth and disturbances (e.g., fires) over time.

Details Motivation: Existing global tree height maps are static and fail to capture temporal forest dynamics essential for accurate carbon accounting and climate change mitigation. Method: A specialized vision transformer trained on multi-sensor satellite data performs pixel-level temporal regression; a self-supervised growth loss enforces biologically plausible height trajectories—including gradual growth and abrupt declines from disturbances. Result: ECHOSAT achieves state-of-the-art accuracy for single-year predictions and delivers the first global-scale, temporally resolved tree height map quantifying growth and disturbances over time. Conclusion: ECHOSAT advances global carbon monitoring and forest disturbance assessment by enabling dynamic, high-resolution forest height tracking. Abstract: Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi-sensor satellite data to train a specialized vision transformer model, which performs pixel-level temporal regression. A self-supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions. We also provide the first global-scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at https://github.com/ai4forest/echosat.

[32] Automating Timed Up and Go Phase Segmentation and Gait Analysis via the tugturn Markerless 3D Pipeline

Abel Gonçalves Chinaglia,Guilherme Manna Cesar,Paulo Roberto Pereira Santiago

Main category: cs.CV

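TL;DR: This paper presents tugturn.py, a Python tool for markerless 3D Timed Up and Go (TUG) motion analysis that supports phase segmentation, gait-event detection, spatiotemporal parameters, intersegmental coordination, and dynamic stability assessment, and produces reproducible HTML reports, CSV data, and visual outputs.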

Details Motivation: Markerless instrumented TUG analysis lacks robust and reproducible processing pipelines, which limits its use in clinical practice and research. Method: The Python-based tugturn.py workflow combines phase segmentation (stand, first gait, turning, second gait, sit) driven by spatial thresholds, gait-event detection (heel-strike and toe-off) based on a relative-distance strategy, and advanced biomechanical metrics such as Vector Coding and Extrapolated Center of Mass (XCoM); configuration is done through TOML files, and outputs include HTML reports, CSV tables, and quality-assurance visualizations. Result: The result is a complete, ready-to-run markerless TUG analysis tool with test data, command-line examples, and documentation that make results reproducible. Conclusion: tugturn.py fills the gap for a robust, modular, reproducible software tool for quantitative markerless TUG analysis and may benefit clinical assessment and movement science research. Abstract: Instrumented Timed Up and Go (TUG) analysis can support clinical and research decision-making, but robust and reproducible markerless pipelines are still limited. We present tugturn.py, a Python-based workflow for 3D markerless TUG processing that combines phase segmentation, gait-event detection, spatiotemporal metrics, intersegmental coordination, and dynamic stability analysis. The pipeline uses spatial thresholds to segment each trial into stand, first gait, turning, second gait, and sit phases, and applies a relative-distance strategy to detect heel-strike and toe-off events within valid gait windows. In addition to conventional kinematics, tugturn provides Vector Coding outputs and Extrapolated Center of Mass (XCoM)-based metrics. The software is configured through TOML files and produces reproducible artifacts, including HTML reports, CSV tables, and quality-assurance visual outputs. A complete runnable example is provided with test data and command-line instructions. This manuscript describes the implementation, outputs, and reproducibility workflow of tugturn as a focused software contribution for markerless biomechanical TUG analysis.
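For readers unfamiliar with the Extrapolated Center of Mass, the sketch below shows the standard inverted-pendulum definition, XCoM = x + v / ω0 with ω0 = sqrt(g / l), and the derived margin of stability. This is the textbook formula, not code from tugturn; the function names are hypothetical, and which pendulum length the tool uses (for example leg length or a scaled value) is not stated in the abstract.

```python
import math

GRAVITY = 9.81  # m/s^2


def xcom(com_position, com_velocity, pendulum_length):
    """Extrapolated center of mass along one axis (metres).
    omega0 is the eigenfrequency of the inverted-pendulum model."""
    omega0 = math.sqrt(GRAVITY / pendulum_length)
    return com_position + com_velocity / omega0


def margin_of_stability(bos_boundary, com_position, com_velocity, pendulum_length):
    """Distance from the XCoM to the base-of-support boundary;
    positive values mean the XCoM lies inside the boundary."""
    return bos_boundary - xcom(com_position, com_velocity, pendulum_length)


# Illustrative values only: CoM at 0.10 m moving at 1.2 m/s, pendulum length 0.9 m,
# base-of-support boundary at 0.55 m.
print(round(xcom(0.10, 1.2, 0.9), 3))
print(round(margin_of_stability(0.55, 0.10, 1.2, 0.9), 3))
```

Because the velocity term is scaled by 1 / ω0, a fast-moving center of mass can leave the base of support in XCoM terms even while its position is still inside it, which is why the metric is used for dynamic rather than static stability.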

[33] PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models

Binesh Sadanandan,Vahid Behzadan

Main category: cs.CV

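TL;DR: This paper proposes the PSF-Med benchmark, which evaluates the paraphrase sensitivity of medical vision language models on chest X-ray question answering. It finds that existing models have high flip rates and that some rely on text priors rather than the image. Sparse autoencoder analysis identifies a key feature tied to prompt framing, and intervening on that feature substantially reduces flip rates, underscoring the need to evaluate both paraphrase stability and image reliance.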

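Details Motivation: Medical vision language models (VLMs) give inconsistent answers when a clinical question is paraphrased, which creates deployment risk and calls for a systematic evaluation of their robustness and visual grounding. Method: Builds the PSF-Med benchmark (19,748 questions plus about 92,000 meaning-preserving paraphrases) and measures yes/no answer flip rates for six medical VLMs; adds text-only baselines to test image reliance; applies GemmaScope 2 sparse autoencoders to MedGemma 4B and localizes and causally intervenes on a key feature using the FlipBank dataset. Result: Flip rates across the six models range from 8% to 58%; some models have low flip rates but behave similarly under the text-only baseline, indicating reliance on language priors; a sparse feature at layer 17 is found whose activation correlates strongly with prompt framing, and causally removing it recovers 45% of the yes-minus-no logit margin and fully reverses 15% of flips; clamping the feature at inference reduces the flip rate by 31% relative at a cost of only 1.3 percentage points of accuracy, and also lowers text-prior reliance. Conclusion: Flip rate alone is not enough to measure the robustness of medical VLMs; robustness evaluations must test paraphrase stability and image reliance together; interpretability-driven feature interventions can improve model consistency and visual grounding. Abstract: Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest X-ray questions paired with about 92,000 meaning-preserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.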
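To make the flip-rate and "31% relative" figures concrete, here is a small sketch of how such numbers could be computed. The abstract does not give the exact flip definition, so this assumes a flip is any paraphrase answer that differs from the answer to the original question for the same image. The data layout, function names, and sample values are all hypothetical.

```python
def flip_rate(original_answers, paraphrase_answers):
    """Fraction of paraphrase answers that differ from the original answer.

    original_answers: dict mapping question id -> "yes" or "no"
    paraphrase_answers: dict mapping question id -> list of answers, one per paraphrase
    (Assumed definition; the benchmark's exact rule may differ.)
    """
    flips = 0
    total = 0
    for qid, base in original_answers.items():
        for answer in paraphrase_answers.get(qid, []):
            total += 1
            if answer != base:
                flips += 1
    return flips / total if total else 0.0


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, e.g. 0.20 -> 0.138 is about 0.31."""
    return (before - after) / before


# Hypothetical answers for two questions with two paraphrases each.
originals = {"q1": "yes", "q2": "no"}
paraphrases = {"q1": ["yes", "no"], "q2": ["no", "no"]}
rate = flip_rate(originals, paraphrases)
```

A relative reduction is measured against the starting flip rate, which is why it is reported separately from the 1.3 percentage-point accuracy cost, an absolute difference.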

[34] Automatic Map Density Selection for Locally-Performant Visual Place Recognition

Somayeh Hussaini,Tobias Fischer,Michael Milford

Main category: cs.CV

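TL;DR: This paper proposes a dynamic mapping approach for Visual Place Recognition (VPR) that analyzes match patterns between multiple reference traverses to automatically select a reference map density meeting a user-specified local Recall@1 and the proportion of the environment over which it must hold (RAR), improving the controllability and reliability of VPR systems in long-term deployment.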

Details Motivation: Existing VPR research mostly relies on benchmark datasets with fixed sampling densities and overlooks how reference map density affects local performance; real deployments need the system to meet user-defined requirements in different parts of the environment, not just on average globally. Method: Using two reference traverses, models match patterns across different map densities to predict the map density required to meet a target local Recall@1 and Recall Achievement Rate (RAR); validates the strategy on benchmarks including Nordland and Oxford RobotCar. Result: Across multiple VPR methods and benchmarks, the approach consistently meets or exceeds the user-specified local Recall@1 level over at least the specified proportion of the operating environment; compared with baselines, it avoids unnecessary over-densification; and global Recall@1 is found to be a poor predictor of RAR. Conclusion: Map density is a key tunable parameter for controlling local VPR performance; the proposed dynamic density selection offers a practical route to user-controllable, environment-adaptive VPR systems, and local metrics such as RAR are more operationally meaningful than global ones. Abstract: A key challenge in translating Visual Place Recognition (VPR) from the lab to long-term deployment is ensuring a priori that a system can meet user-specified performance requirements across different parts of an environment, rather than just on average globally. A critical mechanism for controlling local VPR performance is the density of the reference mapping database, yet this factor is largely neglected in existing work, where benchmark datasets with fixed, engineering-driven (sensors, storage, GPS frequency) sampling densities are typically used. In this paper, we propose a dynamic VPR mapping approach that uses pairs of reference traverses from the target environment to automatically select an appropriate map density to satisfy two user-defined requirements: (1) a target Local Recall@1 level, and (2) the proportion of the operational environment over which this requirement must be met or exceeded, which we term the Recall Achievement Rate (RAR). Our approach is based on the hypothesis that match patterns between multiple reference traverses, evaluated across different map densities, can be modelled to predict the density required to meet these performance targets on unseen deployment data. Through extensive experiments across multiple VPR methods and the Nordland and Oxford RobotCar benchmarks, we show that our system consistently achieves or exceeds the specified local recall level over at least the user-specified proportion of the environment. 
Comparisons with alternative baselines demonstrate that our approach reliably selects the correct operating point in map density, avoiding unnecessary over-densification. Finally, ablation studies and analysis evaluate sensitivity to reference map choice and local space definitions, and reveal that conventional global Recall@1 is a poor predictor of the often more operationally meaningful RAR metric.
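The gap between global Recall@1 and RAR can be seen in a toy calculation. The sketch below assumes the environment has already been split into local segments, each with its own Recall@1, and takes RAR as the fraction of segments that meet or exceed the target. The segmentation, function names, and numbers are illustrative assumptions and do not reproduce the paper's local space definitions.

```python
def local_recall_at_1(correct_flags):
    """Recall@1 for one local segment: share of queries whose top-1 match is correct."""
    return sum(1 for ok in correct_flags if ok) / len(correct_flags)


def recall_achievement_rate(segment_recalls, target_recall):
    """Fraction of local segments whose Recall@1 meets or exceeds the target."""
    met = sum(1 for r in segment_recalls if r >= target_recall)
    return met / len(segment_recalls)


# Hypothetical per-segment recalls for a route split into four segments.
segment_recalls = [1.0, 0.9, 0.5, 0.8]
global_recall = sum(segment_recalls) / len(segment_recalls)
rar = recall_achievement_rate(segment_recalls, target_recall=0.8)
```

In this made-up case the average over segments sits at the 0.8 target, yet only three of the four segments reach it, so RAR is 0.75. One weak segment is hidden by the global average but shows up directly in RAR.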

[35] Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning

Yushen He

Main category: cs.CV

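TL;DR: This paper proposes SPL, a framework that uses semantic pseudo-labeling and prototype learning to handle unsupervised and sparsely-supervised 3D object detection in a unified way, significantly improving performance and reducing dependence on manual annotation.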

Details Motivation: 3D object detection depends on large amounts of manually annotated data, which limits scalability and adaptability; existing unsupervised and sparsely-supervised methods suffer from low-quality pseudo-labels, unstable feature mining, and the lack of a unified training framework. Method: Proposes the SPL framework, which fuses image semantics, point cloud geometry, and temporal cues to generate high-quality pseudo-labels (3D boxes and point-level labels); the pseudo-labels serve as probabilistic priors in a multi-stage prototype learning strategy with memory-based initialization and momentum-based updating, which stably mines features from both labeled and unlabeled data. Result: On KITTI and nuScenes, SPL significantly outperforms state-of-the-art methods in both the unsupervised and sparsely-supervised settings. Conclusion: SPL provides a robust, general solution for 3D object detection with minimal or no manual annotation. Abstract: 3D object detection is essential for autonomous driving and robotic perception, yet its reliance on large-scale manually annotated data limits scalability and adaptability. To reduce annotation dependency, unsupervised and sparsely-supervised paradigms have emerged. However, they face intertwined challenges: low-quality pseudo-labels, unstable feature mining, and a lack of a unified training framework. This paper proposes SPL, a unified training framework for both Unsupervised and Sparsely-Supervised 3D Object Detection via Semantic Pseudo-labeling and prototype Learning. SPL first generates high-quality pseudo-labels by integrating image semantics, point cloud geometry, and temporal cues, producing both 3D bounding boxes for dense objects and 3D point labels for sparse ones. These pseudo-labels are not used directly but as probabilistic priors within a novel, multi-stage prototype learning strategy. This strategy stabilizes feature representation learning through memory-based initialization and momentum-based prototype updating, effectively mining features from both labeled and unlabeled data. Extensive experiments on KITTI and nuScenes datasets demonstrate that SPL significantly outperforms state-of-the-art methods in both settings. Our work provides a robust and generalizable solution for learning 3D object detectors with minimal or no manual annotations.
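As a rough illustration of "memory-based initialization and momentum-based prototype updating", the sketch below uses the common exponential-moving-average form of a momentum update. The abstract does not specify SPL's actual rules, so the function names, the mean-of-memory initialization, and the momentum value are assumptions.

```python
def init_prototype(memory):
    """Initialize a class prototype as the mean of feature vectors kept in memory."""
    dim = len(memory[0])
    return [sum(vec[i] for vec in memory) / len(memory) for i in range(dim)]


def momentum_update(prototype, feature, momentum=0.9):
    """Move the prototype slightly toward a new feature: p <- m * p + (1 - m) * f."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


# Hypothetical 2D features stored for one class, followed by one new observation.
prototype = init_prototype([[0.0, 2.0], [2.0, 4.0]])
prototype = momentum_update(prototype, [3.0, 3.0], momentum=0.5)
```

A momentum close to 1 makes the prototype change slowly, which is the usual reason such updates are described as stabilizing when the incoming features come from noisy pseudo-labels.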

[36] See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

Yongchang Zhang,Xianzheng Ma,Tianyi Liu,Guangquan Zhou,Yang Chen

Main category: cs.CV

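TL;DR: This paper proposes a lightweight, training-free, plug-and-play method that supervises each reasoning step with visual evidence at test time and dynamically expands a pool of textual visual evidence, suppressing the propagation of visual hallucinations in multimodal chain-of-thought reasoning.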

Details Motivation: Existing reinforcement-learning-based approaches to multimodal chain-of-thought reasoning are costly, model-specific, and generalize poorly; once a visual hallucination occurs in an intermediate step, it keeps contaminating later steps even when they are logically valid, leading to a wrong final answer. Method: Builds a textual visual-evidence pool and supervises reasoning token by token at test time; when the evidence is insufficient, a visual decider module extracts new evidence from the image based on the current reasoning context, expanding the pool until visual certainty is reached and reasoning terminates. Result: Validated on multiple LVLM backbones and benchmarks (TreeBench, RH-Bench): 16.5%-29.5% improvement on TreeBench and a 13.7% RH-AUC gain on RH-Bench, substantially reducing hallucination rates and improving reasoning accuracy without additional training. Conclusion: The method is a general, efficient, training-free framework for visually grounded reasoning and offers a new approach to mitigating visual hallucination in large multimodal models. Abstract: Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. In contrast, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. 
Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

[37] Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow

Shimin Hu,Yuanyi Wei,Fei Zha,Yudong Guo,Juyong Zhang

Main category: cs.CV

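TL;DR: This paper proposes a feed-forward 3D editing framework built on the TRELLIS generative backbone. Voxel FlowEdit achieves globally consistent 3D deformation driven by a single view, and a normal-guided single-to-multi-view generation module restores high-fidelity textures, addressing the heavy computation, multi-view inconsistency, and appearance degradation of existing methods.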

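Details Motivation: Existing 3D editing methods rely on per-scene iterative optimization, which is computationally expensive, and suffer from multi-view inconsistency; it is also difficult to adapt training-free 2D editing to structured 3D representations, and appearance fidelity is bottlenecked by compressed 3D features. Method: Proposes a fully feed-forward 3D editing framework based on TRELLIS; introduces Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that yields globally consistent 3D deformation; designs a normal-guided single-view to multi-view generation module as an appearance prior to recover high-frequency textures. Result: Experiments show that the method enables fast, globally consistent, high-fidelity 3D model editing, with notable gains in editing efficiency and quality. Conclusion: The work delivers feed-forward 3D editing that needs no iterative optimization, is driven by a single view, and maintains global geometric consistency and high appearance fidelity, offering a new approach to real-time controllable 3D content creation. Abstract: Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore high-fidelity details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.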

[38] AHAN: Asymmetric Hierarchical Attention Network for Identical Twin Face Verification

Hoang-Nhat Nguyen

Main category: cs.CV

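TL;DR: This paper proposes AHAN, a network that combines hierarchical cross-attention, facial asymmetry attention, and twin-aware pair-wise cross-attention modules to raise identical-twin face verification accuracy to 92.3%, a reported 3.4% improvement over the state of the art.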

Details Motivation: Existing methods perform very well on standard face datasets, but accuracy drops sharply (to 88.9%) when distinguishing identical twins, exposing a serious weakness in biometric security systems; the core difficulty is modeling subtle, non-genetic individual differences. Method: Proposes the Asymmetric Hierarchical Attention Network (AHAN): 1) a Hierarchical Cross-Attention (HCA) module for multi-scale analysis of semantic facial regions; 2) a Facial Asymmetry Attention Module (FAAM) that models asymmetry between the left and right halves of the face; 3) Twin-Aware Pair-Wise Cross-Attention (TA-PWCA), a training-time regularization strategy that uses each subject's identical twin as the hardest negative. Result: Achieves 92.3% identical-twin verification accuracy on the ND_TWIN dataset, 3.4 percentage points above the current best method. Conclusion: Through multi-granularity analysis and facial asymmetry modeling, AHAN captures subtle but discriminative biometric features between identical twins, markedly improving extreme fine-grained face verification and strengthening the robustness and security of biometric systems. Abstract: Identical twin face verification represents an extreme fine-grained recognition challenge where even state-of-the-art systems fail due to overwhelming genetic similarity. Current face recognition methods achieve over 99.8% accuracy on standard benchmarks but drop dramatically to 88.9% when distinguishing identical twins, exposing critical vulnerabilities in biometric security systems. The difficulty lies in learning features that capture subtle, non-genetic variations that uniquely identify individuals. We propose the Asymmetric Hierarchical Attention Network (AHAN), a novel architecture specifically designed for this challenge through multi-granularity facial analysis. AHAN introduces a Hierarchical Cross-Attention (HCA) module that performs multi-scale analysis on semantic facial regions, enabling specialized processing at optimal resolutions. We further propose a Facial Asymmetry Attention Module (FAAM) that learns unique biometric signatures by computing cross-attention between left and right facial halves, capturing subtle asymmetric patterns that differ even between twins. To ensure the network learns truly individuating features, we introduce Twin-Aware Pair-Wise Cross-Attention (TA-PWCA), a training-only regularization strategy that uses each subject's own twin as the hardest possible distractor. Extensive experiments on the ND_TWIN dataset demonstrate that AHAN achieves 92.3% twin verification accuracy, representing a 3.4% improvement over state-of-the-art methods.

[39] Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning

Zheang Huai,Honglong Yang,Xiaomeng Li

Main category: cs.CV

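TL;DR: This paper proposes TEA-CXA, a framework for learning tool trustworthiness in medical multimodal settings, which uses reinforcement learning so that an AI agent dynamically assesses and selects the most reliable AI tool during chest X-ray analysis, significantly improving diagnostic consistency and performance.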

Details Motivation: Existing medical AI agents cannot model how reliable AI tools actually are, especially in multimodal, error-prone settings, and so cannot effectively resolve conflicts between tool outputs. Method: Proposes an agentic learning framework in which the agent, over multi-turn multimodal tool calls, empirically learns from reward feedback how trustworthy each tool is for different query types; the concrete instantiation, TEA-CXA, supports multiple tool calls in one turn, parallel inference, and multi-image input, and extends text-oriented RL codebases to medical multimodal scenarios. Result: TEA-CXA significantly outperforms state-of-the-art methods and a range of baselines on chest X-ray analysis; the code framework is general and applicable to medical RL research with multimodal, multi-turn tool calling. Conclusion: Explicitly modeling tool trustworthiness and combining it with multimodal reinforcement learning effectively mitigates conflicts between medical AI tools and improves the agent's decision robustness and clinical applicability. Abstract: AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools' realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.

[40] Pseudo-View Enhancement via Confidence Fusion for Unposed Sparse-View Reconstruction

Beizhen Zhao,Sicheng Yu,Guanzhi Ding,Yu Hu,Hao Wang

Main category: cs.CV

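TL;DR: This paper proposes a new framework for sparse-view outdoor 3D scene reconstruction that uses bidirectional pseudo frame restoration and scene perception Gaussian management to significantly improve reconstruction completeness, geometric consistency, and stability.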

Details Motivation: Outdoor 3D scene reconstruction from unposed sparse views is very challenging because of complex lighting, scale variation, and extremely few input views; synthesizing pseudo frames directly with a diffusion model tends to introduce implausible geometry that harms reconstruction quality. Method: Proposes bidirectional pseudo frame restoration (diffusion-based synthesis guided by adjacent frames, combined with a lightweight pseudo-view deblur model and confidence mask inference) and a scene perception Gaussian management strategy (optimizing 3D Gaussians using joint depth-density information). Result: Experiments on outdoor benchmarks show substantial gains over existing methods in both fidelity and stability. Conclusion: The framework effectively mitigates geometric inconsistency and floating artifacts under extreme view sparsity, achieving high-quality, robust sparse-view outdoor reconstruction. Abstract: 3D scene reconstruction under unposed sparse viewpoints is a highly challenging yet practically important problem, especially in outdoor scenes due to complex lighting and scale variation. With extremely limited input views, directly using a diffusion model to synthesize pseudo frames introduces unreasonable geometry, which harms the final reconstruction quality. To address these issues, we propose a novel framework for sparse-view outdoor reconstruction that achieves high-quality results through bidirectional pseudo frame restoration and scene perception Gaussian management. Specifically, we introduce a bidirectional pseudo frame restoration method that restores missing content by diffusion-based synthesis guided by adjacent frames with a lightweight pseudo-view deblur model and confidence mask inference algorithm. Then we propose a scene perception Gaussian management strategy that optimizes Gaussians based on joint depth-density information. These designs significantly enhance reconstruction completeness, suppress floating artifacts and improve overall geometric consistency under extreme view sparsity. Experiments on outdoor benchmarks demonstrate substantial gains over existing methods in both fidelity and stability.

[41] IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model

Pengli Zhu,Yitao Zhu,Haowen Pang,Anqi Qiu

Main category: cs.CV

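TL;DR: This paper proposes IHF-Harmony, an invertible hierarchy flow framework trained on unpaired data for retrospective harmonization of multi-modality MRI; its invertible feature transformations preserve anatomical structure, and experiments on several MRI modalities show advantages in fidelity and downstream task performance.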

Details Motivation: Existing retrospective MRI harmonization methods scale poorly across modalities and depend on traveling-subject datasets (the same subjects scanned at multiple sites). Method: Proposes the IHF-Harmony framework, comprising an invertible hierarchy flow (IHF) that progressively removes artefact-related features and an artefact-aware normalization (AAN) that modulates features while keeping anatomy fixed, combined with anatomy and artefact consistency losses. Result: Experiments on multiple MRI modalities show that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance. Conclusion: IHF-Harmony offers a robust, high-fidelity harmonization solution for large-scale multi-site MRI studies. Abstract: Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code will be released upon acceptance.
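The claim that reversible feature transformations give a bijective mapping and lossless reconstruction can be illustrated with a single generic subtractive coupling step. This is a toy sketch of the general coupling idea only, not the IHF architecture. The `shift` function is a hypothetical stand-in for a learned network, and the input is assumed to have even length.

```python
def shift(x1):
    """Hypothetical stand-in for a learned network that predicts what to subtract."""
    return [0.5 * v * v + 1.0 for v in x1]


def coupling_forward(x):
    """Keep the first half unchanged; subtract shift(first half) from the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    y2 = [b - s for b, s in zip(x2, shift(x1))]
    return x1 + y2


def coupling_inverse(y):
    """Undo the step: the first half is untouched, so the same shift can be added back."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    x2 = [b + s for b, s in zip(y2, shift(y1))]
    return y1 + x2


features = [2.0, 4.0, 10.0, 20.0]
removed = coupling_forward(features)
restored = coupling_inverse(removed)
```

Because the first half passes through unchanged, the inverse can recompute exactly the same shift, so the step is invertible however complicated `shift` is. Stacking such steps at several scales would be one way to read "hierarchical", but the abstract does not give that level of detail.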

[42] Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

Junmyeong Lee,Hoseung Choi,Minsu Cho

Main category: cs.CV

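TL;DR: MoGaF is a framework for long-term scene extrapolation built on the 4D Gaussian Splatting representation; motion-aware Gaussian grouping and group-wise optimization model physically consistent motion in rigid and non-rigid regions, and a lightweight forecasting module produces high-quality, temporally stable predictions of dynamic scenes.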

Details Motivation: Forecasting dynamic scenes remains challenging in computer vision, because limited observations make it hard to capture coherent object-level motion and long-term temporal evolution. Method: Proposes Motion Group-aware Gaussian Forecasting (MoGaF), built on the 4D Gaussian Splatting representation, which introduces motion-aware Gaussian grouping and group-wise optimization to build a structured space-time representation, and designs a lightweight forecasting module to predict future motion. Result: On synthetic and real-world datasets, MoGaF outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability. Conclusion: By explicitly modeling motion consistency and using a structured space-time representation, MoGaF improves the realism and temporal stability of long-term dynamic scene extrapolation. Abstract: Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability. Our project page is available at https://slime0519.github.io/mogaf

[43] From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Liangbing Zhao,Le Zhuo,Sayak Paul,Hongsheng Li,Mohamed Elhoseiny

Main category: cs.CV

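TL;DR: This paper proposes the PhysicEdit framework, which builds the physical state transition dataset PhysicTran38K and combines it with a textual-visual dual-thinking mechanism to significantly improve physical realism and knowledge grounding in instruction-based image editing.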

Details Motivation: Existing instruction-based image editing models struggle to produce physically plausible results when edits involve complex causal dynamics such as refraction or material deformation, because they treat editing as a discrete mapping between image pairs and cannot capture intermediate physical state changes. Method: Reformulates physics-aware editing as a physical state prediction task; builds PhysicTran38K, a large-scale video dataset of 38K physical state transition trajectories; designs the end-to-end PhysicEdit framework, which combines a frozen Qwen2.5-VL for physical reasoning with learnable, timestep-adaptive transition queries that guide a diffusion model. Result: PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state of the art among open-source methods and remaining competitive with leading proprietary models. Conclusion: Modeling image editing as physical state transitions with explicit supervision on physical dynamics is an effective way to make edits more physically plausible; PhysicTran38K and PhysicEdit provide a new approach and practical tools for physics-driven generative modeling. Abstract: Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.

[44] SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Guibin Chen,Dixuan Lin,Jiangping Yang,Youqiang Zhang,Zhengcong Fei,Debang Li,Sheng Chen,Chaofeng Ao,Nuo Pang,Yiming Wang,Yikun Dou,Zheng Chen,Mingyuan Fan,Tuanhui Li,Mingshan Chang,Hao Zhang,Xiaopeng Sun,Jingtao Xu,Yuqiang Xie,Jiahua Wang,Zhiheng Xu,Weiming Xiong,Yuzhe Jin,Baoxuan Gu,Binjie Mao,Yunjie Yu,Jujie He,Yuhao Feng,Shiwen Tu,Chaojie Wang,Rui Yan,Wei Shen,Jingchen Wu,Peng Zhao,Xuanyue Zhong,Zhuangzhuang Liu,Kaifei Wang,Fuxiang Zhang,Weikai Xu,Wenyan Liu,Binglu Zhang,Yu Shen,Tianhui Xiong,Bin Peng,Liang Zeng,Xuchen Song,Haoxiang Guo,Peiyu Wang,Yahui Zhou

Main category: cs.CV

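TL;DR: SkyReels V4 is presented as the first video foundation model to support multimodal input, joint video-audio generation, and unified generation, inpainting, and editing; it uses a dual-stream multimodal diffusion Transformer architecture and achieves high-fidelity, cinema-level generation at 1080p, 32 FPS, and 15-second duration.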

Details Motivation: Existing video generation models struggle to combine multimodal conditioning, synchronized video-audio generation, and unified modeling of generation, inpainting, and editing, and are computationally expensive at high resolution and long duration. Method: Proposes a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture in which the video and audio branches share an MMLM-based text encoder; the video side uses channel concatenation to unify a range of inpainting tasks; a joint strategy generates low-resolution full sequences and high-resolution keyframes, followed by super-resolution and frame interpolation models for efficiency. Result: Accepts text, images, video clips, masks, and audio as input; generates synchronized video and audio at 1080p, 32 FPS, and 15 seconds; reaches cinema-level quality and strong generalization across generation, inpainting, and editing. Conclusion: SkyReels V4 is the first efficient, high-quality video foundation model to unify multimodal conditioning, joint video-audio generation, and generation, inpainting, and editing, offering a new approach to general video understanding and generation. Abstract: SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multimodal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multimodal instruction-following capability with in-context learning in the video branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing under a single interface, and naturally extends to vision-referenced inpainting and editing via multimodal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. 
To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.

[45] UniVBench: Towards Unified Evaluation for Video Foundation Models

Jianhui Wei,Xiaotian Zhang,Yichen Li,Yuan Wang,Yan Zhang,Ziyi Chen,Zhihang Tang,Wei Xu,Zuozhu Liu

Main category: cs.CV

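TL;DR: This paper proposes UniVBench, a unified benchmark designed for evaluating video foundation models across video understanding, generation, editing, and a newly proposed video reconstruction task, together with UniV-Eval, a unified agentic evaluation system that supports fair, scalable, and reproducible multi-task evaluation.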

Details Motivation: Existing benchmarks for video models are fragmented, each targets a single task, and they use simple videos, so they do not reflect the unified, multi-capability integration that video foundation models aim for. Method: Builds the UniVBench benchmark of 200 high-quality multi-shot videos covering four core abilities; proposes the new video reconstruction task; develops the unified agentic evaluation system UniV-Eval, which standardizes prompting, instruction parsing, and scoring; all videos are human-created and validated and come with rich annotations. Result: UniVBench substantially raises the complexity and realism of evaluation and is the first to support unified, instruction-based evaluation on multi-shot videos; UniV-Eval keeps evaluation closely aligned with human judgment and is fair, scalable, and reproducible. Conclusion: UniVBench and UniV-Eval together form the first systematic evaluation framework for the integrated capabilities of video foundation models, providing key infrastructure for research on robust video intelligence. Abstract: Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. 
Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.

[46] DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

Yanbin Wei,Jiangyue Yan,Chun Kang,Yang Chen,Hua Liu,James Kwok,Yu Zhang

Main category: cs.CV

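TL;DR: This paper proposes the DynamicGTR framework, which dynamically selects the best graph topology representation (GTR) for each query, improving how vision-language models (VLMs) balance accuracy and brevity in zero-shot graph question answering and showing strong transferability across tasks, domains, and models.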

Details Motivation: In graph understanding and question answering, existing VLMs are limited by a single fixed graph topology representation (GTR) that ignores model-specific and task-specific preferences, leading to inaccurate or overly long responses. Method: Proposes the DynamicGTR framework, which at inference time selects the most suitable GTR for each query (for example, an image or a text form) and supports a customizable trade-off between accuracy and brevity. Result: DynamicGTR significantly improves VLM performance on graph algorithm QA and transfers what it learned on synthetic tasks zero-shot to real-world graph tasks such as link prediction and node classification without additional training; it generalizes strongly across tasks, domains, and models. Conclusion: DynamicGTR is a flexible, general framework for graph-enhanced VLMs that overcomes the limits of a one-size-fits-all GTR strategy and offers a new approach to zero-shot graph understanding and QA. Abstract: Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on one single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries. To address this, we propose the DynamicGTR framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.

[47] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

Lingfeng Ren,Weihao Yu,Runpeng Yu,Xinchao Wang

Main category: cs.CV

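TL;DR: This paper investigates the main cause of object hallucination in Large Vision-Language Models (LVLMs), finds that it stems mostly from the strong priors of the language decoder, and proposes NoLan, a training-free decoding method that dynamically suppresses language priors and significantly reduces object hallucination across several LVLMs and tasks.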

Details Motivation: To determine whether object hallucination in LVLMs is caused mainly by the vision encoder or by the language decoder. Method: Designs systematic experiments to analyze the roles of the two components; proposes the training-free NoLan framework, which dynamically suppresses language priors based on the difference between the output distributions for multimodal and text-only inputs. Result: NoLan significantly reduces object hallucination on benchmarks such as POPE, improving the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. Conclusion: Object hallucination stems mainly from the strong priors of the language decoder rather than from the vision encoder; NoLan is a simple, effective, plug-and-play mitigation. Abstract: Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
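The abstract gives no formula for NoLan, so the following is only a generic contrastive-decoding sketch of the idea it describes: next-token scores from the multimodal input are pushed away from the scores obtained with text only, with a strength that depends on how much the two distributions differ. The KL-based modulation rule, the `base_alpha` parameter, and all function names are assumptions made for illustration and should not be read as the released method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def suppress_language_prior(mm_logits, text_logits, base_alpha=1.0):
    """Contrast multimodal scores against text-only scores.

    Assumed modulation: alpha shrinks as the two distributions diverge, so the
    correction is strongest when the multimodal output still looks like the prior.
    """
    p_mm = softmax(mm_logits)
    p_txt = softmax(text_logits)
    alpha = base_alpha / (1.0 + kl_divergence(p_mm, p_txt))
    adjusted = [
        math.log(pm) + alpha * (math.log(pm) - math.log(pt))
        for pm, pt in zip(p_mm, p_txt)
    ]
    return softmax(adjusted)


# Hypothetical two-token vocabulary: the image favors token 0, the text prior token 1.
refined = suppress_language_prior([2.0, 1.0], [1.0, 2.0])
```

When the two inputs produce identical logits the correction term vanishes and the multimodal distribution is returned unchanged. Any token the image supports more strongly than the text prior gains probability.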

[48] Solaris: Building a Multiplayer Video World Model in Minecraft

Georgy Savva,Oscar Michel,Daohan Lu,Suppakit Waiwitlikhit,Timothy Meehan,Dhairya Mishra,Srivats Poddar,Jack Lu,Saining Xie

Main category: cs.CV

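TL;DR: This paper presents Solaris, a multiplayer video world model with consistent multi-view observations, along with a multiplayer data collection system that gathered 12.64 million frames of multiplayer video in games such as Minecraft; it also proposes a multi-dimensional evaluation framework and a staged training strategy, including the newly proposed Checkpointed Self Forcing, that significantly improves performance.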

Details Motivation: Existing action-conditioned video generation models are limited to a single-agent perspective and cannot model the multi-agent interactions of the real world. Method: Builds a multiplayer data system that supports coordinated multi-agent interaction and synchronized video and action capture; collects 12.64 million frames of multiplayer game data; designs an evaluation framework covering multiplayer movement, memory, grounding, building, and view consistency; uses a staged training pipeline (from single-player to multiplayer) that combines bidirectional, causal, and Self Forcing training; introduces the memory-efficient Checkpointed Self Forcing to support a longer-horizon teacher. Result: Solaris outperforms existing baselines on multiple evaluation metrics; the data system and models are open-sourced to advance multi-agent world models. Conclusion: Solaris achieves multiplayer video world modeling with consistent multi-view observations, and its data system, evaluation framework, and training strategy lay new groundwork for research on multi-agent world models. Abstract: Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized video and action capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.

[49] Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences

Julian Kaltheuner,Hannah Dröge,Markus Plack,Patrick Stotko,Reinhard Klein

Main category: cs.CV

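TL;DR: Neu-PiG is a fast, drift-free method for dynamic 3D surface reconstruction. It uses a multi-scale preconditioned latent-grid encoding parameterized by the position and normal direction of a keyframe surface, combined with Sobolev-preconditioned optimization, to produce high-fidelity reconstructions in seconds.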

Details Motivation: Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point clouds is difficult over long sequences; existing methods suffer from accumulated drift, long runtimes, or reliance on category-specific training. Method: Neu-PiG builds a multi-resolution latent-grid encoding parameterized by the surface position and normal direction of a single keyframe; the encoding is augmented with time modulation and decoded into per-frame 6-DoF deformations by a lightweight MLP; Sobolev preconditioning is applied during gradient-based optimization, with no explicit correspondences or additional priors. Result: On diverse human and animal datasets, Neu-PiG outperforms state-of-the-art methods in accuracy and scales better to long sequences; it runs at least 60x faster than existing training-free methods, with inference speed on the same order as heavy pretrained models. Conclusion: Neu-PiG delivers efficient, general, drift-free dynamic surface reconstruction and offers a new paradigm for real-time modeling of long dynamic sequences. Abstract: Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multilayer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-of-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pretrained models.