cs.CL

[1] ReportLogic: Evaluating Logical Quality in Deep Research Reports

Jujia Zhao, Zhaoxin Huan, Zihan Wang, Xiaolu Zhang, Jun Zhou, Suzan Verberne, Zhaochun Ren

Main category: cs.CL

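TL;DR: This paper introduces ReportLogic, a benchmark for evaluating the logical quality of LLM-generated research reports with an emphasis on auditability, and builds a human-annotated dataset and an open-source logic evaluator, LogicJudge.

Motivation: Existing evaluation frameworks overlook the logical reliability of LLM-generated research reports (whether claims and arguments are explicitly supported, traceable, and verifiable), even though users rely on such reports for deep research and decision-making. An evaluation method that targets logical quality is needed.

Method: Proposes ReportLogic, a reader-centric, auditability-based framework for evaluating report logical quality, with a three-level taxonomy: Macro-Logic (structural coherence), Expositional-Logic (contextual sufficiency), and Structural-Logic (explicit claim-support). Builds a human-annotated dataset, trains the open-source evaluator LogicJudge, and tests judge robustness with adversarial attacks.

Result: Off-the-shelf LLM judges are easily swayed by superficial cues such as verbosity, and reasoning modes can mask broken support relations. LogicJudge shows better robustness. The results offer practical guidance for building more reliable logic evaluators and improving the logical reliability of reports.

Conclusion: Logical quality is central to the practical reliability of LLM deep research reports. ReportLogic provides a systematic, scalable evaluation paradigm for this dimension and moves evaluation from "fluency" toward "auditability".

Abstract: Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim-support (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity), and reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.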

[2] ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification

Siran Liu, Cyril Y. He

Main category: cs.CL

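TL;DR: This paper proposes ConfSpec, a confidence-gated cascaded verification framework that exploits the good calibration of small models on step-level verification to substantially speed up chain-of-thought inference without losing accuracy.

Motivation: Chain-of-thought reasoning improves LLM performance on complex tasks but incurs high inference latency because of long generation traces. Existing step-level speculative reasoning methods face a long-standing trade-off among accuracy, inference speed, and resource efficiency.

Method: ConfSpec uses a small draft model for step-level verification, relying on its good calibration within its competence range. High-confidence verification decisions are accepted directly, and only uncertain cases are escalated to the large target model. No external judge model is needed, and the approach is orthogonal to token-level speculative decoding.

Result: Across diverse workloads, ConfSpec achieves up to 2.24$\times$ end-to-end speedup while matching target-model accuracy. It needs no external judge model and can be stacked with token-level speculative decoding for multiplicative speedups.

Conclusion: ConfSpec resolves the inherent trade-off among accuracy, speed, and resource efficiency in step-level speculative reasoning and offers a new paradigm for efficient chain-of-thought inference.

Abstract: Chain-of-Thought reasoning significantly improves the performance of large language models on complex tasks, but incurs high inference latency due to long generation traces. Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off. Our key insight is an asymmetry between generation and verification: while generating a correct reasoning step requires substantial model capacity, step-level verification is a constrained discriminative task for which small draft models are well-calibrated within their competence range, enabling high-confidence draft decisions to be accepted directly while selectively escalating uncertain cases to the large target model. Evaluation across diverse workloads shows that ConfSpec achieves up to 2.24$\times$ end-to-end speedups while matching target-model accuracy. Our method requires no external judge models and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.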
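The gating rule described for ConfSpec can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy verifiers, and the 0.9 threshold are all hypothetical, and the abstract does not say how the draft model's confidence is computed.

```python
from typing import Callable, Tuple

# Hypothetical signatures: the draft verifier returns (verdict, confidence),
# the target verifier returns only a verdict.
DraftVerifier = Callable[[str], Tuple[bool, float]]
TargetVerifier = Callable[[str], bool]


def gated_verify(
    step: str,
    draft_verify: DraftVerifier,
    target_verify: TargetVerifier,
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> Tuple[bool, str]:
    """Return (accepted, decided_by) for one drafted reasoning step."""
    verdict, confidence = draft_verify(step)
    if confidence >= threshold:
        # High-confidence draft decision: accept it without the large model.
        return verdict, "draft"
    # Uncertain case: escalate to the large target model.
    return target_verify(step), "target"


def toy_draft_verify(step: str) -> Tuple[bool, float]:
    """Toy stand-in: the draft model is unsure about steps ending in '?'."""
    return (True, 0.55) if step.endswith("?") else (True, 0.97)


def toy_target_verify(step: str) -> bool:
    """Toy stand-in for the large model's verdict."""
    return not step.endswith("?")
```

With these toy verifiers, a confident step is decided by the draft model alone, while an uncertain one costs a call to the target model. The fraction of escalated steps is what governs the speedup in a cascade of this shape.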

[3] INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya

Main category: cs.CL

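TL;DR: This paper presents INSURE-Dial, the first benchmark for compliance-aware voice agents that audit insurance-benefit verification phone calls. It contains real and synthetic call data and defines two new tasks: phase boundary detection and compliance verification.

Motivation: Address the large economic losses (about 1 trillion USD per year) caused by administrative phone tasks such as insurance-benefit verification in U.S. healthcare, and improve the compliance-auditing ability of voice agents in real healthcare call settings.

Method: Builds the INSURE-Dial benchmark with 50 real AI-initiated calls and 1,000 synthetic calls. All calls are annotated in a phase structure (IVR navigation, patient identification, coverage status, and so on) and labeled for Information and Procedural compliance under explicit ask/answer logic. Two evaluation tasks are defined: phase boundary detection and compliance verification.

Result: Small, low-latency models score well on per-phase metrics, but end-to-end reliability is limited by phase-boundary errors. On real calls, full-call exact segmentation accuracy is low, revealing a marked gap between conversational fluency and audit-grade evidence precision.

Conclusion: INSURE-Dial provides a key benchmark and evaluation framework for developing auditable, compliant healthcare voice agents and shows that precise phase segmentation is necessary for reliable compliance verification.

Abstract: Administrative phone tasks drain roughly 1 trillion USD annually from U.S. healthcare, with over 500 million insurance-benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.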

[4] Prompt Optimization Via Diffusion Language Models

Shiyu Wang, Haolin Chen, Liangwei Yang, Jielin Qiu, Rithesh Murthy, Ming Zhu, Zixiang Chen, Silvio Savarese, Caiming Xiong, Shelby Heinecke, Huan Wang

Main category: cs.CL

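TL;DR: This paper proposes a diffusion-based prompt optimization framework that uses Diffusion Language Models (DLMs) to iteratively refine system prompts through masked denoising. It needs no gradient access and no changes to the downstream LLM, and it markedly improves the performance of a frozen LLM (e.g., GPT-4o-mini) on multiple benchmarks.

Motivation: Existing prompt optimization methods often depend on gradient information or require modifying the target model, which limits flexibility and generality. This work aims for a gradient-free, model-agnostic method that supports fine-grained (span-level) updates.

Method: A DLM-based prompt optimization framework that conditions on interaction traces (user queries, model responses, and optional feedback) and performs iterative mask-and-denoise prompt updates, supporting flexible span-level edits without target-LLM gradients.

Result: On benchmarks including τ-bench, SST-2, and SST-5, DLM-optimized prompts markedly improve a frozen LLM (e.g., GPT-4o-mini). A moderate number of diffusion steps gives the best balance between refinement quality and stability.

Conclusion: Diffusion offers a general, model-agnostic, scalable paradigm for prompt optimization, and the results support its potential as an efficient, stable, training-free prompt-engineering tool.

Abstract: We propose a diffusion-based framework for prompt optimization that leverages Diffusion Language Models (DLMs) to iteratively refine system prompts through masked denoising. By conditioning on interaction traces, including user queries, model responses, and optional feedback, our method enables flexible, span-level prompt updates without requiring gradient access or modifying the downstream language model. Across diverse benchmarks (e.g., $τ$-bench, SST-2, SST-5), DLM-optimized prompts consistently improve the performance of a frozen target LLM (e.g., GPT-4o-mini). We further show that moderate diffusion step counts provide the best balance between refinement quality and stability. These results highlight diffusion-based prompt optimization as a general, model-agnostic, and scalable approach for enhancing LLM performance through iterative prompt refinement.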

[5] Asymptotic Semantic Collapse in Hierarchical Optimization

Faruk Alpay, Bugra Kilictas

Main category: cs.CL

TL;DR: This paper proposes a theory of "asymptotic semantic collapse": in multi-agent language systems, a dominant node drives peripheral nodes toward semantic uniformity through hierarchical optimization, so that information entropy vanishes and a path-independent consensus forms.

Motivation: Address a failure mode in multi-agent language systems in which a shared dominant context absorbs individual semantics and agent behavior becomes uniform.

Method: Models semantic states as points on a Riemannian manifold and analyzes the projection dynamics induced by repeated interactions in a closed linguistic setting with a dominant anchor node of effectively infinite inertia. The theoretical derivation combines information theory with differential geometry, and a dataset-free benchmark is run on an RWKV-7 13B model.

Result: Shows that the converged outcome is independent of the optimization path (gradient-style and stochastic updates reach the same topological endpoint) and that higher context dependence lowers node entropy until it vanishes. The experiments report Jaccard similarity of 0.295 (greedy decoding) and 0.224 (stochastic decoding).

Conclusion: Asymptotic semantic collapse describes an irreversible semantic consensus mechanism that constrains agents to a shared semantic grammar, offering a unified geometric and information-theoretic view of group behavior in large language models.

Abstract: Multi-agent language systems can exhibit a failure mode where a shared dominant context progressively absorbs individual semantics, yielding near-uniform behavior across agents. We study this effect under the name Asymptotic Semantic Collapse in Hierarchical Optimization. In a closed linguistic setting with a Dominant Anchor Node whose semantic state has effectively infinite inertia, we show that repeated interactions with Peripheral Agent Nodes drive an asymptotic alignment that minimizes a global loss. We model semantic states as points on a Riemannian manifold and analyze the induced projection dynamics. Two consequences follow. First, the limiting semantic configuration is insensitive to the optimization history: both smooth gradient-style updates and stochastic noisy updates converge to the same topological endpoint, establishing path independence at convergence. Second, the degree of context dependence controls information content: moving from atomic (independent) representations to fully entangled (context-bound) representations forces the node entropy, interpreted as available degrees of freedom, to vanish in the limit. The theory connects information-theoretic quantities with differential-geometric structure and suggests an interpretation as an immutable consensus rule that constrains agents to a shared semantic grammar. A lightweight dataset-free benchmark on an RWKV-7 13B GGUF checkpoint complements the analysis, reporting zero hash collisions, mean compliance of 0.50 under greedy decoding and 0.531 under stochastic decoding, and final Jaccard-to-anchor similarity values of 0.295 and 0.224, respectively.
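For readers unfamiliar with the metric, the Jaccard-to-anchor similarity can be sketched as below. The abstract does not say which units are compared, so the choice of lowercase whitespace tokens is an assumption made only for illustration.

```python
def jaccard(agent_text: str, anchor_text: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over lowercase whitespace tokens.

    Tokenisation is an assumption; the paper may compare other units.
    """
    agent_tokens = set(agent_text.lower().split())
    anchor_tokens = set(anchor_text.lower().split())
    if not agent_tokens and not anchor_tokens:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(agent_tokens & anchor_tokens) / len(agent_tokens | anchor_tokens)
```

Under this definition, a value near 0.3 means that roughly three in ten distinct tokens across the two texts are shared.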

[6] The Million-Label NER: Breaking Scale Barriers with GLiNER bi-encoder

Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov

Main category: cs.CL

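TL;DR: This paper proposes GLiNER-bi-Encoder, a new architecture for named entity recognition (NER) that keeps zero-shot generalization while markedly improving industrial-scale inference efficiency. By decoupling the label encoder from the context encoder, it removes the quadratic-complexity bottleneck of the original model, supports recognition of thousands to millions of entity types, and reaches 61.5% Micro-F1 on CrossNER. It also delivers up to 130 times higher throughput and extends to GLiNKER, an entity-linking framework for large knowledge bases such as Wikidata.

Motivation: The original GLiNER joint encoding has quadratic time complexity in the number of entity labels, which makes it hard to scale to industrial label sets. A new architecture is needed that combines zero-shot ability with efficient inference.

Method: A bi-encoder architecture separates label encoding from context encoding and precomputes fixed label embeddings, so labels are not re-encoded at every inference call. The modular entity-linking framework GLiNKER is built on top of it.

Result: Zero-shot Micro-F1 of 61.5% on the CrossNER benchmark, the current state of the art. At 1024 labels, throughput is up to 130 times higher than the uni-encoder. The architecture supports efficient entity linking over large knowledge bases such as Wikidata.

Conclusion: GLiNER-bi-Encoder removes the efficiency bottleneck of label scaling in NER without sacrificing zero-shot performance, providing a scalable, high-performance paradigm for deploying very large entity recognition and linking tasks.

Abstract: This paper introduces GLiNER-bi-Encoder, a novel architecture for Named Entity Recognition (NER) that harmonizes zero-shot flexibility with industrial-scale efficiency. While the original GLiNER framework offers strong generalization, its joint-encoding approach suffers from quadratic complexity as the number of entity labels increases. Our proposed bi-encoder design decouples the process into a dedicated label encoder and a context encoder, effectively removing the context-window bottleneck. This architecture enables the simultaneous recognition of thousands, and potentially millions, of entity types with minimal overhead. Experimental results demonstrate state-of-the-art zero-shot performance, achieving 61.5 percent Micro-F1 on the CrossNER benchmark. Crucially, by leveraging pre-computed label embeddings, GLiNER-bi-Encoder achieves up to a 130 times throughput improvement at 1024 labels compared to its uni-encoder predecessors. Furthermore, we introduce GLiNKER, a modular framework that leverages this architecture for high-performance entity linking across massive knowledge bases such as Wikidata.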
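The benefit of precomputing label embeddings can be shown with a small sketch. It does not use the GLiNER library: `toy_embed` is a hypothetical stand-in for a neural encoder (character trigram counts), and `LabelIndex` is an illustrative name. The point is only that labels are encoded once, and each query span then costs one encoder call plus dot products.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an encoder: L2-normalised character trigram counts."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()}


def dot(u: Vector, v: Vector) -> float:
    """Sparse dot product, iterating over the smaller vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(k, 0.0) for k, w in u.items())


class LabelIndex:
    """Encodes every label once at construction time, then reuses the vectors."""

    def __init__(self, labels: Sequence[str]) -> None:
        self.labels: List[str] = list(labels)
        self.vectors: List[Vector] = [toy_embed(label) for label in self.labels]

    def best_label(self, span: str) -> Tuple[str, float]:
        """Score one text span against all cached label vectors."""
        span_vec = toy_embed(span)  # the only encoder call per query
        scores = [dot(span_vec, vec) for vec in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.labels[best], scores[best]
```

In a joint-encoding design the labels would pass through the encoder together with every input text; here adding labels grows only the cached list and the number of dot products.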

[7] Luna-2: Scalable Single-Token Evaluation with Small Language Models

Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth

Main category: cs.CL

TL;DR: This paper presents Luna-2, a lightweight, deterministic, real-time evaluation architecture built on small language models (SLMs) as a replacement for costly and slow LLM-as-a-judge methods. On metrics such as toxicity and hallucination it matches or exceeds the accuracy of frontier LLMs while cutting cost by 80x and latency by 20x.

Motivation: Existing LLM-as-a-judge methods are slow, expensive, and non-deterministic in real-time guardrail settings, so they cannot meet industrial needs for real-time, cheap, stable evaluation.

Method: A shared decoder-only SLM backbone with a lightweight LoRA/PEFT adapter head per evaluation metric (e.g., toxicity, hallucination), which allows hundreds of metrics to run concurrently on a single GPU and supports local deployment.

Result: On content-safety and hallucination benchmarks, accuracy matches or exceeds state-of-the-art LLM evaluators, with inference cost reduced by over 80x and latency by over 20x. In production the system serves over 100M AI sessions per month, processes over 100B tokens, and saves over 30 million USD in annual evaluation cost.

Conclusion: Luna-2 shows that small language models with structured fine-tuning can take on complex evaluation tasks efficiently and reliably, offering a scalable, low-cost, low-latency paradigm for real-time AI guardrails.

Abstract: Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g., toxicity, hallucination, tool selection quality) with accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture and training methodology, and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers with eval cost savings of over $30M annually.
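One common way to obtain a deterministic score from a single decoding step is to compare the logits of two label tokens instead of sampling a multi-token verdict. The sketch below illustrates that general idea only; the abstract does not describe Luna-2's scoring head, and the `yes`/`no` label tokens and the function name are assumptions.

```python
import math
from typing import Mapping


def single_token_score(
    next_token_logits: Mapping[str, float],
    positive: str = "yes",  # assumed label token
    negative: str = "no",   # assumed label token
) -> float:
    """Probability of the positive label from one forward pass.

    A softmax restricted to the two label tokens: no sampling is involved,
    so the same logits always give the same score.
    """
    z_pos = next_token_logits[positive]
    z_neg = next_token_logits[negative]
    z_max = max(z_pos, z_neg)  # subtract the max for numerical stability
    e_pos = math.exp(z_pos - z_max)
    e_neg = math.exp(z_neg - z_max)
    return e_pos / (e_pos + e_neg)
```

Thresholding the returned probability yields a binary verdict, and the cost per metric is one forward pass rather than a generated explanation.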

[8] DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning

Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham, Pei Zhou, Mengting Wan, Alex Stein, Virginia Estellers, Charles Chen, Morris Sharp, Richard Speyer, Tadas Baltrusaitis, Jennifer Neville, Eunsol Choi, Longqi Yang

Main category: cs.CL

TL;DR: This paper proposes Differentially Private Reinforcement Fine-Tuning (DP-RFT), which uses DP-protected nearest-neighbor votes as a reward signal to train an LLM with online reinforcement learning, so that it generates high-quality synthetic text without direct access to individual private examples.

Motivation: DP synthetic data generation involves a trade-off: DP fine-tuning needs access to raw private data, while methods that avoid direct access are limited to un-finetuned models and therefore have low domain fidelity. This work asks how to train an LLM to generate high-quality synthetic text without seeing individual private examples.

Method: The DP-RFT algorithm uses DP-protected nearest-neighbor votes from a private corpus as a reward signal and optimizes the LLM generation policy with PPO online reinforcement learning, enabling "eyes-off" synthetic data generation.

Result: DP-RFT is validated on long-form and domain-specific tasks (news, meeting transcripts, medical abstracts). Its generated data markedly narrows the gap to DP fine-tuning in fidelity and downstream utility while strictly respecting the privacy boundary.

Conclusion: DP-RFT offers a new paradigm for efficiently generating high-fidelity, high-utility DP synthetic text without access to individual private examples, balancing privacy protection with model performance.

Abstract: Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet they still require the raw content of private examples for model training. On the other hand, methods that avoid direct exposure to private data are bounded by an off-the-shelf, un-finetuned model, whose outputs often lack domain fidelity. Can we train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples? In this work, we introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. DP-RFT leverages DP-protected nearest-neighbor votes from an eyes-off private corpus as a reward signal for on-policy synthetic samples generated by an LLM. The LLM iteratively learns to generate synthetic data to maximize the expected DP votes through Proximal Policy Optimization (PPO). We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts. Our experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.
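A nearest-neighbor voting reward of the kind described can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: embeddings are plain tuples, distance is squared Euclidean, the noise is Gaussian, and calibrating `noise_std` to a privacy budget is left out entirely.

```python
import random
from typing import List, Sequence

Vec = Sequence[float]


def sq_dist(u: Vec, v: Vec) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dp_vote_rewards(
    private_vecs: Sequence[Vec],
    synthetic_vecs: Sequence[Vec],
    noise_std: float,  # assumed to be calibrated elsewhere to the privacy budget
    rng: random.Random,
) -> List[float]:
    """Noisy vote histogram used as a per-sample reward.

    Each private example casts exactly one vote for its nearest synthetic
    sample; Gaussian noise is added to every count before the counts are
    released, so only the noisy histogram leaves the private side.
    """
    votes = [0.0] * len(synthetic_vecs)
    for p in private_vecs:
        nearest = min(
            range(len(synthetic_vecs)),
            key=lambda i: sq_dist(p, synthetic_vecs[i]),
        )
        votes[nearest] += 1.0
    return [count + rng.gauss(0.0, noise_std) for count in votes]
```

In a reinforcement learning loop these noisy counts would serve as the rewards for the on-policy samples; the policy update itself (PPO in the paper) is outside this sketch.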

[9] PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

Nina Hosseini-Kivanani

Main category: cs.CL

TL;DR: This paper presents PolyFrame, a system for multimodal idiom disambiguation that freezes large multimodal encoders and trains only lightweight modules. It markedly improves performance on English and Portuguese and performs robustly in a blind test across 15 languages.

Motivation: Multimodal models struggle with idiomatic expressions whose meaning is non-compositional, and the problem is more pronounced in multilingual settings.

Method: PolyFrame uses a unified pipeline for image+text ranking (Subtask A) and text-only caption ranking (Subtask B). It keeps CLIP-style multimodal encoders and the multilingual BGE M3 encoder frozen and trains only lightweight modules: logistic regression, an LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda fusion.

Result: 60.0% Top-1 on the English dev set and 60.0% Top-1 (NDCG@5 of 0.822) in zero-shot transfer to Portuguese. On the 15-language blind test, average Top-1/NDCG is 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B. Ablations show that idiom-aware rewriting contributes most.

Conclusion: Effective multilingual multimodal idiom disambiguation is achievable with lightweight modules alone, without fine-tuning large multimodal encoders.

Abstract: Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision-language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
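Borda rank fusion, one of the lightweight modules named above, combines several rankings by awarding points for position. The sketch below shows the standard Borda count; how PolyFrame weights or breaks ties is not stated in the abstract, so the alphabetical tie-break is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def borda_fuse(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Fuse rankings with the Borda count.

    In a ranking of n candidates, position 0 earns n - 1 points and the
    last position earns 0. Candidates are returned by total points, with
    ties broken alphabetically (an assumption for determinism).
    """
    points: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            points[candidate] += n - 1 - position
    return sorted(points, key=lambda c: (-points[c], c))
```

For example, fusing `["a", "b", "c"]`, `["b", "a", "c"]`, and `["b", "c", "a"]` gives `b` 5 points, `a` 3, and `c` 1, so the fused order is `b`, `a`, `c`. Because only ranks are used, scorers with incomparable score scales (image similarity and text similarity, say) can be combined without normalisation.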

[10] From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions

Saif M. Mohammad

Main category: cs.CL

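TL;DR: This paper builds the first large-scale lexicon of descriptive norms of anxiety associations for English multiword expressions (MWEs), covering more than 20,000 MWEs. It verifies the lexicon's high reliability, analyzes distributional patterns and compositionality, and supports research in psychology, NLP, and related fields.

Motivation: Existing work focuses on anxiety associations at the single-word level and lacks a systematic study of larger linguistic units such as multiword expressions. A reliable, large-scale MWE anxiety lexicon is needed to support interdisciplinary research.

Method: Builds a lexicon of descriptive norms of anxiety associations for more than 20,000 English MWEs, assesses its reliability through human annotation and statistical analysis, and analyzes the distribution of anxiety- and calmness-associated expressions across MWE lengths as well as their compositionality.

Result: The lexicon is highly reliable. The distribution of anxiety- and calmness-associated MWEs differs across two-, three-, and four-word sequences. For most MWEs the anxiety association is not fully determined by the constituent words (that is, it is not fully compositional).

Conclusion: The lexicon fills the gap in MWE-level anxiety language resources and provides a basic tool and a new perspective for anxiety-related research in psychology, natural language processing, public health, and the social sciences.

Abstract: Anxiety is the unease about a possible future negative outcome. In recent years, there has been growing interest in understanding how anxiety relates to our health, well-being, body, mind, and behaviour. This includes work on lexical resources for word-anxiety association. However, there is very little anxiety-related work on larger units of text such as multiword expressions (MWE). Here, we introduce the first large-scale lexicon capturing descriptive norms of anxiety associations for more than 20k English MWEs. We show that the anxiety associations are highly reliable. We use the lexicon to study prevalence of different types of anxiety- and calmness-associated MWEs; and how that varies across two-, three-, and four-word sequences. We also study the extent to which the anxiety association of MWEs is compositional (due to its constituent words). The lexicon enables a wide variety of anxiety-related research in psychology, NLP, public health, and social sciences. The lexicon is freely available: https://saifmohammad.com/worrylex.html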

[11] Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM

Md Badsha Biswas, Ozlem Uzuner

Main category: cs.CL

TL;DR: This paper presents a new open-domain claim verification (ODCV) system that uses large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis. By aggregating evidence from Wikipedia, PubMed, and Google for both the original claim and its negation, it improves verification performance and interpretability and reveals reasoning differences across knowledge sources.

Motivation: Most existing automated claim verification systems rely on a single knowledge source and ignore disagreement between sources, which leads to limited knowledge coverage and low transparency.

Method: An LLM-based ODCV system that uses multi-perspective retrieval (original and negated claims), collects evidence from multiple sources (Wikipedia, PubMed, Google), deduplicates and aggregates it across sources, and combines LLM verification with model confidence analysis to quantify and visualize cross-source disagreement.

Result: Experiments on four benchmark datasets with five LLMs show that knowledge aggregation markedly improves verification accuracy and reveals the reasoning patterns and degree of disagreement specific to each knowledge source.

Conclusion: Combining diversity, contradiction, and multi-source evidence aggregation is essential for building reliable and transparent claim verification systems.

Abstract: The spread of misinformation across digital platforms can pose significant societal risks. Claim verification (fact-checking) systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and utilize the supporting evidence from that source; they ignore the disagreement of their source with others. This limits their knowledge coverage and transparency. To address these limitations, we present a novel system for open-domain claim verification (ODCV) that leverages large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis. Our approach introduces a novel retrieval strategy that collects evidence for both the original and the negated forms of a claim, enabling the system to capture supporting and contradicting information from diverse sources: Wikipedia, PubMed, and Google. These evidence sets are filtered, deduplicated, and aggregated across sources to form a unified and enriched knowledge base that better reflects the complexity of real-world information. This aggregated evidence is then used for claim verification using LLMs. We further enhance interpretability by analyzing model confidence scores to quantify and visualize inter-source disagreement. Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning. Our findings underscore the importance of embracing diversity, contradiction, and aggregation in evidence for building reliable and transparent claim verification systems.
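The dual-perspective retrieval and aggregation step can be sketched as below. The retrievers are toy stand-ins, not real Wikipedia, PubMed, or Google clients; the function and field names are hypothetical; and the negated claim is passed in by the caller because the abstract does not say how negation is produced.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]


def gather_evidence(
    claim: str,
    negated_claim: str,
    sources: Dict[str, Retriever],
) -> List[Dict[str, str]]:
    """Query every source with both claim forms and deduplicate the passages.

    Each kept passage records its source and which query form retrieved it
    ("original" or "negated"); duplicates are detected after lowercasing
    and collapsing whitespace.
    """
    seen = set()
    evidence: List[Dict[str, str]] = []
    for perspective, query in (("original", claim), ("negated", negated_claim)):
        for name, retrieve in sources.items():
            for passage in retrieve(query):
                key = " ".join(passage.lower().split())
                if key in seen:
                    continue
                seen.add(key)
                evidence.append(
                    {"source": name, "perspective": perspective, "text": passage}
                )
    return evidence


def toy_wiki(query: str) -> List[str]:
    """Toy retriever: returns a passage only for the affirmative wording."""
    return ["Aspirin reduces fever."] if "reduces" in query else []


def toy_pubmed(query: str) -> List[str]:
    """Toy retriever: returns the same two passages for any query."""
    return [
        "aspirin  reduces fever.",
        "Aspirin does not reduce fever in all patients.",
    ]
```

Keeping the source name on each passage is what would later allow per-source verdicts to be compared, which is the basis for any disagreement analysis.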

[12] Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift

Stephen Russell

Main category: cs.CL

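TL;DR: This paper proposes a unified formal framework, $S_t=(X,d_t,P_t)$, that brings multiple semantic drift signals (embedding displacement, neighbor changes, distributional divergence, recursive trajectory instability) into one time-indexed substrate. It models node-level drift, coarse Ricci curvature, and recursive drift through local diffusion and embedding geometry, and introduces "bridge mass" as a predictor of future neighborhood rewiring.

Motivation: Existing semantic drift research lacks a unified theory that explains the multiple observed signals. A testable formal framework that integrates geometric and probabilistic dynamics is needed.

Method: Constructs the time-indexed substrate $S_t=(X,d_t,P_t)$ and defines node-level neighborhood drift, coarse Ricci curvature (characterizing the local contractivity of semantic diffusion), and recursive drift. Proposes "bridge mass" as an aggregate measure of negative curvature and states falsifiable modeling assumptions and test contracts.

Result: Establishes the first theoretical model that unifies several classes of semantic drift signals in one formal substrate, derives that bridge mass can predict neighborhood rewiring, and lists the empirically testable assumptions and test conditions.

Conclusion: The framework gives a unified geometric and probabilistic view of semantic drift with an emphasis on falsifiability and testability. The theory is complete; empirical validation is left to future work.

Abstract: Most semantic drift studies report multiple signals (e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability) without a shared explanatory theory that relates them. This paper proposes a formalization of these signals in one time-indexed substrate, $S_t=(X,d_t,P_t)$, combining embedding geometry with local diffusion. Within this substrate, node-level neighborhood drift measures changes in local conditional distributions, coarse Ricci curvature measures local contractivity of semantic diffusion, and recursive drift probes stability of iterated semantic operators. This manuscript specifies the formal model, assumptions, and tests that can refute the model. Herein, the paper introduces bridge mass, a node-level aggregate of incident negative curvature, as a predictor of future neighborhood rewiring. This paper provides the theory and test contracts; empirical performance is deferred to subsequent studies.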
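One plausible reading of "a node-level aggregate of incident negative curvature" is the sum, over a node's edges, of the negative part of each edge's curvature. The sketch below implements that reading as an assumption; the paper may use a weighted or normalised aggregate, and the edge curvatures are taken as given rather than computed from $P_t$.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def bridge_mass(edge_curvature: Dict[Edge, float]) -> Dict[Hashable, float]:
    """Sum of max(0, -kappa) over the edges incident to each node.

    Assumed definition for illustration: positively curved edges contribute
    nothing, and each negatively curved edge adds |kappa| to both endpoints.
    """
    mass: Dict[Hashable, float] = {}
    for (u, v), kappa in edge_curvature.items():
        negative_part = max(0.0, -kappa)
        for node in (u, v):
            mass[node] = mass.get(node, 0.0) + negative_part
    return mass
```

Under this reading, a node that sits on several negatively curved edges, such as a word that links otherwise separate neighborhoods, accumulates a high bridge mass.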

[13] ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang

Main category: cs.CL

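TL;DR: This paper proposes ReHear, a framework that places an instruction-tuned, audio-aware large language model (LLM) inside the self-training loop to refine pseudo-labels iteratively, mitigating the confirmation bias and error accumulation that pseudo-label noise causes in semi-supervised ASR.

Motivation: Conventional pseudo-labeling in semi-supervised ASR is prone to confirmation bias and error accumulation because the noisy supervision is of poor quality.

Method: The ReHear framework introduces into the self-training loop an instruction-tuned, audio-aware LLM that sees both the ASR hypothesis and the source audio and iteratively corrects the pseudo-labels. The corrected, high-fidelity pseudo-labels are then used to fine-tune the ASR model.

Result: On multiple benchmarks, ReHear suppresses error propagation and consistently outperforms both supervised and conventional pseudo-labeling baselines.

Conclusion: An instruction-tuned LLM that fuses audio and text information can markedly improve pseudo-label quality, and ReHear offers a more robust and efficient self-training paradigm for semi-supervised ASR.

Abstract: Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.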

[14] Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Lichang Song, Ting Long, Yi Chang

Main category: cs.CL

TL;DR: This paper proposes CoRAG, a framework that recasts retrieval-augmented generation (RAG) as a cooperative multi-agent decision-making problem, so that the reranker and the generator are jointly optimized as peer collaborators, improving generation stability and generalization.

Motivation: Existing RAG systems follow a ranking-centric, asymmetric dependency paradigm in which generation quality depends heavily on the reranking results, which limits overall performance.

Method: Models RAG as a cooperative multi-agent decision-making problem and designs the CoRAG framework, in which the reranker and the generator act as peer agents and are jointly optimized toward a shared task objective.

Result: Trained on only about 10K PopQA samples, CoRAG shows good generalization and more stable generation.

Conclusion: By breaking the asymmetric dependency of conventional RAG, CoRAG achieves close cooperation between reranking and generation and offers a more robust and flexible RAG paradigm for knowledge-intensive tasks.

Abstract: Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built on a ranking-centric, asymmetric dependency paradigm, where the generation quality of the generator is highly dependent on the reranking results of the reranker. To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-makers rather than being connected through an asymmetric dependency pipeline. By jointly optimizing their behaviors toward a shared task objective, the reranker and generator are encouraged to cooperate, ensuring that document reranking and generation work in concert to improve the final response. Experimental results demonstrate good generalization and improved generation stability of CoRAG, even when the model is trained on only around 10K PopQA samples. Our model is released at https://anonymous.4open.science/r/CoRAG-D63F

[15] ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan

Main category: cs.CL

TL;DR: ArabicNumBench is a comprehensive benchmark for evaluating large language models on Arabic number reading, covering Eastern Arabic-Indic numerals and Western Arabic numerals. The study tests 71 models with 4 prompting strategies on 210 tasks, for 59,010 test cases in total, and finds that numerical accuracy and structured-output ability are clearly separate.

Motivation: Large language models lack a systematic evaluation of Arabic number understanding (especially reading and writing numbers in multiple forms), and in particular a benchmark that covers both numerical accuracy and structured-output ability.

Method: Builds the ArabicNumBench benchmark with 210 Arabic number reading tasks across six contextual categories (pure numerals, addresses, dates, quantities, prices, and so on). Evaluates 71 models with four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT), records accuracy and the share of structured output, and analyzes the effectiveness of extraction methods.

Result: Accuracy varies widely (14.29% to 99.05%). Few-shot CoT performs best (80.06%), far above zero-shot (28.76%). High-accuracy models often fail to produce structured output (only 6 models do so consistently). Numerical accuracy and instruction following (structured output) are two independent capabilities.

Conclusion: Arabic number understanding should be evaluated on both numerical correctness and structured-response ability. ArabicNumBench provides reproducible baselines and practical model-selection guidance for Arabic NLP systems.

Abstract: We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured output across all test cases, while the majority require fallback extraction methods despite high numerical accuracy. Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.
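The two numeral systems the benchmark covers differ only in glyphs, which a short sketch makes concrete. The helper name is hypothetical and unrelated to the benchmark's code. Eastern Arabic-Indic digits occupy Unicode code points U+0660 to U+0669 and are written here as escapes to avoid bidirectional-text issues.

```python
# Eastern Arabic-Indic digits are the ten code points U+0660 .. U+0669.
EASTERN_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
TO_WESTERN = str.maketrans(EASTERN_DIGITS, "0123456789")


def to_western_digits(text: str) -> str:
    """Replace Eastern Arabic-Indic digits with their Western counterparts.

    All other characters pass through unchanged.
    """
    return text.translate(TO_WESTERN)


# The digits 2, 0, 2, 4 in Eastern Arabic-Indic form.
year_eastern = "\u0662\u0660\u0662\u0664"
year_western = to_western_digits(year_eastern)
```

The reported totals are also mutually consistent: 281 model-strategy combinations times 210 tasks is 59,010 test cases, and 80.06 / 28.76 is about 2.78, which rounds to the stated 2.8x.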

[16] BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

Main category: cs.CL

TL;DR: BURMESE-SAN is the first comprehensive LLM benchmark for Burmese, covering seven subtasks across three competencies: understanding, reasoning, and generation. It stresses native-speaker-driven construction and cultural authenticity, and it shows that model architecture, representation, and instruction tuning matter more than simply scaling parameters.

Motivation: Address the lack of a systematic evaluation benchmark for Burmese as a low-resource language, fill the gaps in existing NLP tasks for Burmese, and examine its modeling difficulties (limited pretraining coverage, rich morphology, syntactic variation).

Method: Builds the Burmese benchmark BURMESE-SAN with seven subtasks using a native-speaker-led data construction process, runs a large-scale evaluation of open-weight and commercial LLMs, and analyzes the factors that affect performance.

Result: Burmese performance depends more on model architecture, language representation, and instruction tuning than on model scale alone. Southeast Asia regional fine-tuning and newer model generations bring substantial gains.

Conclusion: BURMESE-SAN provides a reproducible, culturally appropriate evaluation standard for Burmese and other low-resource languages and encourages targeted optimization strategies such as regional fine-tuning.

Abstract: We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY

[17] Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma

Main category: cs.CL

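TL;DR: This paper proposes a psychologically grounded metacognitive framework based on Ann Brown's regulatory cycle (planning, monitoring, evaluation). Using structured prompting and a lightweight dual-process MetaController, it improves the self-diagnosis and error-correction ability of large language models, markedly raising error-diagnosis accuracy and self-correction success on several benchmarks and earning a strong trust preference in human evaluation.

Motivation: Large language models (LLMs) reason well, but their ability to monitor, diagnose, and correct their own errors remains limited. Their reliability and interpretability need to improve.

Method: Builds a metacognitive prompting architecture based on Ann Brown's regulatory cycle (plan, monitor, evaluate) and integrates it into a lightweight dual-process MetaController for adaptive effort allocation. The approach is validated with multi-task benchmarks and blinded evaluation on Llama-3 and Qwen-3 (8B).

Result: On GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA, error diagnosis improves markedly and the rate of successful self-correction triples. In 580 blinded comparisons, 84% of human judgments prefer the method over standard and chain-of-thought baselines for trustworthiness and metacognitive self-awareness.

Conclusion: Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.

Abstract: Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.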

[18] EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation

Adam Dejl,Jonathan Pearson

Main category: cs.CL

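TL;DR: This paper presents EvalSense, a framework for building domain-specific LLM evaluation suites. An interactive guide and automated meta-evaluation tools improve the reliability and suitability of the chosen evaluation methods.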

Details Motivation: Traditional statistical metrics are poorly suited to open-ended generation tasks, while existing LLM-based evaluation methods are sensitive to the choice of model, prompt, and parameters, making them prone to bias and misconfiguration. Method: Design the EvalSense framework, which comprises (1) an interactive guide for selecting evaluation methods and (2) automated meta-evaluation tools based on perturbed data; the framework supports multiple model providers and a range of evaluation strategies. Result: A case study on clinical note generation demonstrates the effectiveness of EvalSense, improving the reliability and domain fit of the evaluation methods. Conclusion: EvalSense is a flexible, extensible, and practical LLM evaluation framework that helps reduce evaluation bias and increases confidence in LLM deployments in sensitive domains. Abstract: Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods for their specific use-cases. This is achieved through two unique components: (1) an interactive guide aiding users in evaluation method selection and (2) automated meta-evaluation tools that assess the reliability of different evaluation approaches using perturbed data. We demonstrate the effectiveness of EvalSense in a case study involving the generation of clinical notes from unstructured doctor-patient dialogues, using a popular open dataset. All code, documentation, and assets associated with EvalSense are open-source and publicly available at https://github.com/nhsengland/evalsense.

[19] DeepInnovator: Triggering the Innovative Capabilities of LLMs

Tianyu Fan,Fengji Zhang,Yuxiang Zheng,Bei Chen,Xinyao Niu,Chengen Huang,Junyang Lin,Chao Huang

Main category: cs.CL

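TL;DR: This paper proposes DeepInnovator, a framework that systematically elicits original innovative capability in large language models through structured extraction of research knowledge and a "Next Idea Prediction" training paradigm; it significantly outperforms baseline models in both automatic and expert evaluations.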

Details Motivation: Existing approaches rely on complex prompt engineering and lack a systematic training paradigm for giving LLMs the ability to autonomously generate novel and significant research ideas. Method: Propose a training framework with two components: first, an automated data extraction pipeline that extracts structured research knowledge from a large corpus of unlabeled scientific literature; second, a "Next Idea Prediction" training paradigm that models research idea generation as an iterative process of predicting, evaluating, and refining. Result: DeepInnovator-14B substantially outperforms untrained baselines in automatic and expert evaluations (win rates of 80.53%-93.81%), performs comparably to current leading LLMs, and the dataset and code will be open-sourced. Conclusion: DeepInnovator provides a scalable training pathway toward research agents with genuine original innovative capability, advancing the automation of scientific discovery. Abstract: The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative capability of LLMs. Our approach comprises two core components. (1) "Standing on the shoulders of giants". We construct an automated data extraction pipeline to extract and organize structured research knowledge from a vast corpus of unlabeled scientific literature. (2) "Conjectures and refutations". We introduce a "Next Idea Prediction" training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining plausible and novel next ideas. Both automatic and expert evaluations demonstrate that our DeepInnovator-14B significantly outperforms untrained baselines, achieving win rates of 80.53%-93.81%, and attains performance comparable to that of current leading LLMs. This work provides a scalable training pathway toward building research agents with genuine, originative innovative capability, and we will open-source the dataset to foster community advancement. Source code and data are available at: https://github.com/HKUDS/DeepInnovator.

[20] Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

Abhinaba Basu

Main category: cs.CL

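TL;DR: This paper introduces the W5H2 framework and the NyayaBench v2 dataset. Structured intent decomposition combined with lightweight fine-tuning (SetFit) markedly improves cache-key consistency and precision, sharply reducing the LLM call costs of personal AI agents.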

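Details Motivation: Existing caching methods (such as GPTCache and APC) achieve low accuracy on real benchmarks (0-37.9%) because they wrongly equate cache effectiveness with classification accuracy, whereas what matters is the consistency and precision of cache keys. Method: Model cache-key evaluation as a clustering problem and use V-measure decomposition to quantify homogeneity and completeness; propose the W5H2 structured intent decomposition framework; validate it on the multilingual NyayaBench v2 dataset (8,514 entries, 63 languages); use a SetFit model with only 8 examples per class for fast, accurate prediction; build a five-tier cascade cache and integrate RCPS to provide risk-controlled selective prediction guarantees. Result: W5H2 with SetFit reaches 91.1% ± 1.7% accuracy on MASSIVE (in about 2 ms), far above GPTCache (37.9%) and a 20B-parameter LLM (68.8% at 3,447 ms); on NyayaBench v2 (20 classes) it reaches 55.3%, with cross-lingual transfer across 30 languages; the five-tier cascade handles 85% of interactions locally, for a projected 97.5% cost reduction. Conclusion: Cache effectiveness should be measured primarily by key consistency and precision; the W5H2 framework, which combines few-shot learning with structured intent modeling, can efficiently support low-cost, highly reliable personal AI agents. Abstract: Personal AI agents incur substantial cost via repeated LLM calls. We show existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property -- cache effectiveness requires key consistency and precision, not classification accuracy. We observe cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1% ± 1.7% on MASSIVE in ~2ms -- vs 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.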
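The V-measure decomposition mentioned in the abstract can be illustrated with a small, dependency-free sketch. It treats the true intents as classes and the generated cache keys as clusters, then reports homogeneity, completeness, and their harmonic mean. Reading homogeneity as a proxy for key precision and completeness as a proxy for key consistency is an assumption made here for illustration, not the paper's stated definition; the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_intents, cache_keys):
    """Return (homogeneity, completeness, v) for cache keys viewed as clusters."""
    h_class = _entropy(true_intents)
    h_key = _entropy(cache_keys)
    # Homogeneity: each cache key holds requests from a single intent.
    homogeneity = 1.0 if h_class == 0 else 1 - _conditional_entropy(true_intents, cache_keys) / h_class
    # Completeness: all requests with one intent map to the same cache key.
    completeness = 1.0 if h_key == 0 else 1 - _conditional_entropy(cache_keys, true_intents) / h_key
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Toy example: two intents, but every request gets its own cache key.
intents = ["set_alarm", "set_alarm", "play_music", "play_music"]
over_split_keys = ["k0", "k1", "k2", "k3"]
print(v_measure(intents, over_split_keys))
```

In the toy example the keys never mix intents, so homogeneity is perfect, but paraphrases of one intent never share a key, so completeness drops and a cache built on these keys would rarely hit.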

[21] Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

Toheeb Aduramomi Jimoh,Tabea De Wille,Nikola S. Nikolov

Main category: cs.CL

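TL;DR: This paper introduces Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá. It contains 436 instances annotated by three native speakers under a culturally informed annotation protocol and achieves high agreement (Fleiss' κ = 0.7660), supporting sarcasm detection and culturally aware NLP research for low-resource African languages.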

Details Motivation: Sarcasm detection is challenging in computational semantics, and even more so in low-resource languages such as Yorùbá that lack annotated data. Method: Build Yor-Sarc, the first Yorùbá sarcasm detection dataset, annotated by three native speakers from different dialectal backgrounds following a culturally adapted annotation protocol, and carry out a rigorous inter-annotator agreement analysis (Fleiss' κ and Cohen's κ). Result: Agreement is substantial to almost perfect (Fleiss' κ = 0.7660; pairwise Cohen's κ = 0.6732–0.8743); 83.3% of instances reach unanimous agreement and the remaining 16.7% are kept as majority-vote soft labels; one annotator pair reaches almost perfect agreement (κ = 0.8743). Conclusion: Yor-Sarc fills a data gap in sarcasm detection for low-resource African languages, and its annotation protocol and agreement analysis offer a replicable methodology for work on other African languages. Abstract: Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds using an annotation protocol specifically designed for Yorùbá sarcasm by taking culture into account. This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' κ = 0.7660; pairwise Cohen's κ = 0.6732–0.8743), with 83.3% unanimous consensus. One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works. The remaining 16.7% majority-agreement cases are preserved as soft labels for uncertainty-aware modelling. Yor-Sarc (https://github.com/toheebadura/yor-sarc) is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.
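For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Fleiss' κ using the standard textbook formula, together with one possible way of turning non-unanimous votes into soft labels. The toy vote counts, the column order, and the `soft_label` helper are illustrative assumptions and are not taken from the Yor-Sarc release.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of annotators giving item i label j.

    Assumes every item was rated by the same number of annotators (at least 2)
    and that more than one label is used overall.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_labels = len(counts[0])
    # Overall share of assignments that went to each label.
    p_label = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    # Per-item agreement: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_label)
    return (p_observed - p_expected) / (1 - p_expected)


def soft_label(row):
    """Fraction of annotators who chose the second column (here: 'sarcastic')."""
    return row[1] / sum(row)


# Hypothetical votes from three annotators; columns are [not sarcastic, sarcastic].
votes = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(votes))
print([soft_label(row) for row in votes])
```

Unanimous rows yield soft labels of 0 or 1, while a 2-to-1 split yields 1/3 or 2/3, which keeps annotator uncertainty available to a downstream model.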

[22] Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Yonathan Ron,Shiri Gilboa,Tammuz Dubnov

Main category: cs.CL

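TL;DR: This paper presents Whisper: Courtside Edition, a multi-agent large language model (LLM) pipeline that improves Whisper speech recognition without retraining. LLM agents for domain context identification, named entity recognition, and jargon detection generate compact prompts that guide Whisper's decoder, achieving a 17.0% relative reduction in word error rate on an NBA basketball commentary dataset.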

Details Motivation: Domain-specific speech remains a challenge for existing ASR systems such as Whisper, which calls for low-cost, scalable domain adaptation methods. Method: Build a multi-agent LLM pipeline that takes Whisper's initial transcript, applies modules for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder, with no model retraining. Result: On 421 NBA commentary audio segments, the best pipeline lowers word error rate (WER) from 0.217 to 0.180 (a 17.0% relative reduction, p<0.001); 40.1% of segments improve and only 7.1% degrade, substantially outperforming direct transcript post-editing. Conclusion: Prompt-based augmentation enables scalable ASR domain adaptation and is a practical alternative to costly fine-tuning. Abstract: Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology), our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). Improvements are observed in 40.1% of segments with degradation in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.
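The WER figures above can be made concrete with a short sketch. It computes word error rate as word-level edit distance divided by the reference length (the standard definition, with whitespace tokenization as a simplifying assumption) and then the relative reduction between two WER values. The example sentences and function names are hypothetical and not drawn from the paper's data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumes a non-empty reference and plain whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


# One substituted word out of four reference words.
print(word_error_rate("the guard hits a", "the card hits a"))
# Relative reduction implied by the reported WER values.
print(relative_reduction(0.217, 0.180))
```

With the rounded WER values from the abstract, the relative reduction comes out at roughly 17%, in line with the reported figure up to rounding of the published numbers.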

[23] Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

Wilson Y. Lee

Main category: cs.CL

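TL;DR: This paper argues that most language-agent failures stem from reliability problems (stochastic drift) rather than insufficient capability. Experiments on the Toolathlon benchmark show that successful runs stay closer to the canonical solution path and that deviations are self-reinforcing; a restart monitor based on mid-trajectory path adherence raises success rates by 8.8 percentage points.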

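Details Motivation: Explain why language agents frequently fail on tasks they are capable of solving, distinguish reliability failures from capability failures, and investigate the underlying mechanism. Method: Run a natural experiment on the Toolathlon benchmark, analyzing 515 model×task units (cases where the same model succeeds on some runs and fails on others for the same task) drawn from 22 frontier models on 108 real-world tool-use tasks; quantify the Jaccard similarity between trajectories and the canonical solution path; test the temporal pattern and self-reinforcing effect of deviations; and carry out six robustness checks and an intervention experiment. Result: Successful runs adhere significantly more closely to the canonical solution path than failed runs (Jaccard +0.060, p<0.0001); the adherence gap is not significant over the first 50% of the trajectory and is self-reinforcing thereafter, with each deviation raising the probability of the next deviation by 22.7 percentage points; restarting the worst third of runs based on mid-trajectory adherence raises the success rate by 8.8 percentage points. Conclusion: The reliability problem of language agents is essentially gradual, self-reinforcing path drift caused by stochastic sampling and cannot be solved by scaling model capability alone; reliability-specific strategies such as dynamic monitoring and retry mechanisms are needed. Abstract: Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model$\times$task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs ($+0.060$ Jaccard, $p<0.0001$, $n=488$ units, 95% CI [+0.043, +0.077]). This result survives six robustness checks including cross-model-family leave-one-out validation.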
Critically, the causal mechanism is gradual and self-reinforcing: the adherence gap is statistically indistinguishable from zero through the first 50% of the trajectory, ruling out early-branching selection bias, and each off-canonical tool call raises the probability that the next call is also off-canonical by 22.7 percentage points ($\hat{\beta}=+0.227$, $p<0.0001$), more than doubling the baseline rate. These findings imply that agent reliability cannot be improved by capability scaling alone, but offer a highly actionable intervention: a simple monitor that restarts the bottom tercile of runs based on mid-trajectory canonical adherence lifts success rates by $+8.8$ percentage points among intervened runs.
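The adherence measure and the restart monitor described above can be sketched in a few lines. This sketch assumes the canonical path is available as a plain set of tool names and that adherence is the Jaccard similarity between that set and the tools a run has called by the midpoint; the paper's exact construction may differ. The tool names, run identifiers, and helper functions are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, compared as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def runs_to_restart(adherence_by_run):
    """Return the run ids in the bottom third by adherence, lowest first."""
    ranked = sorted(adherence_by_run, key=adherence_by_run.get)
    return ranked[: len(ranked) // 3]


# Hypothetical canonical path and mid-trajectory tool calls for three runs.
canonical = {"search", "read_file", "write_file", "send_email"}
midpoint_calls = {
    "run_a": ["search", "read_file", "write_file"],
    "run_b": ["search", "browse", "browse", "retry"],
    "run_c": ["search", "read_file"],
}

adherence = {run: jaccard(calls, canonical) for run, calls in midpoint_calls.items()}
print(adherence)
print(runs_to_restart(adherence))
```

Here `run_b` has drifted into off-canonical tools and is the one selected for a restart, while the two runs that stay on the canonical set continue.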

[24] Uncovering Context Reliance in Unstructured Knowledge Editing

Zisheng Zhou,Mengqi Zhang,Shiguang Wu,Xiaotian Ye,Chi Zhang,Zhumin Chen,Pengjie Ren

Main category: cs.CL

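TL;DR: This paper proposes COIN, a framework that makes editing large language models with unstructured knowledge more robust by reducing context reliance.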

Details Motivation: Address the Context Reliance problem in existing LLM editing methods based on next-token prediction (NTP): edited knowledge becomes highly dependent on a specific context, so recall fails when that context is absent at inference time. Method: Identify Context Reliance and show theoretically that it arises because gradient-based optimization binds acquired knowledge to the contextual representation; propose the COIN framework, which encourages the model to focus on local knowledge rather than memorizing contextual patterns. Result: COIN reduces Context Reliance by 45.2% and improves editing success rate by 23.6% over strong baselines. Conclusion: Mitigating Context Reliance is essential for robust knowledge editing in large language models. Abstract: Editing large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge. In this work, we revisit the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured editing. We identify Context Reliance as a critical failure mode of NTP-based approaches, where knowledge acquired from edited text becomes highly dependent on its preceding context, leading to recall failures when that context is absent during inference. This hypothesis is supported by our empirical validation that prepending context during inference recovers knowledge recall. We further theoretically demonstrate that Context Reliance is an inherent consequence of gradient-based optimization, which tends to bind acquired knowledge to a specific aggregated contextual representation. To address this, we propose a simple yet effective COntext-INdependent editing framework (COIN), encouraging the model to focus on knowledge within local scope rather than memorizing contextual patterns. Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.

[25] IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

Yinhan He,Yaochen Zhu,Mingjia Shi,Wendy Zheng,Lin Su,Xiaoqing Wang,Qi Guo,Jundong Li

Main category: cs.CL

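TL;DR: This paper proposes IAPO, a framework that assigns token-level advantages based on conditional mutual information to make reasoning optimization information-aware, improving accuracy while substantially shortening reasoning.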

Details Motivation: Long chain-of-thought methods improve accuracy but incur high inference costs, and sequence-level reward shaping offers limited control over how reasoning effort is allocated across tokens. Method: Propose IAPO, an information-theoretic post-training framework that computes token-level advantages from each token's conditional mutual information with the final answer, thereby identifying informative reasoning steps and suppressing low-utility exploration. Result: A theoretical analysis shows that IAPO can monotonically reduce reasoning length without harming correctness; experiments show higher accuracy on multiple reasoning datasets with reasoning length reduced by up to 36%, outperforming existing token-efficient RL methods. Conclusion: Information-aware advantage shaping is a powerful and general direction for token-efficient post-training. Abstract: Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.

[26] Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

Chenhang Cui,An Zhang,Yuxin Chen,Gelei Deng,Jingnan Zheng,Zhenkai Liang,Xiang Wang,Tat-Seng Chua

Main category: cs.CL

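TL;DR: This paper finds that large language models (LLMs) and large vision-language models (LVLMs) share a large number of neurons during multi-step inference. It proposes SNRF, a parameter-efficient low-rank fusion method that transfers mature inference ability from LLMs to LVLMs, markedly improving their inference performance without harming perception.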

Details Motivation: Despite rapid progress, LVLMs still lag behind text-only LLMs on tasks that require multi-step inference and compositional decision-making; motivated by their shared transformer architectures, the authors investigate whether the two model families share internal inference computation across modalities. Method: Analyze, at the neuron level, the activation overlap between LLMs and LVLMs during multi-step inference; use causal probing (activation amplification) to verify that shared neurons are conceptually interpretable and functionally relevant; on this basis propose the Shared Neuron Low-Rank Fusion (SNRF) framework, which identifies shared neurons from cross-model activation profiles, computes a low-rank approximation of the weight differences, and injects updates only within the shared subspace. Result: More than 50% of the top-activated inference neurons are shared between LLMs and LVLMs; SNRF consistently improves LVLM inference performance on mathematics and perception benchmarks while preserving perceptual capabilities, with no large-scale multimodal fine-tuning. Conclusion: LLMs and LVLMs share a modality-invariant inference subspace, and shared neurons form an interpretable bridge for cross-model capability transfer; SNRF enables low-cost, efficient transfer of inference ability. Abstract: Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during multi-step inference are shared between representative LLMs and LVLMs, revealing a modality-invariant inference subspace. Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept-level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross-model activations to identify shared neurons, computes a low-rank approximation of inter-model weight differences, and injects these updates selectively within the shared-neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large-scale multimodal fine-tuning. Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models. Our code is available at https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons.
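The neuron-overlap measurement described above can be illustrated with a toy sketch. It assumes that neurons in the two models are index-aligned (for example, because the LVLM was built on the same LLM backbone) and that each neuron is summarized by a mean activation over inference prompts; both are simplifying assumptions, and the paper's profiling procedure may differ. The activation values and function names are hypothetical.

```python
def top_k_units(mean_activations, k):
    """Indices of the k units with the largest mean activation."""
    order = sorted(
        range(len(mean_activations)),
        key=lambda i: mean_activations[i],
        reverse=True,
    )
    return set(order[:k])


def shared_top_units(llm_activations, lvlm_activations, k):
    """Return the shared top-k unit indices and the fraction of top-k that is shared."""
    shared = top_k_units(llm_activations, k) & top_k_units(lvlm_activations, k)
    return shared, len(shared) / k


# Hypothetical mean activations for six index-aligned neurons in each model.
llm = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05]
lvlm = [0.85, 0.6, 0.75, 0.1, 0.2, 0.3]

shared, fraction = shared_top_units(llm, lvlm, k=3)
print(sorted(shared), fraction)
```

In this toy case two of the three top units coincide. A set like `shared` is the kind of index set within which a method such as SNRF would restrict its low-rank weight updates.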

[27] TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes

Roman Egger

Main category: cs.CL

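TL;DR: TriTopic is a new topic modeling framework built on a tri-modal graph that fuses semantic embeddings, TF-IDF, and metadata. By combining hybrid graph construction, consensus Leiden clustering, and iterative refinement, it markedly improves stability, precision, and coverage, and achieves the highest NMI on every benchmark dataset tested.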

Details Motivation: Address key shortcomings of existing topic modeling methods such as BERTopic: stochastic instability, loss of lexical precision ("Embedding Blur"), and reliance on a single data perspective. Method: Propose the TriTopic framework: (1) hybrid graph construction based on Mutual kNN and Shared Nearest Neighbors; (2) Consensus Leiden clustering for stable, reproducible partitions; (3) Iterative Refinement that sharpens embeddings through dynamic centroid-pulling; (4) archetype-based topic representations defined by boundary cases, replacing the traditional "average document" representation. Result: TriTopic achieves the highest NMI on all four datasets, 20 Newsgroups, BBC News, AG News, and Arxiv (mean 0.575), well above BERTopic (0.513), NMF (0.416), and LDA (0.299); it covers 100% of the corpus with no outliers; it is released as an open-source PyPI library. Conclusion: Through tri-modal fusion and several technical innovations, TriTopic systematically overcomes core limitations of mainstream topic modeling methods and makes substantial gains in accuracy, robustness, and interpretability. Abstract: Topic modeling extracts latent themes from large text collections, but leading approaches like BERTopic face critical limitations: stochastic instability, loss of lexical precision ("Embedding Blur"), and reliance on a single data perspective. We present TriTopic, a framework that addresses these weaknesses through a tri-modal graph fusing semantic embeddings, TF-IDF, and metadata. Three core innovations drive its performance: hybrid graph construction via Mutual kNN and Shared Nearest Neighbors to eliminate noise and combat the curse of dimensionality; Consensus Leiden Clustering for reproducible, stable partitions; and Iterative Refinement that sharpens embeddings through dynamic centroid-pulling. TriTopic also replaces the "average document" concept with archetype-based topic representations defined by boundary cases rather than centers alone. In benchmarks across 20 Newsgroups, BBC News, AG News, and Arxiv, TriTopic achieves the highest NMI on every dataset (mean NMI 0.575 vs. 0.513 for BERTopic, 0.416 for NMF, 0.299 for LDA), guarantees 100% corpus coverage with 0% outliers, and is available as an open-source PyPI library.
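The Mutual kNN idea named in the abstract can be shown with a minimal sketch: an edge between two documents is kept only when each appears in the other's k-nearest-neighbor list. For brevity the sketch uses one-dimensional points and absolute distance; real document embeddings would be high-dimensional and would typically use cosine distance. This is a generic illustration of mutual kNN, not TriTopic's implementation, and the function names are hypothetical.

```python
def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[i] - points[j]))
    return set(others[:k])


def mutual_knn_edges(points, k):
    """Undirected edges (i, j) with i < j where i and j are in each other's kNN lists."""
    neighbors = [k_nearest(points, i, k) for i in range(len(points))]
    return {
        (i, j)
        for i in range(len(points))
        for j in neighbors[i]
        if i < j and i in neighbors[j]
    }


# Three nearby points and one distant outlier.
points = [0.0, 0.1, 0.3, 5.0]
print(sorted(mutual_knn_edges(points, k=2)))
```

The outlier at 5.0 lists two of the clustered points as its nearest neighbors, but none of them lists it back, so it receives no edges. This one-sided filtering is how a mutual kNN graph suppresses noisy links.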

[28] Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

Seong Hah Cho,Junyi Li,Anna Leshinskaya

Main category: cs.CL

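TL;DR: This paper investigates whether large language models (LLMs) can distinguish three kinds of "good": moral, grammatical, and economic. It finds value entanglement, in which moral value unduly influences grammatical and economic judgments, and shows that targeted ablation of morality-related activation vectors repairs the problem.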

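Details Motivation: Human value representations distinguish between different kinds of value (such as moral, grammatical, and economic). Because value alignment requires empirically measuring the value representations that LLMs have actually acquired, it is necessary to test whether LLMs make similar distinctions. Method: Probe model behavior, embeddings, and residual stream activations to analyze how LLMs represent moral, grammatical, and economic "good", and use selective ablation of activation vectors to test causality. Result: Value entanglement is pervasive in LLMs: grammatical and economic valuations are overly influenced by moral value relative to human norms; ablating the activation vectors associated with morality mitigates the entanglement. Conclusion: Current LLM value representations do not clearly separate different kinds of value and are systematically entangled; mechanism-level interventions (such as targeted activation steering) are needed to make value alignment more fine-grained. Abstract: Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. Among the characteristics of value representation in humans is that they distinguish among value of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuation were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.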

[29] Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models

Kainan Liu,Yong Zhang,Ning Cheng,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao

Main category: cs.CL

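TL;DR: This paper proposes Astra, a new parameter-efficient fine-tuning method that builds task-adaptive low-rank adapters from the subspace spanned by the tail eigenvectors of the model's output activations, improving performance and convergence speed while using fewer parameters.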

Details Motivation: LoRA and its variants under-exploit the activation subspaces corresponding to tail eigenvectors, which leads to suboptimal fine-tuning performance. Method: Astra estimates the tail eigenvectors of the model's output activations from a small task-specific calibration set and constrains parameter updates to the subspace they span, thereby constructing low-rank adapters. Result: Astra consistently outperforms existing PEFT methods across 16 NLU and NLG benchmarks and in some scenarios even surpasses full fine-tuning (FFT). Conclusion: The tail-eigenvector subspace of the activation space carries important task information, and by exploiting it Astra achieves more efficient and stronger parameter-efficient fine-tuning. Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA, are widely used for adapting pre-trained models to downstream tasks due to their computational and storage efficiency. However, in the context of LoRA and its variants, the potential of activation subspaces corresponding to tail eigenvectors remains substantially under-exploited, which may lead to suboptimal fine-tuning performance. In this work, we propose Astra (Activation-Space Tail-Eigenvector Low-Rank Adaptation), a novel PEFT method that leverages the tail eigenvectors of the model output activations (estimated from a small task-specific calibration set) to construct task-adaptive low-rank adapters. By constraining updates to the subspace spanned by these tail eigenvectors, Astra achieves faster convergence and improved downstream performance with a significantly reduced parameter budget. Extensive experiments across natural language understanding (NLU) and natural language generation (NLG) tasks demonstrate that Astra consistently outperforms existing PEFT baselines across 16 benchmarks and even surpasses full fine-tuning (FFT) in certain scenarios.

[30] How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders

Michael McCoubrey,Angelo Salatino,Francesco Osborne,Enrico Motta

Main category: cs.CL

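TL;DR: This paper is the first to study how large language models (LLMs) represent scientific quality through monosemantic features extracted with sparse autoencoders. It identifies four key types of features related to research quality: research methodologies, publication type, high-impact fields and technologies, and specific scientific jargon.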

Details Motivation: Although prior studies show that LLMs can assess research quality to some extent, the internal mechanisms by which they represent the concept of "scientific quality" remain unclear. Method: Use sparse autoencoders to extract monosemantic features related to scientific quality from LLMs, and test under different experimental settings how well these features predict citation count, journal SJR, and journal h-index. Result: LLMs encode features associated with multiple dimensions of scientific quality, including four recurring types: research methodologies, publication type (with literature reviews typically having higher impact), high-impact fields and technologies, and specific scientific jargon. Conclusion: The study sheds light on the internal mechanisms by which LLMs represent research quality and provides a basis for understanding their role in scholarly assessment. Abstract: In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders. We derive such features under different experimental settings and assess their ability to serve as predictors across three tasks related to research quality: predicting citation count, journal SJR, and journal h-index. The results indicate that LLMs encode features associated with multiple dimensions of scientific quality. In particular, we identify four recurring types of features that capture key aspects of how research quality is represented: 1) features reflecting research methodologies; 2) features related to publication type, with literature reviews typically exhibiting higher impact; 3) features associated with high-impact research fields and technologies; and 4) features corresponding to specific scientific jargon. These findings represent an important step toward understanding how LLMs encapsulate concepts related to research quality.

[31] AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Qijie You,Wenkai Yu,Wentao Zhang

Main category: cs.CL

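TL;DR: This paper introduces AgenticRAGTracer, the first Agentic RAG benchmark constructed primarily automatically by large language models. It supports step-by-step verification of multi-hop reasoning and addresses the lack of intermediate-step annotations, the high cost of manual construction, and the poor generalization of existing benchmarks. Experiments show that current top models perform poorly on it, with failures driven mainly by distorted reasoning chains (premature collapse or over-extension), exposing a fundamental inability to match step allocation to the task's logical structure.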

Details Motivation: Existing Agentic RAG benchmarks lack intermediate hop-level question annotations, which makes it hard to locate where an agent fails; most are also manually constructed, which is costly, hard to scale, and limits generalization. Method: Introduce AgenticRAGTracer, the first Agentic RAG benchmark built primarily automatically by large language models; it spans multiple domains, contains 1,305 data points, has no overlap with mainstream benchmarks, and supports fine-grained hop-level diagnosis. Result: GPT-5 reaches only 22.6% EM accuracy on the hardest subset; hop-aware analysis shows that failures are driven mainly by distorted reasoning chains (premature collapse or over-extension), revealing that models cannot allocate reasoning steps in line with the task's logical structure. Conclusion: AgenticRAGTracer fills a gap in interpretable evaluation of multi-hop reasoning, offers a new diagnostic dimension and a reliable benchmark for Agentic RAG research, and is expected to drive meaningful progress in this area. Abstract: With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains -- either collapsing prematurely or wandering into over-extension.
This highlights a critical inability to allocate steps consistent with the task's logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.

[32] A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions

Stefanie Schneider,Miriam Göldl,Julian Stalter,Ricarda Vollmer

Main category: cs.CL

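TL;DR: This paper introduces FRAME, a manually annotated dataset for fine-grained named entity recognition (NER) and relation extraction (RE) on art-historical image descriptions. It provides three independent annotation layers (metadata, content, co-reference), 37 entity types aligned with Wikidata plus relation links, and is released in UIMA XMI format. It can be used to benchmark and fine-tune NER and RE models, including zero-shot and few-shot large language model setups.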

Details Motivation: Art-historical texts contain rich structured information, but the lack of a high-quality annotated dataset for fine-grained NER and RE in this domain has held back progress on these NLP tasks. Method: Build FRAME, a manually annotated dataset of artwork descriptions drawn from museum catalogs, auction listings, open-access platforms, and scholarly databases; annotate it in three independent layers (metadata, content, and co-reference), labeling 37 entity types and adding typed relation links between entities; align entity types with Wikidata to support named entity linking and knowledge-graph construction; release the data in UIMA XMI CAS format with accompanying images and bibliographic metadata. Result: The released FRAME dataset provides fine-grained, multi-layer, Wikidata-aligned NER and RE annotations, supports zero-shot and few-shot LLM experiments, and can serve as a benchmark and fine-tuning resource. Conclusion: FRAME fills the gap in fine-grained NER and RE datasets for art history and provides a solid basis for structured understanding of art-historical texts, knowledge-graph construction, and the adaptation of large language models to specialized domains. Abstract: This paper introduces FRAME (Fine-grained Recognition of Art-historical Metadata and Entities), a manually annotated dataset of art-historical image descriptions for Named Entity Recognition (NER) and Relation Extraction (RE). Descriptions were collected from museum catalogs, auction listings, open-access platforms, and scholarly databases, then filtered to ensure that each text focuses on a single artwork and contains explicit statements about its material, composition, or iconography. FRAME provides stand-off annotations in three layers: a metadata layer for object-level properties, a content layer for depicted subjects and motifs, and a co-reference layer linking repeated mentions. Across layers, entity spans are labeled with 37 types and connected by typed RE links between mentions. Entity types are aligned with Wikidata to support Named Entity Linking (NEL) and downstream knowledge-graph construction. The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Language Models (LLMs).

[33] Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

Wenqiu Tang,Zhen Wan,Takahiro Komamizu,Ichiro Ide

Main category: cs.CL

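TL;DR: This paper proposes a contrastive Sparse AutoEncoder (SAE) framework for fine-grained, stable, and interpretable personality control in role-playing agents. Combining the Big Five 30-facet model with a dynamic routing mechanism, it markedly improves persona consistency and dialogue quality without relying on large amounts of labeled data or repeated retraining.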

Details Motivation: Existing methods have limitations: supervised fine-tuning (SFT) requires large amounts of labeled data and lacks flexibility, while prompt engineering and RAG tend to break down in long dialogues, causing persona drift and inconsistent behavior. Method: Construct a 15,000-sample, leakage-controlled corpus with fine-grained personality annotations based on the Big Five 30-facet model; train a contrastive Sparse AutoEncoder (SAE) to learn personality control vectors in the residual space; and use a trait-activated routing module to dynamically select and inject these vectors. Result: Experiments on multiple LLMs show that the method outperforms CAA and prompt-only baselines in both persona fidelity and output quality; the combined SAE+Prompt configuration achieves the best overall performance. Conclusion: Contrastively learned latent personality vectors strengthen persona control while preserving dialogue coherence, offering a flexible, interpretable, low-overhead approach to persona modeling. Abstract: Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometimes inconsistent persona behavior. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. A new 15,000-sample leakage-controlled corpus is constructed to provide balanced supervision for each facet. The learned vectors are integrated into the model's residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.

[34] TurkicNLP: An NLP Toolkit for Turkic Languages

Sherzod Hakimov

Main category: cs.CL

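TL;DR: TurkicNLP is an open-source Python library that provides a single, consistent NLP pipeline for Turkic languages written in four scripts: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. It covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation, and its modular multi-backend architecture integrates rule-based and neural models with automatic script detection.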

Details Motivation: NLP resources for the Turkic language family (over 200 million speakers) are fragmented and lack unified tooling and standards, so an integrated solution is needed. Method: Builds the open-source Python library TurkicNLP with a modular multi-backend architecture that combines rule-based finite-state transducers with neural models; supports automatic script detection and routing; all outputs follow the CoNLL-U standard; exposes a language-agnostic API. Result: Delivers a unified NLP pipeline for Turkic languages across the four script families, covering eight core functions, with cross-script interoperability and extensibility; code and documentation are open source. Conclusion: TurkicNLP fills a gap in the Turkic NLP toolchain and supports standardized, reproducible, and collaborative language processing for the family. Abstract: Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
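The automatic script detection mentioned in the abstract could plausibly be approximated by counting letters per Unicode block. The sketch below is only an illustration of that idea and is not the TurkicNLP API: the function name `detect_script` and the `SCRIPT_RANGES` table are hypothetical, and the ranges are the standard Unicode blocks for each script family.

```python
from collections import Counter

# Standard Unicode block ranges for the four script families named in the
# abstract. The table and function names are hypothetical, not TurkicNLP's.
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x024F)],
    "Cyrillic": [(0x0400, 0x052F)],
    "Perso-Arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],
    "Old Turkic": [(0x10C00, 0x10C4F)],
}


def detect_script(text):
    """Return the script family with the most letters in `text`, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        code_point = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(low <= code_point <= high for low, high in ranges):
                counts[name] += 1
                break
    return counts.most_common(1)[0][0] if counts else None
```

A majority vote over letters keeps the sketch robust to a few embedded foreign words; a real router would also need to handle mixed-script input explicitly.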

[35] Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

Simon Münker,Nils Schwager,Kai Kugler,Michael Heseltine,Achim Rettinger

Main category: cs.CL

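TL;DR: This paper introduces a history-conditioned reply prediction task on authentic X (formerly Twitter) data and builds a new dataset to measure how LLM-generated text differs from human text in style and content. It calls for more sophisticated prompting methods and specialized datasets to make LLM proxies in computational social science more valid and authentic.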

Details Motivation: LLMs used as proxies for human participants in social science research offer scalability and cost advantages, but their "naive" application (prompting without behavioral constraints) produces significant linguistic discrepancies that threaten research validity. Method: Builds a history-conditioned reply prediction task and a matching evaluation dataset from authentic X data, and analyzes differences between LLM and human text with quantitative stylistic and content-based metrics. Result: Finds systematic linguistic discrepancies between LLM-generated and human text; current naive prompting is not sufficient to reproduce the complexity of human language. Conclusion: More sophisticated prompting techniques and specialized evaluation datasets are needed to improve the linguistic authenticity and research validity of LLMs as human proxies in computational social science. Abstract: The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.

[36] Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

Raihan Tanvir,Md. Golam Rabiul Alam

Main category: cs.CL

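TL;DR: This paper proposes the xDORA framework and the RAG-Fused DORA method, which combine multimodal encoders with retrieval-augmented reasoning and improve hateful meme detection in low-resource languages such as Bengali. It also shows that a non-parametric FAISS-based classifier is effective and that large models such as LLaVA have limited effectiveness in this setting without fine-tuning.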

Details Motivation: Hateful meme detection in low-resource languages such as Bengali faces limited annotated data, class imbalance, and pervasive code-mixing, so a more robust multimodal modeling paradigm is needed. Method: 1) Extends the BHM dataset with MIMOSA data to improve semantic diversity and class balance; 2) proposes xDORA, which combines CLIP/DINOv2 vision encoders with XGLM/XLM-R text encoders and learns cross-modal representations via weighted attention pooling; 3) builds a FAISS-based k-nearest-neighbor non-parametric classifier; 4) proposes RAG-Fused DORA, which adds retrieval-driven contextual reasoning; 5) evaluates LLaVA under zero-shot, few-shot, and retrieval-augmented prompting. Result: xDORA (CLIP + XLM-R) reaches macro-average F1 of 0.78 for hateful meme identification and 0.71 for target entity detection; RAG-Fused DORA raises these to 0.79 and 0.74; the FAISS classifier is robust on rare classes; LLaVA has limited effectiveness in few-shot settings, and retrieval augmentation yields only modest gains. Conclusion: Supervised, retrieval-augmented, and non-parametric multimodal frameworks handle the linguistic and cultural complexity of hate content detection in low-resource languages more effectively. Abstract: Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. 
In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
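The non-parametric inference step described above (a k-nearest-neighbor vote over stored embeddings) can be illustrated without FAISS. The following is a minimal brute-force sketch under stated assumptions: it uses plain cosine similarity in place of a FAISS index, and the function names, labels, and toy vectors are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, embeddings, labels, k=3):
    """Majority label among the k stored embeddings most similar to `query`.

    Brute-force stand-in for an approximate-nearest-neighbor index.
    """
    ranked = sorted(
        zip(embeddings, labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings"; real ones would come from the multimodal encoder.
stored = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
stored_labels = ["hateful", "hateful", "benign", "benign"]
prediction = knn_predict([1.0, 0.05], stored, stored_labels, k=3)
```

Because the classifier has no trained parameters, adding a new labeled example only requires appending its embedding, which is one plausible reason such a method can hold up on rare classes.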

[37] Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Maryam Amirizaniani,Alireza Salemi,Hamed Zamani

Main category: cs.CL

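TL;DR: This paper proposes PR2, a framework that uses reinforcement learning to integrate personalized retrieval with reasoning and improve the personalization of question answering systems.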

Details Motivation: Existing RAG-based personalized QA methods retrieve from the user profile using only the user's query, which leads to surface-level personalization that does not adapt deeply to the user's background and preferences. Method: Proposes PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that learns adaptive retrieval-reasoning policies: it decides when to retrieve, which profile evidence to retrieve, and how to fold that evidence into multi-step reasoning, and it optimizes multi-turn reasoning trajectories under a personalized reward function. Result: In experiments on the LaMP-QA benchmark with three LLMs, PR2 achieves an average relative improvement of 8.8% to 12% in personalized QA over strong baselines. Conclusion: By modeling retrieval and reasoning jointly and training with a personalization-driven reward, PR2 improves how deeply a QA system adapts to user background and preferences, supporting joint retrieval-reasoning optimization for personalized QA. Abstract: Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing methods use the user's query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.

[38] Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang,Yi Li,Songtao Wei,Jinxin Yang,Ayushi Kishore,Alysa Zhao,Dingyi Kang,Xu Hu,Feng Chen,Qiannan Li,Bingzhe Li

Main category: cs.CL

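TL;DR: This survey reviews agentic memory systems (MAG) for LLM agents, analyzing their current state, problems, and directions for improvement from both architectural and system perspectives.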

Details Motivation: Existing agentic memory systems lack a solid empirical foundation: benchmarks are underscaled, evaluation metrics are misaligned with semantic utility, performance depends heavily on the backbone model, and system overhead is often overlooked. Method: Proposes a taxonomy of MAG systems based on four memory structures and systematically analyzes current bottlenecks, including benchmark saturation, metric validity, backbone dependence, and latency/throughput overhead. Result: Connects memory structure to empirical limitations, explains why current systems often fall short of their theoretical promise, and points to directions for more reliable evaluation and scalable system design. Conclusion: Robust, practical agentic memory systems require coordinated improvements in evaluation methodology, metric design, model independence, and system cost modeling. Abstract: Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

[39] PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

Isun Chehreh,Ebrahim Ansari

Main category: cs.CL

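TL;DR: This paper builds the first large-scale, class-balanced Persian social media text classification dataset (36,000 posts, 4,000 for each of 9 categories), using hybrid annotation (ChatGPT plus human verification) together with semantic deduplication and generative augmentation. Across several benchmarked models, TookaBERT-Large performs best (F1 = 0.9621), providing a foundation for Persian NLP research.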

Details Motivation: Persian social media text classification lacks a large-scale, high-quality, class-balanced dataset. Method: Builds a balanced dataset of 9 categories with 4,000 samples each; collects 60,000 raw posts and, after preprocessing, annotates them with ChatGPT few-shot prompting plus human verification; applies undersampling with semantic redundancy removal and data augmentation that combines lexical replacement with generative prompting; benchmarks BiLSTM, XLM-RoBERTa (LoRA/AdaLoRA), FaBERT, SBERT, and the Persian-specific TookaBERT. Result: Transformer-based models clearly outperform traditional neural networks; TookaBERT-Large achieves the best performance (Precision = 0.9622, Recall = 0.9621, F1 = 0.9621); performance is robust across categories, with slightly lower scores for social and political texts due to semantic ambiguity. Conclusion: The study provides the first high-quality public Persian social media text classification dataset and a systematic evaluation of current models, giving Persian NLP (for example trend analysis and user modeling) a solid foundation and a practical resource. Abstract: This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. 
This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

[40] Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Jasmin Han,Janardan Devkota,Joseph Waring,Amanda Luken,Felix Naughton,Roger Vilardaga,Jonathan Bricker,Carl Latkin,Meghan Moran,Yiqun Chen,Johannes Thrul

Main category: cs.CL

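TL;DR: This study evaluates how well large language models (LLMs) predict the perceived message effectiveness (PME) of smoking cessation messages among young adult smokers. It proposes and validates an LLM digital-twin approach that incorporates individual characteristics and prior feedback, which outperforms zero/few-shot LLMs and supervised baselines on content quality, coping support, and quitting support.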

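Details Motivation: Accurately predicting how potential intervention users perceive the effectiveness (PME) of smoking cessation messages is important for optimizing personalized interventions on mobile health (mHealth) platforms, but traditional methods struggle to handle both individual differences and scarce labeled data. Method: Compares three approaches to predicting PME: (1) supervised models trained on 3,010 ratings on a 5-point Likert scale; (2) zero-shot/few-shot LLMs without fine-tuning; (3) LLM digital twins that integrate individual characteristics and prior PME histories. Performance is evaluated on a test set of three held-out messages per participant using accuracy, Cohen's kappa, and F1. Result: LLM digital twins reach accuracies of 0.49 (content), 0.45 (coping), and 0.49 (quitting) across the three PME tasks, with directional accuracies on a simplified 3-point scale of 0.75, 0.66, and 0.70; on average this is 12 and 13 percentage points above zero/few-shot LLMs and supervised models, respectively. Their predictions are more widely dispersed, indicating greater sensitivity to individual differences. Conclusion: Integrating personal profiles into LLMs captures person-specific differences in PME and outperforms traditional supervised learning and prompting approaches; LLM digital twins offer a new paradigm for personalized smoking cessation and other health behavior interventions in mHealth. Abstract: Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery. This study evaluates whether large language models (LLMs) can accurately predict PME for smoking cessation messages. We evaluated multiple models for predicting PME across three domains: content quality, coping support, and quitting support. The dataset comprised 3010 message ratings (5-point Likert scale) from 301 young adult smokers. We compared (1) supervised learning models trained on labeled data, (2) zero and few-shot LLMs prompted without task-specific fine-tuning, and (3) LLM-based digital twins that incorporate individual characteristics and prior PME histories to generate personalized predictions. Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1. LLM-based digital twins outperformed zero and few-shot LLMs (12 percentage points on average) and supervised baselines (13 percentage points), achieving accuracies of 0.49 (content), 0.45 (coping), and 0.49 (quitting), with directional accuracies of 0.75, 0.66, and 0.70 on a simplified 3-point scale. Digital twin predictions showed greater dispersion across rating categories, indicating improved sensitivity to individual differences. 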
Integrating personal profiles with LLMs captures person-specific differences in PME and outperforms supervised and zero and few-shot approaches. Improved PME prediction may enable more tailored intervention content in mHealth. LLM-based digital twins show potential for supporting personalization of mobile smoking cessation and other health behavior change interventions.
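The gap between exact accuracy (about 0.45 to 0.49) and directional accuracy (0.66 to 0.75) comes from collapsing the 5-point scale to 3 points before comparing. The sketch below shows one way this could be computed. The collapse rule (1-2 negative, 3 neutral, 4-5 positive) is an assumption, since the abstract does not state the mapping, and the ratings are made-up illustration data.

```python
def to_direction(rating):
    """Collapse a 5-point Likert rating to a 3-point direction.

    Assumed mapping (not stated in the abstract):
    1-2 -> -1 (negative), 3 -> 0 (neutral), 4-5 -> +1 (positive).
    """
    if rating <= 2:
        return -1
    if rating == 3:
        return 0
    return 1


def exact_accuracy(true_ratings, predicted_ratings):
    """Share of predictions that match the 5-point rating exactly."""
    hits = sum(t == p for t, p in zip(true_ratings, predicted_ratings))
    return hits / len(true_ratings)


def directional_accuracy(true_ratings, predicted_ratings):
    """Share of predictions that match after collapsing to 3 points."""
    hits = sum(
        to_direction(t) == to_direction(p)
        for t, p in zip(true_ratings, predicted_ratings)
    )
    return hits / len(true_ratings)


# Made-up ratings for illustration only.
true_ratings = [5, 4, 3, 2, 1]
predicted_ratings = [4, 4, 3, 1, 3]
exact = exact_accuracy(true_ratings, predicted_ratings)
directional = directional_accuracy(true_ratings, predicted_ratings)
```

In this toy case a prediction of 4 for a true rating of 5 counts as a miss under exact accuracy but as a hit under directional accuracy, which is why the second figure is always at least as high as the first.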

[41] Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Arindam Khaled

Main category: cs.CL

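TL;DR: This paper proposes Pyramid MoA, a hierarchical Mixture-of-Agents architecture in which a lightweight Router dynamically escalates queries, using semantic agreement and confidence calibration among an ensemble of small models to identify hard problems. On GSM8K it reaches 93.0% accuracy, compared with 98.0% for the Oracle model, while cutting compute cost by 61% and adding only 0.82 s of latency.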

Details Motivation: Large language models face a persistent trade-off between inference cost and reasoning capability: large models are accurate but expensive to deploy, while small models are cheap but struggle with complex tasks. Method: Proposes Pyramid MoA, a hierarchical Mixture-of-Agents architecture in which a lightweight Router uses semantic agreement and confidence calibration among several small models to decide dynamically which queries are "hard" and escalates only those. Result: Reaches 93.0% accuracy on the GSM8K benchmark, compared with 98.0% for the Oracle model; compute cost drops by 61%, latency increases by only 0.82 s, and the trade-off between performance and budget is tunable. Conclusion: Pyramid MoA keeps reasoning performance high while substantially reducing cost, offering an efficient and practical compromise for high-throughput LLM deployment. Abstract: Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. By leveraging semantic agreement and confidence calibration among an ensemble of small models, our Router identifies "hard" problems with high precision. On the GSM8K benchmark, our system achieves 93.0% accuracy, effectively matching the Oracle baseline (98.0%) while reducing compute costs by 61%. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
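The escalation logic described above can be sketched as a simple gate: accept the small-model ensemble's answer when the models agree and are confident, otherwise call the expensive model. This is an illustrative approximation and not the paper's Router: exact string matching stands in for semantic agreement, raw confidence scores stand in for calibrated ones, and the function name and thresholds are hypothetical.

```python
from collections import Counter


def pyramid_answer(query, small_models, oracle,
                   agree_threshold=0.6, conf_threshold=0.7):
    """Return (answer, escalated) for `query`.

    small_models: callables mapping a query to (answer, confidence).
    oracle: callable mapping a query to an answer (the expensive model).
    Thresholds are hypothetical knobs for the performance/budget trade-off.
    """
    outputs = [model(query) for model in small_models]
    votes = Counter(answer for answer, _ in outputs)
    top_answer, top_count = votes.most_common(1)[0]

    # Agreement: share of small models that gave the most common answer.
    agreement = top_count / len(outputs)
    # Confidence: mean confidence of the models backing that answer.
    backing = [conf for answer, conf in outputs if answer == top_answer]
    mean_conf = sum(backing) / len(backing)

    if agreement >= agree_threshold and mean_conf >= conf_threshold:
        return top_answer, False  # easy query: keep the cheap answer
    return oracle(query), True    # hard query: escalate


# Toy stand-ins for real models.
agreeing = [lambda q: ("42", 0.9), lambda q: ("42", 0.8), lambda q: ("41", 0.6)]
split = [lambda q: ("1", 0.9), lambda q: ("2", 0.9), lambda q: ("3", 0.9)]
toy_oracle = lambda q: "7"

easy_result = pyramid_answer("q", agreeing, toy_oracle)
hard_result = pyramid_answer("q", split, toy_oracle)
```

Raising either threshold sends more queries to the oracle, trading cost for accuracy, which is one way the tunable trade-off mentioned in the abstract could be exposed.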

[42] How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

Yinuo Xu,Shuo Lu,Jianjie Cheng,Meng Wang,Qianlong Xie,Xingxing Wang,Ran He,Jian Liang

Main category: cs.CL

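TL;DR: This paper systematically studies the role of reinforcement learning in Deep Research agents, analyzing three decoupled dimensions (prompt template, reward function, and policy optimization), and introduces an improved baseline, Search-R1++.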

Details Motivation: Reinforcement learning has been shown to improve Deep Research agents, but its specific contributions are not well understood, which calls for a systematic study. Method: Runs experiments along three decoupled dimensions: prompt template (Fast/Slow Thinking), reward function (F1, EM, F1 with action penalties), and policy optimization method (REINFORCE, PPO, GRPO), and builds Search-R1++ from the findings. Result: The Fast Thinking template is more stable and performs better; the F1 reward suffers training collapse driven by answer avoidance, but surpasses EM once action-level penalties are added; REINFORCE outperforms PPO with fewer search actions, and GRPO is the least stable; Search-R1++ improves performance on Qwen2.5-7B and Qwen2.5-3B. Conclusion: The effect of RL in Deep Research agents depends heavily on template design, reward construction, and the choice of policy algorithm, which need to be optimized together; the study offers empirical grounding and practical guidance for more reliable RL training strategies. Abstract: Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
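The three reward variants compared in the study can be made concrete with a small sketch. EM and token-level F1 below follow their standard QA definitions with simplified normalization (lowercasing and whitespace splitting). The penalized variant is purely an assumption: the abstract does not specify the form of the action-level penalties, so the no-answer penalty and per-search cost here are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def token_f1(prediction, gold):
    """Token-level F1 between prediction and gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def penalized_f1(prediction, gold, n_searches, answered,
                 no_answer_penalty=0.5, search_cost=0.02):
    """F1 reward minus hypothetical action-level penalties.

    Both penalty terms are illustrative assumptions, not the paper's values.
    """
    reward = token_f1(prediction, gold)
    if not answered:
        reward -= no_answer_penalty  # discourage answer avoidance
    return reward - search_cost * n_searches
```

Under plain F1, a rollout that never answers scores 0, the same as a wrong answer, so avoiding answers costs nothing; making non-answers strictly worse than attempts is one plausible way a penalty could counter the collapse the abstract describes.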

[43] Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

Rizhuo Huang,Yifan Feng,Rundong Xue,Shihui Ying,Jun-Hai Yong,Chuan Shi,Shaoyi Du,Yue Gao

Main category: cs.CL

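TL;DR: This paper proposes the Hyper-KGGen framework, which uses coarse-to-fine decomposition and adaptive skill acquisition to address the scenario gap in cross-domain knowledge hypergraph construction, and releases a new benchmark, HyperDocRED.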

Details Motivation: Traditional binary knowledge graphs struggle to express complex n-ary facts, and existing hypergraph construction methods face a "scenario gap" in cross-domain generalization and in balancing structure against detail. Method: Proposes the skill-driven Hyper-KGGen framework: 1) a coarse-to-fine document decomposition mechanism; 2) an adaptive skill acquisition module driven by stability-based feedback that dynamically builds a Global Skill Library; also builds the new benchmark HyperDocRED. Result: In multi-scenario experiments, Hyper-KGGen significantly outperforms strong baselines, showing that evolved skills provide richer guidance than static few-shot examples. Conclusion: Dynamic skill evolution is an effective paradigm for improving the cross-domain robustness and fine-grained quality of knowledge hypergraph extraction. Abstract: Knowledge hypergraphs surpass traditional binary knowledge graphs by encapsulating complex n-ary atomic facts, providing a more comprehensive paradigm for semantic representation. However, constructing high-quality hypergraphs remains challenging due to the scenario gap: generic extractors struggle to generalize across diverse domains with specific jargon, while existing methods often fail to balance structural skeletons with fine-grained details. To bridge this gap, we propose Hyper-KGGen, a skill-driven framework that reformulates extraction as a dynamic skill-evolving process. First, Hyper-KGGen employs a coarse-to-fine mechanism to systematically decompose documents, ensuring full-dimensional coverage from binary links to complex hyperedges. Crucially, it incorporates an adaptive skill acquisition module that actively distills domain expertise into a Global Skill Library. This is achieved via a stability-based feedback loop, where extraction stability serves as a relative reward signal to induce high-quality skills from unstable traces and missed predictions. Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction. Experiments demonstrate that Hyper-KGGen significantly outperforms strong baselines, validating that evolved skills provide substantially richer guidance than static few-shot examples in multi-scenario settings.

[44] Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

Jeffrey Li,Josh Gardner,Doug Kang,Fangping Shi,Karanjeet Singh,Chun-Liang Li,Herumb Shandilya,David Hall,Oncel Tuzel,Percy Liang,Ludwig Schmidt,Hadi Pour Ansari,Fartash Faghri

Main category: cs.CL

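TL;DR: This paper studies how the choice of HTML text extractor affects data coverage and downstream performance when building large-scale LLM pretraining datasets. It finds that taking the union over multiple extractors substantially increases token yield without hurting performance, and that extractor choice matters most for structured content such as tables and code.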

Details Motivation: Existing open-source datasets mostly apply a single fixed HTML text extractor, which may lead to insufficient coverage and utilization of Internet data. Method: Compares different HTML text extractors, analyzes their effect on page retention, token yield, and downstream task performance, and proposes taking the union of the results from multiple extractors. Result: The union strategy increases the token yield of DCLM-Baseline by up to 71% with unchanged performance on standard language understanding tasks; for structured content, extractor choice changes performance by up to 10 percentage points on WikiTQ and 3 percentage points on HumanEval. Conclusion: A single fixed extractor is limiting; taking the union over multiple extractors is a simple and effective way to improve data utilization and performance on specific downstream tasks. Abstract: One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
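The Union intervention can be pictured as follows: run every extractor on each page, apply the same fixed filter, and keep the page if any extractor's output survives. The sketch below is a minimal illustration under stated assumptions. The function name is hypothetical, the toy extractors and filter are made up, and resolving overlaps by taking the first surviving extractor in priority order is an assumption the abstract does not confirm.

```python
def union_over_extractors(pages, extractors, keep):
    """Keep each page whose text survives the filter under any extractor.

    pages: dict mapping URL -> raw HTML string.
    extractors: dict mapping extractor name -> callable(html) -> text,
        iterated in priority order (first surviving output wins; this
        tie-breaking rule is an assumption).
    keep: callable(text) -> bool, the fixed filtering pipeline.
    """
    kept = {}
    for url, html in pages.items():
        for name, extract in extractors.items():
            text = extract(html)
            if keep(text):
                kept[url] = (name, text)
                break  # one copy per page, so the union adds no duplicates
    return kept


# Toy stand-ins: a "strict" extractor that drops table-like pages and a
# "lenient" one that keeps everything; the filter wants at least 3 words.
toy_pages = {
    "u1": "plain prose with enough words",
    "u2": "table of many useful numbers",
    "u3": "tiny",
}
toy_extractors = {
    "strict": lambda html: "" if "table" in html else html,
    "lenient": lambda html: html,
}
kept = union_over_extractors(
    toy_pages, toy_extractors, keep=lambda text: len(text.split()) >= 3
)
```

Here the page `u2` is lost under the strict extractor alone but recovered by the lenient one, which mirrors how a union can raise token yield while the filter itself stays fixed.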

[45] Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

Yibo Yan,Mingdong Ou,Yi Cao,Xin Zou,Jiahao Huo,Shuliang Liu,James Kwok,Xuming Hu

Main category: cs.CL

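TL;DR: This paper proposes Prune-then-Merge, a two-stage framework that first adaptively prunes low-information image patches and then hierarchically merges the remaining high-signal embeddings, achieving efficient, high-fidelity multi-vector compression for visual document retrieval.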

Details Motivation: The multi-vector paradigm for visual document retrieval (VDR) performs well but is costly; existing compression methods such as pruning and merging struggle to combine a high compression rate with feature fidelity, creating a difficult trade-off. Method: Proposes the two-stage Prune-then-Merge framework: the first stage adaptively prunes low-information patches and keeps high-signal embeddings; the second stage hierarchically merges the pruned set, compressing semantic content while avoiding noise-induced feature dilution. Result: Experiments on 29 VDR datasets show that the method significantly extends the near-lossless compression range, stays robust at high compression ratios, and consistently outperforms existing methods. Conclusion: By combining pruning and merging, Prune-then-Merge eases the tension between efficiency and fidelity and offers a better lightweight solution for multi-vector VDR. Abstract: Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
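The two-stage idea can be illustrated with a deliberately simplified sketch. It assumes each patch embedding already has an importance score (how the paper scores patches is not stated in the abstract), prunes by a fixed keep ratio rather than adaptively, and uses mean pooling over adjacent groups as a stand-in for hierarchical merging. All names and parameters are hypothetical.

```python
def prune(embeddings, scores, keep_ratio=0.5):
    """Stage 1: keep the highest-scoring patches, preserving original order."""
    n_keep = max(1, int(len(embeddings) * keep_ratio))
    top = sorted(range(len(embeddings)),
                 key=lambda i: scores[i], reverse=True)[:n_keep]
    return [embeddings[i] for i in sorted(top)]


def merge(embeddings, group_size=2):
    """Stage 2: mean-pool adjacent groups (stand-in for hierarchical merging)."""
    merged = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group)
                       for d in range(dim)])
    return merged


def prune_then_merge(embeddings, scores, keep_ratio=0.5, group_size=2):
    """Prune low-score patches first, then merge only what survives."""
    return merge(prune(embeddings, scores, keep_ratio), group_size)


# Toy 2-D patch embeddings with made-up importance scores.
patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
patch_scores = [0.9, 0.1, 0.8, 0.2]
compressed = prune_then_merge(patches, patch_scores)
```

Merging all four toy patches directly would average the two low-score vectors into the result; pruning first means only the two high-score patches contribute, which is the feature-dilution point the abstract makes about single-stage methods.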

[46] Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering

Wuzhenghong Wen,Bowen Zhou,Jinwen Huang,Xianjie Wu,Yuwei Sun,Su Pan,Liang Li,Jianting Liu

Main category: cs.CL

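TL;DR: This paper proposes a new framework for temporal knowledge graph question answering (TKGQA) that combines temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion to better handle time-sensitive queries.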

Details Motivation: Existing TKGQA methods have three main problems: temporal constraints are weakly incorporated into the question representation, which biases reasoning; explicit multi-hop reasoning ability is limited; and the fusion of language and graph representations is suboptimal. Method: Proposes a new framework with: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal graph neural network with time-aware message passing for explicit multi-hop reasoning; 3) a multi-view attention mechanism that fuses question context with temporal graph knowledge. Result: Experiments on multiple TKGQA benchmarks show consistent improvements over multiple baselines. Conclusion: The framework mitigates the limitations of current TKGQA methods in temporal modeling, multi-hop reasoning, and cross-modal fusion, and improves temporal question answering performance. Abstract: Question Answering over Temporal Knowledge Graphs (TKGQA) has attracted growing interest for handling time-sensitive queries. However, existing methods still struggle with: 1) weak incorporation of temporal constraints in question representation, causing biased reasoning; 2) limited ability to perform explicit multi-hop reasoning; and 3) suboptimal fusion of language and graph representations. We propose a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion. Specifically, our approach introduces: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal-aware graph neural network for explicit multi-hop reasoning via time-aware message passing; and 3) a multi-view attention mechanism for more effective fusion of question context and temporal graph knowledge. Experiments on multiple TKGQA benchmarks demonstrate consistent improvements over multiple baselines.

[47] DEEP: Docker-based Execution and Evaluation Platform

Sergio Gómez González,Miguel Domingo,Francisco Casacuberta

Main category: cs.CL

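TL;DR: This paper presents DEEP, automated evaluation software for executing and scoring machine translation and OCR models that can be extended to other tasks. It runs models in Docker, clusters them by performance using a statistical significance analysis of the results, and provides a visualization web app to help interpret the results.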

Details Motivation: Comparative evaluation of systems is repetitive, labor-intensive, and hard to interpret; model selection, research validation, and public competitions all need an automated, reproducible, and statistically meaningful evaluation tool. Method: Develops DEEP, an extensible Docker-based evaluation framework that integrates automatic execution, multi-metric scoring, and a clustering algorithm based on statistical significance analysis, together with a visualization web app; it takes hypothesis-reference pairs as input, extracts runtime information, and quantifies performance differences. Result: DEEP automates evaluation for MT and OCR tasks, identifies clusters of models with similar performance, improves the interpretability and statistical credibility of results, and is demonstrated on a practical use case. Conclusion: DEEP offers a standardized, automated, and interpretable approach to AI system evaluation that combines engineering practicality with methodological novelty and could extend to broader evaluation scenarios. Abstract: Comparative evaluation of several systems is a recurrent task in researching. It is a key step before deciding which system to use for our work, or, once our research has been conducted, to demonstrate the potential of the resulting model. Furthermore, it is the main task of competitive, public challenges evaluation. Our proposed software (DEEP) automates both the execution and scoring of machine translation and optical character recognition models. Furthermore, it is easily extensible to other tasks. DEEP is prepared to receive dockerized systems, run them (extracting information at that same time), and assess hypothesis against some references. With this approach, evaluators can achieve a better understanding of the performance of each model. Moreover, the software uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model, according to the evaluation metrics. As a result, evaluators are able to identify clusters of performance among the swarm of proposals and have a better understanding of the significance of their differences. Additionally, we offer a visualization web-app to ensure that the results can be adequately understood and interpreted. Finally, we present an exemplary case of use of DEEP.

[48] Eye-Tracking-while-Reading: A Living Survey of Datasets with Open Library Support

Deborah N. Jakobi,David R. Reich,Paul Prasse,Jana M. Hofmann,Lena S. Bolliger,Lena A. Jäger

Main category: cs.CL

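TL;DR: This paper aims to make eye-tracking-while-reading datasets more transparent and reusable by providing a comprehensive survey of existing datasets, maintaining a living online overview of dataset features, and integrating publicly available datasets into the Python library pymovements, in support of the FAIR principles and research reproducibility.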

Details Motivation: Existing eye-tracking-while-reading datasets are scattered across disciplines and lack common data-sharing standards, which leads to poor interoperability and makes reuse difficult. Method: 1) Systematically surveys existing eye-tracking-while-reading datasets; 2) builds and maintains a living online overview (https://dili-lab.github.io/datasets.html) covering more than 45 features per dataset; 3) integrates all publicly available datasets into the Python library pymovements, providing a standardized access interface. Result: Provides an online dataset overview covering more than 45 features and a pymovements library that integrates multiple public eye-tracking-while-reading datasets, improving data discoverability, accessibility, and interoperability. Conclusion: The work advances FAIR (findable, accessible, interoperable, reusable) practice in eye-tracking-while-reading research and provides infrastructure for cross-disciplinary collaboration, replication, and method validation. Abstract: Eye-tracking-while-reading corpora are a valuable resource for many different disciplines and use cases. Use cases range from studying the cognitive processes underlying reading to machine-learning-based applications, such as gaze-based assessments of reading comprehension. The past decades have seen an increase in the number and size of eye-tracking-while-reading datasets as well as increasing diversity with regard to the stimulus languages covered, the linguistic background of the participants, or accompanying psychometric or demographic data. The spread of data across different disciplines and the lack of data sharing standards across the communities lead to many existing datasets that cannot be easily reused due to a lack of interoperability. In this work, we aim at creating more transparency and clarity with regards to existing datasets and their features across different disciplines by i) presenting an extensive overview of existing datasets, ii) simplifying the sharing of newly created datasets by publishing a living overview online, https://dili-lab.github.io/datasets.html, presenting over 45 features for each dataset, and iii) integrating all publicly available datasets into the Python package pymovements which offers an eye-tracking datasets library. By doing so, we aim to strengthen the FAIR principles in eye-tracking-while-reading research and promote good scientific practices, such as reproducing and replicating studies.

[49] Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Borisiuk Anna,Andrey Savchenko,Alexander Panchecko,Elena Tutubalina

Main category: cs.CL

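TL;DR: This paper introduces the DUAL benchmark for evaluating knowledge unlearning in large language models at different training stages (pretraining versus supervised fine-tuning). It finds that unlearning after an SFT step is more stable and efficient, whereas unlearning directly on pretrained models is prone to relearning or catastrophic forgetting.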

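Details Motivation: Existing machine unlearning methods assume that all facts are equally forgettable and ignore where the knowledge comes from (pretraining versus supervised fine-tuning); there is no systematic evaluation of how unlearning behaves across training stages. Method: Builds the DUAL benchmark: 28.6k Wikidata triplets annotated with fact popularity (Wikipedia link counts plus LLM-based salience scores); compares pretrained and SFT models on unlearning tasks, analyzing smoothness of forgetting, stability, and retention. Result: SFT models show smoother forgetting, more stable tuning, and 10-50% higher knowledge retention; direct unlearning on pretrained models is unstable and prone to relearning or catastrophic forgetting. Conclusion: Unlearning effectiveness depends heavily on the training stage in which the knowledge sits; machine unlearning is more controllable and efficient when applied at the supervised fine-tuning stage rather than directly to pretrained models. Abstract: Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.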

[50] KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Alex Robertson,Huizhi Liang,Mahbub Gani,Rohit Kumar,Srijith Rajamohan

Main category: cs.CL

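TL;DR: This paper presents KGHaluBench, a knowledge-graph-based hallucination benchmark that uses dynamically generated multifaceted questions and an automated verification pipeline to assess the truthfulness of large language models (LLMs) across the breadth and depth of their knowledge, and to reveal the knowledge factors that cause hallucinations.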

Details Motivation: Existing hallucination benchmarks are limited by static, narrow questions, which gives insufficient coverage and can produce misleading evaluations; a more comprehensive and fairer evaluation method is needed. Method: Builds a knowledge-graph (KG) based framework for dynamic question generation, with statistical difficulty estimation to mitigate popularity bias; designs an automated verification pipeline that detects different types of hallucination at the conceptual and correctness levels, and introduces new accuracy and hallucination metrics. Result: Evaluates 25 frontier models; the results reveal the key knowledge factors that cause hallucinations across model sizes and offer more interpretable analysis; KGHaluBench is publicly available. Conclusion: KGHaluBench improves the breadth, depth, and fairness of LLM hallucination evaluation and provides a reliable benchmark and insights for future work on hallucination mitigation. Abstract: Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.

cs.CV [Back]

[51] Replication Study: Federated Text-Driven Prompt Generation for Vision-Language Models

Suraj Prasad,Anubha Pant

Main category: cs.CV

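TL;DR: This paper faithfully replicates FedTPG on six vision datasets, verifying that text-driven prompt generation improves generalization to unseen classes in federated learning; the results closely match the original paper, confirming the method's robustness and reproducibility.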

Details Motivation: Addresses the poor generalization of vision-language models (such as CLIP) to unseen classes in federated learning. Method: Replicates FedTPG, which uses a text-driven prompt generation network that dynamically generates prompts from class names and is trained in a federated setting. Result: Replication results on six datasets are within 0.2% of the original paper, with an average accuracy of 74.58% on seen classes and 76.00% on unseen classes, a reported 1.43 percentage point improvement in unseen-class generalization. Conclusion: FedTPG effectively improves generalization to unseen classes in federated learning without sharing private data; the replication confirms the correctness, robustness, and reproducibility of its core claims. Abstract: Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities, yet their adaptation to federated learning scenarios presents significant challenges, particularly regarding generalization to unseen classes. The original FedTPG paper [Qiu2024] addresses this limitation by introducing a text-driven prompt generation network that dynamically creates prompts conditioned on class names, enabling better cross-class generalization in federated settings. In this work, we present a faithful replication study of FedTPG, evaluating the pre-trained model on six diverse vision datasets: Caltech101, Oxford Flowers, FGVC Aircraft, Oxford Pets, Food-101, and DTD. Our evaluation achieves results within 0.2% of the original paper's reported accuracies, with an average accuracy of 74.58% on seen (base) classes and 76.00% on unseen (new) classes, demonstrating a +1.43 percentage point improvement in generalization. These results validate the original paper's core claims: (1) text-driven prompt generation enables superior generalization to unseen classes compared to static prompt learning methods, and (2) federated training of prompt generators maintains high performance across diverse visual domains without sharing private data. Our successful replication confirms the robustness and reproducibility of the FedTPG approach.

[52] A Patient-Specific Digital Twin for Adaptive Radiotherapy of Non-Small Cell Lung Cancer

Anvi Sud,Jialu Huang,Gregory R. Hart,Keshav Saxena,John Kim,Lauren Tressel,Jun Deng

Main category: cs.CV

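TL;DR: This paper presents COMPASS, a system that uses a temporal AI model (a GRU autoencoder) to integrate PET/CT imaging, radiotherapy dose, and biologically equivalent dose (BED) kinetics into a patient-specific digital twin of normal tissue, enabling dynamic prediction of early toxicity risk during radiotherapy for NSCLC patients.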

Details Motivation: Conventional NTCP models rely on static, population-based data and cannot capture how an individual patient's normal tissue evolves biologically over time, while BGRT produces rich temporal multimodal data that calls for AI-driven dynamic modeling. Method: Builds the COMPASS temporal digital twin architecture, which fuses per-fraction PET, CT, dosiomics, radiomics, and cumulative BED kinetics; a GRU autoencoder learns organ-specific latent trajectories, and logistic regression classifies them to predict CTCAE grade ≥1 toxicity. Result: Evaluated on 8 NSCLC patients (99 organ-fraction observations, 24 organ trajectories), the system raised risk warnings several fractions before clinical toxicity appeared; the BED-driven representation revealed biologically relevant spatial dose texture features that conventional volume-based dosimetry averages out. Conclusion: COMPASS is a proof of concept for AI-enabled adaptive radiotherapy, suggesting that real-time monitoring of biological response through a patient-specific dynamic digital twin could improve the safety and precision of radiotherapy. Abstract: Radiotherapy continues to become more precise and data-dense, with current treatment regimens generating high-frequency imaging and dosimetry streams ideally suited for AI-driven temporal modeling to characterize how normal tissues evolve with time. Each fraction in biologically guided radiotherapy (BGRT)-treated non-small cell lung cancer (NSCLC) patients records new metabolic, anatomical, and dose information. However, clinical decision making is largely informed by static, population-based NTCP models which overlook the dynamic, unique biological trajectories encoded in sequential data. We developed COMPASS (Comprehensive Personalized Assessment System) for safe radiotherapy, functioning as a temporal digital twin architecture utilizing per-fraction PET, CT, dosiomics, radiomics, and cumulative biologically equivalent dose (BED) kinetics to model normal tissue biology as a dynamic time series process. A GRU autoencoder was employed to learn organ-specific latent trajectories, which were classified via logistic regression to predict eventual CTCAE grade 1 or higher toxicity. Eight NSCLC patients undergoing BGRT contributed to the 99 organ-fraction observations covering 24 organ trajectories (spinal cord, heart, and esophagus). Despite the small cohort, intensive temporal phenotyping allowed for comprehensive analysis of individual dose response dynamics. Our findings revealed a viable AI-driven early warning window, as increasing risk ratings occurred from several fractions before clinical toxicity. 
The dense BED-driven representation revealed biologically relevant spatial dose texture characteristics that occur before toxicity and are averaged out with traditional volume-based dosimetry. COMPASS establishes a proof of concept for AI-enabled adaptive radiotherapy, where treatment is guided by a continually updated digital twin that tracks each patient's evolving biological response.
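The cumulative BED kinetics mentioned in the abstract can be illustrated with the standard linear-quadratic formula, BED = Σ d·(1 + d/(α/β)), summed over the fractions delivered so far. The sketch below is a minimal illustration, not the COMPASS implementation: the function name, the per-fraction doses, and the α/β value of 3 Gy are assumptions chosen for the example.

```python
def cumulative_bed(doses_gy, alpha_beta_gy):
    """Running biologically effective dose (Gy) after each fraction.

    Uses the linear-quadratic model: each fraction of dose d adds
    d * (1 + d / (alpha/beta)) to the cumulative BED.
    """
    total = 0.0
    trajectory = []
    for d in doses_gy:
        total += d * (1.0 + d / alpha_beta_gy)
        trajectory.append(total)
    return trajectory


# Hypothetical organ doses per fraction (Gy) and an assumed alpha/beta of 3 Gy.
trajectory = cumulative_bed([2.0, 2.0, 2.0], alpha_beta_gy=3.0)
print(trajectory)
```

A per-organ trajectory of this kind is one example of a time series that a sequence model could consume alongside imaging features.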

[53] Scaling Ultrasound Volumetric Reconstruction via Mobile Augmented Reality

Kian Wei Ng,Yujia Gao,Deborah Khoo,Ying Zhen Tan,Chengzheng Mao,Haojie Cheng,Andrew Makmur,Kee Yuan Ngiam,Serene Goh,Eng Tat Khoo

Main category: cs.CV

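TL;DR: This paper presents MARVUS, a mobile augmented reality volumetric ultrasound system that uses a foundation model and AR visualization to substantially improve the accuracy and reproducibility of lesion volume measurement with 2D ultrasound, without added hardware cost.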

Details Motivation: 2D ultrasound is the preferred first-line modality for breast and thyroid imaging, but its volume estimates vary widely between operators; existing 3D ultrasound solutions need specialized probes or external tracking hardware, which raises cost, reduces portability, and limits clinical adoption. Method: Proposes MARVUS, which is compatible with conventional ultrasound systems, uses a foundation model to improve cross-specialty generalization, and uses mobile AR for volume reconstruction and visualization without extra hardware; a user study with multiple clinicians on breast phantoms evaluates accuracy and consistency. Result: In the breast phantom study, MARVUS improved volume estimation accuracy (mean difference: 0.469 cm³) and reduced inter-user variability (mean difference: 0.417 cm³); AR visualization improved both objective performance metrics and clinician-reported usability. Conclusion: MARVUS is a low-cost, portable, easy-to-deploy solution that could improve the accuracy and scalability of ultrasound-based cancer screening, diagnosis, and treatment planning in resource-constrained settings. Abstract: Accurate volumetric characterization of lesions is essential for oncologic diagnosis, risk stratification, and treatment planning. While imaging modalities such as Computed Tomography provide high-quality 3D data, 2D ultrasound (2D-US) remains the preferred first-line modality for breast and thyroid imaging due to cost, portability, and safety factors. However, volume estimates derived from 2D-US suffer from high inter-user variability even among experienced clinicians. Existing 3D ultrasound (3D-US) solutions use specialized probes or external tracking hardware, but such configurations increase costs and diminish portability, constraining widespread clinical use. To address these limitations, we present Mobile Augmented Reality Volumetric Ultrasound (MARVUS), a resource-efficient system designed to increase accessibility to accurate and reproducible volumetric assessment. MARVUS is interoperable with conventional ultrasound (US) systems, using a foundation model to enhance cross-specialty generalization while minimizing hardware requirements relative to current 3D-US solutions. In a user study involving experienced clinicians performing measurements on breast phantoms, MARVUS yielded a substantial improvement in volume estimation accuracy (mean difference: 0.469 cm³) with reduced inter-user variability (mean difference: 0.417 cm³). Additionally, we show that augmented reality (AR) visualizations enhance objective performance metrics and clinician-reported usability. 
Collectively, our findings suggest that MARVUS can enhance US-based cancer screening, diagnostic workflows, and treatment planning in a scalable, cost-conscious, and resource-efficient manner. Usage video demonstration available (https://youtu.be/m4llYcZpqmM).

[54] Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study

Sarah Müller,Philipp Berens

Main category: cs.CV

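TL;DR: This paper systematically evaluates feature disentanglement methods for mitigating shortcut learning in medical imaging and finds that combining data rebalancing with model-centric disentanglement improves classification performance more robustly under strong spurious correlations.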

Details Motivation: Deep learning models in medical imaging tend to rely on shortcut learning (exploiting spurious correlations or confounding factors), which harms generalization across institutions, populations, and devices and poses clinical risks. Method: Systematically evaluates feature disentanglement methods, including adversarial learning and latent space splitting based on dependence minimization, analyzing classification performance, disentanglement quality, robustness, and computational efficiency on one artificial and two real medical datasets. Result: Shortcut mitigation methods improved classification performance when the training data contained strong spurious correlations; latent space analyses revealed representation differences that classification metrics miss; model reliance on shortcuts depended on the degree of confounding in the training data; the best results came from combining data rebalancing with model disentanglement. Conclusion: Combining data-centric rebalancing with model-centric disentanglement mitigates shortcut learning more strongly and robustly than rebalancing alone, at similar computational cost. Abstract: Although deep learning models in medical imaging often achieve excellent classification performance, they can rely on shortcut learning, exploiting spurious correlations or confounding factors that are not causally related to the target task. This poses risks in clinical settings, where models must generalize across institutions, populations, and acquisition conditions. Feature disentanglement is a promising approach to mitigate shortcut learning by separating task-relevant information from confounder-related features in latent representations. In this study, we systematically evaluated feature disentanglement methods for mitigating shortcuts in medical imaging, including adversarial learning and latent space splitting based on dependence minimization. We assessed classification performance and disentanglement quality using latent space analyses across one artificial and two medical datasets with natural and synthetic confounders. We also examined robustness under varying levels of confounding and compared computational efficiency across methods. We found that shortcut mitigation methods improved classification performance under strong spurious correlations during training. Latent space analyses revealed differences in representation quality not captured by classification metrics, highlighting the strengths and limitations of each method. Model reliance on shortcuts depended on the degree of confounding in the training data. 
The best-performing models combine data-centric rebalancing with model-centric disentanglement, achieving stronger and more robust shortcut mitigation than rebalancing alone while maintaining similar computational efficiency.
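One common form of data-centric rebalancing is to weight each training sample inversely to the frequency of its (label, confounder) group, so that rare combinations are not drowned out by the spurious majority. The sketch below assumes that form; the abstract does not say which rebalancing scheme the study used, and the function name and toy data are hypothetical.

```python
from collections import Counter


def group_balanced_weights(labels, confounders):
    """Weight each sample so every (label, confounder) group has equal total weight."""
    groups = list(zip(labels, confounders))
    counts = Counter(groups)
    n_samples = len(groups)
    n_groups = len(counts)
    return [n_samples / (n_groups * counts[g]) for g in groups]


# Hypothetical toy data: the confounder matches the label in 6 of 8 samples.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
confounders = [1, 1, 1, 0, 0, 0, 0, 1]
weights = group_balanced_weights(labels, confounders)
print(weights)
```

The two samples that break the spurious correlation receive three times the weight of the majority samples, and the weights still sum to the dataset size.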

[55] A Computer Vision Framework for Multi-Class Detection and Tracking in Soccer Broadcast Footage

Daniel Tshiani

Main category: cs.CV

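TL;DR: This paper presents an end-to-end computer vision system for single-camera broadcast video that uses a YOLO detector and the ByteTrack tracking algorithm to detect and track players, referees, goalkeepers, and the ball, aiming to give budget-constrained teams a low-cost, scalable option for soccer analytics.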

Details Motivation: Expensive multi-camera or GPS systems give some clubs a data advantage that lower-budget teams cannot match; the paper asks whether comparable spatial data can be extracted from standard broadcast footage (a single camera) alone. Method: Builds an end-to-end single-camera vision pipeline: YOLO for object detection combined with ByteTrack for multi-object tracking, covering players, referees, goalkeepers, and the ball. Result: The system detects and tracks players and referees well (high precision, recall, and mAP50), but ball detection remains the main challenge; it still extracts meaningful player-level spatial information. Conclusion: AI can extract useful soccer spatial data from a single broadcast camera, reducing reliance on specialized hardware, making data-driven analysis affordable for colleges, academies, and amateur clubs, and helping to democratize soccer analytics. Abstract: Clubs with access to expensive multi-camera setups or GPS tracking systems gain a competitive advantage through detailed data, whereas lower-budget teams are often unable to collect similar information. This paper examines whether such data can instead be extracted directly from standard broadcast footage using a single-camera computer vision pipeline. This project develops an end-to-end system that combines a YOLO object detector with the ByteTrack tracking algorithm to identify and track players, referees, goalkeepers, and the ball throughout a match. Experimental results show that the pipeline achieves high performance in detecting and tracking players and officials, with strong precision, recall, and mAP50 scores, while ball detection remains the primary challenge. Despite this limitation, our findings demonstrate that AI can extract meaningful player-level spatial information from a single broadcast camera. By reducing reliance on specialized hardware, the proposed approach enables colleges, academies, and amateur clubs to adopt scalable, data-driven analysis methods previously accessible only to professional teams, highlighting the potential for affordable computer vision-based soccer analytics.
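The mAP50 score reported above counts a detection as correct when its box overlaps a ground-truth box with intersection-over-union (IoU) of at least 0.5. A minimal sketch of that overlap test follows; the box coordinates are hypothetical and the corner format (x1, y1, x2, y2) is an assumption, since the abstract does not describe the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth player boxes, in pixels.
predicted = (100, 50, 140, 150)
ground_truth = (105, 55, 145, 150)
overlap = iou(predicted, ground_truth)
is_true_positive_at_50 = overlap >= 0.5
print(overlap, is_true_positive_at_50)
```

Small objects such as the ball are harder under this criterion, because a shift of a few pixels removes a large share of the overlap.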

[56] Suppression or Deletion: A Restoration-Based Representation-Level Analysis of Machine Unlearning

Yurim Jang,Jaeung Lee,Dohyun Kim,Jaemin Jo,Simon S. Woo

Main category: cs.CV

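TL;DR: This paper proposes a restoration-based analysis framework that uses sparse autoencoders to identify class-specific expert features in intermediate layers and applies inference-time steering to distinguish suppression from deletion; it finds that most unlearning methods only suppress information at the decision-boundary level and do not delete semantic features from intermediate representations.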

Details Motivation: Existing evaluations of machine unlearning rely on output-based metrics, which cannot verify whether sensitive information has been deleted or merely suppressed at the representation level; suppression is not sufficient for true unlearning. Method: Proposes a restoration-based analysis framework that uses sparse autoencoders to identify class-specific expert features in intermediate layers and combines them with inference-time steering to quantitatively distinguish suppression from deletion; evaluates 12 major unlearning methods on image classification tasks. Result: Most unlearning methods show high restoration rates, indicating that they only suppress information at the decision-boundary level while intermediate semantic features are preserved; even retraining from pretrained checkpoints fails to remove these robust semantic features. Conclusion: Representation-level retention poses a significant risk that output-based metrics overlook; new unlearning evaluation criteria that prioritize representation-level verification are needed, especially in privacy-critical settings. Abstract: As pretrained models are increasingly shared on the web, ensuring that models can forget or delete sensitive, copyrighted, or private information upon request has become crucial. Machine unlearning has been proposed to address this challenge. However, current evaluations for unlearning methods rely on output-based metrics, which cannot verify whether information is completely deleted or merely suppressed at the representation level, where suppression is insufficient for true unlearning. To address this gap, we propose a novel restoration-based analysis framework that uses Sparse Autoencoders to identify class-specific expert features in intermediate layers and applies inference-time steering to quantitatively distinguish between suppression and deletion. Applying our framework to 12 major unlearning methods in image classification tasks, we find that most methods achieve high restoration rates of unlearned information, indicating that they only suppress information at the decision-boundary level, while preserving semantic features in intermediate representations. Notably, even retraining from pretrained checkpoints shows high restoration, revealing that robust semantic features inherited from pretraining are not removed by retraining. These results demonstrate that representation-level retention poses significant risks overlooked by output-based metrics, highlighting the need for new unlearning evaluation criteria. We propose new evaluation guidelines that prioritize representation-level verification, especially for privacy-critical applications in the era of pre-trained models.

[57] Depth from Defocus via Direct Optimization

Holly Jackson,Caleb Adams,Ignacio Lopez-Francos,Benjamin Recht

Main category: cs.CV

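TL;DR: This paper proposes a global optimization method based on alternating minimization for recovering depth maps from defocused images, and shows that it works at higher resolutions than current deep learning methods.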

Details Motivation: Although a reasonable forward model for blur based on optical physics exists, recovering depth from defocused images remains a computationally challenging optimization problem. Method: Uses alternating minimization: with the depth map fixed, the all-in-focus image is found by convex optimization; with the all-in-focus image fixed, the depth at each pixel is found independently by parallel grid search. Result: Experiments on benchmark datasets with synthetic and real defocus blur show promising results compared with prior approaches, at higher resolutions than current deep learning methods; the code is open source. Conclusion: With modern optimization methods and reasonable computing resources, global optimization is a feasible, scalable, and practical approach to depth from defocus. Abstract: Though there exists a reasonable forward model for blur based on optical physics, recovering depth from a collection of defocused images remains a computationally challenging optimization problem. In this paper, we show that with contemporary optimization methods and reasonable computing resources, a global optimization approach to depth from defocus is feasible. Our approach rests on alternating minimization. When holding the depth map fixed, the forward model is linear with respect to the all-in-focus image. When holding the all-in-focus image fixed, the depth at each pixel can be computed independently, enabling embarrassingly parallel computation. We show that alternating between convex optimization and parallel grid search can effectively solve the depth-from-defocus problem at higher resolutions than current deep learning methods. We demonstrate our approach on benchmark datasets with synthetic and real defocus blur and show promising results compared to prior approaches. Our code is available at github.com/hollyjackson/dfd.
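The depth half of the alternation can be sketched as an independent grid search per pixel. The toy below covers only that step and makes a strong simplifying assumption: for every candidate depth, the focal-stack intensities predicted for each pixel (with the all-in-focus image held fixed) are already available in `candidates`. It does not reproduce the paper's blur model or its convex image update, and all names and numbers are hypothetical.

```python
def per_pixel_grid_search(observed, candidates):
    """Pick, for each pixel, the candidate depth index with the lowest squared error.

    observed[p]      -- intensities of pixel p across the focal stack
    candidates[k][p] -- intensities predicted for pixel p under candidate depth k
    Pixels are processed independently, so the loop could run in parallel.
    """
    best = []
    for p, obs in enumerate(observed):
        errors = [
            sum((o - c) ** 2 for o, c in zip(obs, cand[p]))
            for cand in candidates
        ]
        best.append(min(range(len(errors)), key=errors.__getitem__))
    return best


# Hypothetical data: 2 pixels, 2 focus settings, 3 candidate depths.
observed = [[1.0, 0.2], [0.3, 0.9]]
candidates = [
    [[1.0, 0.2], [1.0, 0.2]],  # depth index 0: sharp in the first focus setting
    [[0.6, 0.6], [0.6, 0.6]],  # depth index 1: equally blurred in both
    [[0.3, 0.9], [0.3, 0.9]],  # depth index 2: sharp in the second focus setting
]
depth_indices = per_pixel_grid_search(observed, candidates)
print(depth_indices)
```

Because no pixel's choice depends on another's in this step, the search is embarrassingly parallel, which is the property the abstract highlights.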

[58] Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams

Aayam Bansal

Main category: cs.CV

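TL;DR: This paper presents Sketch2Feedback, a framework that decomposes diagram feedback into four stages (perception, symbolic graph construction, constraint checking, and constrained VLM feedback) so that a rule engine verifies violations before the language model explains them, with the aim of reducing hallucination by large models in educational settings.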

Details Motivation: Providing timely, rubric-aligned feedback on student hand-drawn diagrams is challenging in STEM education, and existing large multimodal models are prone to hallucination, which undermines trust in the classroom. Method: Proposes the grammar-in-the-loop Sketch2Feedback framework with four stages: hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback; evaluates on two synthetic micro-benchmarks, FBD-10 and Circuit-10, against several baselines (such as LLaVA, Qwen2-VL, and YOLOv8) and a confidence-thresholding strategy. Result: Qwen2-VL-7B achieves the highest F1 (FBD: 0.570, Circuit: 0.528) but with very high hallucination rates (0.78/0.98); Sketch2Feedback has somewhat lower F1 but a much lower hallucination rate, and an LLM-as-judge evaluation rates its circuit feedback as more actionable (4.85 vs. 3.11); under hard noise, FBD detection is more robust than circuit detection. Conclusion: Symbolic rule constraints can suppress hallucination in large multimodal models and improve the reliability and interpretability of educational feedback systems; grammar-in-the-loop is a viable way to balance accuracy and trustworthiness. Abstract: Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages -- hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback -- so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) and circuits (0.528), but with extreme hallucination rates (0.78, 0.98). An ensemble oracle that selects the best prediction per sample reaches F1=0.556 with hallucination 0.320 on FBDs, demonstrating exploitable complementarity between grammar and end-to-end approaches. Confidence thresholding at tau=0.7 reduces circuit hallucination from 0.970 to 0.880 with no F1 loss. 
Hard noise augmentation reveals domain-dependent robustness: FBD detection is resilient while circuit detection degrades sharply. An LLM-as-judge evaluation confirms that the grammar pipeline produces more actionable circuit feedback (4.85/5) than the end-to-end LMM (3.11/5). We release all code, datasets, and evaluation scripts.

[59] Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity

Vasile Marian,Yong-Bin Kang,Alexander Buddery

Main category: cs.CV

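TL;DR: This paper systematically evaluates synthetic-image augmentation for YOLOv11 object detection and finds that gains differ markedly across datasets (Traffic Signs, Cityscapes Pedestrian, COCO PottedPlant); it also shows that standard global generative metrics (such as FID) are poor predictors of downstream mAP, and examines additional metrics such as object-centric distribution distances.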

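Details Motivation: Existing global generative quality metrics (such as FID) do not reliably predict how synthetic images will perform in object detection; ways of assessing a synthetic dataset before training that agree better with the downstream task are needed. Method: Runs controlled experiments in three single-class detection regimes over six GAN-, diffusion-, and hybrid-based generators, synthetic augmentation ratios from 10% to 150%, and two training strategies (from scratch and COCO-pretrained fine-tuning); computes dataset metrics before training, namely feature-space metrics in Inception-v3 and DINOv2 embeddings and object-centric distances over bounding-box statistics, and uses residualized analysis to remove the effect of augmentation quantity. Result: Synthetic augmentation yields substantial mAP gains on Pedestrian and PottedPlant (up to +7.6% and +30.6% relative) but only marginal gains on Traffic Signs and under pretrained fine-tuning; many raw metric-performance associations weaken after controlling for the augmentation ratio, showing that metric validity depends strongly on the regime. Conclusion: Synthetic data evaluation should not rely on generic generative metrics alone; it should use object-centric, distribution-aware measures suited to the task and strictly control for the confounding effect of data quantity. Abstract: Synthetic images are increasingly used to augment object-detection training sets, but reliably evaluating a synthetic dataset before training remains difficult: standard global generative metrics (e.g., FID) often do not predict downstream detection mAP. We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes -- Traffic Signs (sparse/near-saturated), Cityscapes Pedestrian (dense/occlusion-heavy), and COCO PottedPlant (multi-instance/high-variability). We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split, and train YOLOv11 both from scratch and with COCO-pretrained initialization, evaluating on held-out real test splits (mAP@0.50:0.95). For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol, including (i) global feature-space metrics in both Inception-v3 and DINOv2 embeddings and (ii) object-centric distribution distances over bounding-box statistics. Synthetic augmentation yields substantial gains in the more challenging regimes (up to +7.6% and +30.6% relative mAP in Pedestrian and PottedPlant, respectively) but is marginal in Traffic Signs and under pretrained fine-tuning. 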
To separate metric signal from augmentation quantity, we report both raw and augmentation-controlled (residualized) correlations with multiple-testing correction, showing that metric-performance alignment is strongly regime-dependent and that many apparent raw associations weaken after controlling for augmentation level.
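An augmentation-controlled (residualized) correlation can be sketched as a partial correlation: regress both the dataset metric and the detection score on the augmentation ratio, then correlate the residuals. The code below assumes a simple linear regression for the control step and uses made-up numbers; the paper's exact residualization and multiple-testing procedure are not given in the abstract.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, z):
    """Residuals of y after ordinary least-squares regression on z."""
    mz, my = mean(z), mean(y)
    slope = sum((a - mz) * (b - my) for a, b in zip(z, y)) / sum((a - mz) ** 2 for a in z)
    return [b - (my + slope * (a - mz)) for a, b in zip(z, y)]


def residualized_correlation(metric, score, ratio):
    """Correlation between metric and score after removing the linear effect of ratio."""
    return pearson(residuals(metric, ratio), residuals(score, ratio))


# Hypothetical values: both the metric and the score mostly track the augmentation ratio.
ratio = [1, 2, 3, 4, 5, 6]
metric = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9]
score = [2.9, 6.1, 9.1, 11.9, 14.9, 18.1]
raw = pearson(metric, score)
controlled = residualized_correlation(metric, score, ratio)
print(raw, controlled)
```

In this toy case the raw correlation is close to 1 only because both quantities grow with the ratio; once the ratio is controlled for, the association largely disappears, which is the pattern the abstract describes for many metrics.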

[60] JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Zhan Liu,Changli Tang,Yuxin Wang,Zhiyuan Zhu,Youjun Chen,Yiwen Shao,Tianzi Wang,Lei Ke,Zengrui Jin,Chao Zhang

Main category: cs.CV

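TL;DR: This paper presents JAEGER, a framework that extends audio-visual large language models to 3D space by fusing RGB-D vision with multi-channel first-order ambisonics audio, and introduces the neural intensity vector (Neural IV) to improve direction-of-arrival estimation; it clearly outperforms 2D baselines on the SpatialSceneQA benchmark.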

Details Motivation: Existing audio-visual LLMs are limited to 2D perception (RGB video plus monaural audio), so they cannot reliably localize sound sources or reason spatially in complex 3D environments; this is a fundamental dimensionality mismatch. Method: Proposes JAEGER, which integrates RGB-D observations with multi-channel first-order ambisonics; designs a learned neural intensity vector (Neural IV) to represent directional information in spatial audio; builds SpatialSceneQA, a 3D spatial question-answering benchmark of 61k samples, for training and evaluation. Result: JAEGER consistently outperforms 2D baselines across diverse spatial perception and reasoning tasks; Neural IV keeps direction estimation robust with overlapping sources and in adverse acoustic conditions. Conclusion: Explicit modeling of 3D spatial information is essential for improving AI perception and reasoning in physical environments, and JAEGER offers a viable path for AV-LLMs toward real 3D interaction. Abstract: Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
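For context on why first-order ambisonics carries directional cues, the sketch below computes the classical hand-crafted intensity vector, not the learned Neural IV from the paper. It assumes a single plane-wave source, ignores channel gain conventions (such as SN3D or FuMa scaling), and uses hypothetical function names.

```python
import math


def encode_foa(signal, azimuth_rad, elevation_rad=0.0):
    """Encode a mono signal as W, X, Y, Z channels for one plane-wave source.

    No gain normalisation convention is applied (an assumption of this sketch).
    """
    cx = math.cos(azimuth_rad) * math.cos(elevation_rad)
    cy = math.sin(azimuth_rad) * math.cos(elevation_rad)
    cz = math.sin(elevation_rad)
    w = list(signal)
    return w, [s * cx for s in signal], [s * cy for s in signal], [s * cz for s in signal]


def intensity_azimuth(w, x, y):
    """Estimate azimuth from the time-averaged products W*X and W*Y."""
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    return math.atan2(iy, ix)


# Hypothetical test tone arriving from 60 degrees azimuth.
signal = [math.sin(0.3 * n) for n in range(200)]
w, x, y, z = encode_foa(signal, math.radians(60))
estimate_deg = math.degrees(intensity_azimuth(w, x, y))
print(estimate_deg)
```

With one clean source the estimate recovers the true azimuth; with overlapping sources the averaged products mix several directions, which is the kind of adverse condition a learned representation is meant to handle.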

[61] Image-Based Classification of Olive Varieties Native to Turkiye Using Multiple Deep Learning Architectures: Analysis of Performance, Complexity, and Generalization

Hatice Karatas,Irfan Atabas

Main category: cs.CV

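TL;DR: This study compares ten deep learning architectures for image classification of five black olive varieties from Turkey; EfficientNetV2-S reaches the highest accuracy (95.8%), while EfficientNetB0 offers the best balance between accuracy and computational complexity; the results indicate that under limited data, parameter efficiency matters more than simply adding model depth.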

Details Motivation: Addresses the lack of automated image recognition methods for five black olive varieties cultivated locally in Turkey (Gemlik, Ayvalik, Uslu, Erkence, Celebi) and explores efficient deep learning models suited to small datasets. Method: Uses transfer learning to train and evaluate ten models on a dataset of 2500 images: MobileNetV2, EfficientNetB0, EfficientNetV2-S, ResNet50, ResNet101, DenseNet121, InceptionV3, ConvNeXt-Tiny, ViT-B16, and Swin-T; metrics cover accuracy, precision, recall, F1-score, MCC, Cohen's Kappa, ROC-AUC, parameter count, FLOPs, inference time, and generalization gap. Result: EfficientNetV2-S achieved the highest classification accuracy (95.8%); EfficientNetB0 gave the best trade-off between accuracy and computational cost (parameters, FLOPs, inference time); parameter efficiency affected performance on small data more than model depth. Conclusion: With limited image data, parameter-efficient, lightweight models (such as the EfficientNet family) balance accuracy and practicality better than deeper or larger architectures; this finding can guide resource-constrained visual recognition tasks in agriculture. Abstract: This study compares multiple deep learning architectures for the automated, image-based classification of five locally cultivated black table olive varieties in Turkey: Gemlik, Ayvalik, Uslu, Erkence, and Celebi. Using a dataset of 2500 images, ten architectures - MobileNetV2, EfficientNetB0, EfficientNetV2-S, ResNet50, ResNet101, DenseNet121, InceptionV3, ConvNeXt-Tiny, ViT-B16, and Swin-T - were trained using transfer learning. Model performance was evaluated using accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), Cohen's Kappa, ROC-AUC, number of parameters, FLOPs, inference time, and generalization gap. EfficientNetV2-S achieved the highest classification accuracy (95.8%), while EfficientNetB0 provided the best trade-off between accuracy and computational complexity. Overall, the results indicate that under limited data conditions, parameter efficiency plays a more critical role than model depth alone.
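Of the metrics listed, Cohen's Kappa is easy to misread because it corrects accuracy for agreement expected by chance. The sketch below computes it from a confusion matrix using the standard definition, κ = (p_o − p_e) / (1 − p_e). The two-class matrix is a made-up example for brevity; the study itself has five classes, and the function name is hypothetical.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical confusion matrix for two classes with 50 samples each.
confusion = [
    [45, 5],
    [10, 40],
]
kappa = cohens_kappa(confusion)
print(kappa)
```

Here the raw accuracy is 0.85 but chance agreement is 0.5, so kappa comes out lower than accuracy; the same function works unchanged for a 5×5 matrix.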

[62] VLANeXt: Recipes for Building Strong VLA Models

Xiao-Ming Wu,Bin Fan,Kang Liao,Jian-Jian Jiang,Runze Yang,Yihang Luo,Zhonghua Wu,Wei-Shi Zheng,Chen Change Loy

Main category: cs.CV

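TL;DR: This paper proposes a unified framework for systematically analyzing the design space of vision-language-action (VLA) models along three dimensions (foundational components, perception essentials, and action modeling), distills 12 key design findings, and builds on them a simple, effective model, VLANeXt, which outperforms prior state of the art on the LIBERO benchmarks and generalizes well in real-world experiments.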

Details Motivation: The VLA field lacks consistent training protocols and evaluation standards, which makes design choices hard to compare; a systematic, structured analysis is needed to identify the design principles that actually matter. Method: Under a unified framework and evaluation setup, starts from a simple VLA baseline similar to RT-2 and OpenVLA and systematically ablates and analyzes design choices along three dimensions: foundational components, perception essentials, and action modeling. Result: Distills 12 key design findings and builds VLANeXt, which outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and generalizes strongly on real robot tasks; a unified, easy-to-use codebase will be released. Conclusion: Gains in VLA performance come less from complex architectures than from a handful of key, reproducible design choices; VLANeXt demonstrates the effectiveness of a simple design and gives the community a reproducible, extensible basis for VLA development. Abstract: Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.

[63] Morphological Addressing of Identity Basins in Text-to-Image Diffusion Models

Andrew Fraser

Main category: cs.CV

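TL;DR: This paper shows that morphological pressure creates navigable gradients in text-to-image generative models: morphological structure at the level of feature descriptors and of prompt wording (such as sound-symbolic associations) can systematically steer a diffusion model's latent space, generating a specific identity without the target's name or images, and phonestheme structure alone can give rise to novel, consistent visual concepts.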

Details Motivation: Investigates whether morphological structure (feature descriptors and phonestheme patterns) can systematically guide text-to-image models toward specific visual identities without the target's name or real images, and how this navigation works in latent space. Method: Study 1 uses Stable Diffusion 1.5 to navigate identity basins with morphological descriptors (such as "platinum blonde") and builds a self-distillation loop (generate images from descriptors, then train a LoRA on them); Study 2 constructs 200 novel nonsense words based on phonestheme theory and compares the visual consistency of their outputs against random controls. Result: In Study 1 the LoRA converges consistently toward a specific identity (measured by ArcFace similarity) and produces "uncanny valley" outputs under away-conditioning; in Study 2 phonestheme-bearing words yield significantly higher visual consistency (Purity@1 = 0.371 vs. 0.209, p < 0.00001), and three novel words (such as snudgeoid) reach 100% consistency with no training data contamination. Conclusion: Morphological structure, whether feature-level descriptors or prompt-level phonological form, creates systematic, navigable gradients in diffusion model latent spaces, supporting zero-shot identity generation and novel visual concepts emerging from sub-lexical sound patterns. Abstract: We demonstrate that morphological pressure creates navigable gradients at multiple levels of the text-to-image generative pipeline. In Study 1, identity basins in Stable Diffusion 1.5 can be navigated using morphological descriptors -- constituent features like "platinum blonde," "beauty mark," and "1950s glamour" -- without the target's name or photographs. A self-distillation loop (generating synthetic images from descriptor prompts, then training a LoRA on those outputs) achieves consistent convergence toward a specific identity as measured by ArcFace similarity. The trained LoRA creates a local coordinate system shaping not only the target identity but also its inverse: maximal away-conditioning produces "eldritch" structural breakdown in base SD1.5, while the LoRA-equipped model produces "uncanny valley" outputs -- coherent but precisely wrong. In Study 2, we extend this to prompt-level morphology. Drawing on phonestheme theory, we generate 200 novel nonsense words from English sound-symbolic clusters (e.g., cr-, sn-, -oid, -ax) and find that phonestheme-bearing candidates produce significantly more visually coherent outputs than random controls (mean Purity@1 = 0.371 vs. 0.209, p < 0.00001, Cohen's d = 0.55). 
Three candidates -- snudgeoid, crashax, and broomix -- achieve perfect visual consistency (Purity@1 = 1.0) with zero training data contamination, each generating a distinct, coherent visual identity from phonesthetic structure alone. Together, these studies establish that morphological structure -- whether in feature descriptors or prompt-level phonological form -- creates systematic navigational gradients through diffusion model latent spaces. We document phase transitions in identity basins, CFG-invariant identity stability, and novel visual concepts emerging from sub-lexical sound patterns.
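The effect size quoted for Study 2, Cohen's d, is the difference between two group means divided by their pooled standard deviation. The sketch below applies that standard definition to made-up scores; it does not reproduce the paper's Purity@1 computation, whose definition is not given in the abstract.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation of two groups."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Hypothetical per-word consistency scores for two small groups.
phonestheme_scores = [0.4, 0.5, 0.6]
control_scores = [0.2, 0.3, 0.4]
d = cohens_d(phonestheme_scores, control_scores)
print(d)
```

By the usual rule of thumb, values around 0.5, such as the 0.55 reported in the abstract, are read as a medium effect.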

[64] Rodent-Bench

Thomas Heap,Laurence Aitchison,Emma Cahill,Adriana Casado Rodriguez

Main category: cs.CV

TL;DR: Rodent-Bench is a new benchmark for evaluating MLLMs on rodent behavior video annotation; current top models (e.g., Gemini-2.5-Pro, Qwen-VL-Max) perform poorly—especially in temporal segmentation and subtle behavior discrimination—highlighting key limitations and guiding future development.

Details Motivation: To rigorously assess and expose the limitations of current Multimodal Large Language Models (MLLMs) in scientifically accurate, fine-grained annotation of long, complex rodent behavioral videos, a critical need in neuroscience. Method: Constructed Rodent-Bench: a diverse, standardized benchmark with multi-paradigm rodent behavior videos (10–35 min), two benchmark versions, and evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and Matthews correlation coefficient; evaluated Gemini-2.5-Pro, Gemini-2.5-Flash, and Qwen-VL-Max. Result: All tested state-of-the-art MLLMs underperformed, and none reached usable accuracy for real-world assistance; there was modest success only on grooming detection, with major failures in temporal segmentation, long-video handling, and distinguishing subtle behaviors (e.g., freezing vs. immobility). Conclusion: Current MLLMs are insufficient for reliable automated rodent behavior annotation; Rodent-Bench establishes a foundational, standardized evaluation framework to drive targeted improvements in temporal reasoning, long-context multimodal understanding, and fine-grained behavioral discrimination for neuroscience applications. Abstract: We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 minutes to 35 minutes in length. 
We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation and provides insights for future model development. Rodent-Bench serves as a foundation for tracking progress toward reliable automated behavioral annotation in neuroscience research.
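Two of the listed metrics can be sketched directly once annotations are expressed as one behaviour label per second of video. The code below assumes that representation and the usual definitions of accuracy and macro-averaged F1; the label names and sequences are made up, and the benchmark's own scoring code may differ in detail.

```python
def second_wise_accuracy(truth, pred):
    """Fraction of seconds whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def macro_f1(truth, pred):
    """Unweighted mean of per-class F1 over every label seen in truth or pred."""
    scores = []
    for label in sorted(set(truth) | set(pred)):
        tp = sum(t == label and p == label for t, p in zip(truth, pred))
        fp = sum(t != label and p == label for t, p in zip(truth, pred))
        fn = sum(t == label and p != label for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical six-second clip with one label per second.
truth = ["groom", "groom", "freeze", "freeze", "other", "other"]
pred = ["groom", "groom", "other", "freeze", "other", "other"]
accuracy = second_wise_accuracy(truth, pred)
f1 = macro_f1(truth, pred)
print(accuracy, f1)
```

Macro F1 weights rare behaviours as heavily as common ones, so it penalises a model that misses short freezing bouts even when its second-wise accuracy looks high.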

[65] BloomNet: Exploring Single vs. Multiple Object Annotation for Flower Recognition Using YOLO Variants

Safwat Nusrat,Prithwiraj Bhattacharjee

Main category: cs.CV

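TL;DR: This paper benchmarks several YOLO architectures (YOLOv5s, YOLOv8n/s/m, YOLOv12n) on flower detection, introduces the FloralSix dataset, and compares detection under single-box and multiple-box annotation; YOLOv8m performs best in sparse scenes and YOLOv12n in dense scenes, and the SGD optimizer is better overall.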

Details Motivation: Precise flower localization and recognition matter for automated agriculture (plant phenotyping, crop estimation, yield monitoring), but the robustness of existing methods across sparse and dense scenes is unclear and no benchmark dataset is tailored to flower detection. Method: On the self-built FloralSix dataset (2,816 high-resolution images, six flower species, sparse and dense annotation modes), systematically evaluates YOLOv5s, YOLOv8n/s/m, and YOLOv12n using Precision, Recall, mAP@0.5, and mAP@0.5:0.95, and compares SGD with other optimizers. Result: YOLOv8m (SGD) reaches Precision 0.956, Recall 0.951, mAP@0.5 0.978, and mAP@0.5:0.95 0.865 in the SISBB (sparse) setting; YOLOv12n (SGD) reaches mAP@0.5 0.934 and mAP@0.5:0.95 0.752 in the SIMBB (dense) setting; SGD consistently outperforms the other optimizers; model size, IoU threshold, and annotation density interact. Conclusion: Flower detection performance depends strongly on scene density and model design: recall-oriented models suit dense scenes and precision-oriented models suit sparse scenes; SGD is the better optimizer choice; the work offers deployable, density-sensitive detection for non-destructive crop analysis, growth tracking, robotic pollination, and stress evaluation. Abstract: Precise localization and recognition of flowers are crucial for advancing automated agriculture, particularly in plant phenotyping, crop estimation, and yield monitoring. This paper benchmarks several YOLO architectures such as YOLOv5s, YOLOv8n/s/m, and YOLOv12n for flower object detection under two annotation regimes: single-image single-bounding box (SISBB) and single-image multiple-bounding box (SIMBB). The FloralSix dataset, comprising 2,816 high-resolution photos of six different flower species, is also introduced. It is annotated for both dense (clustered) and sparse (isolated) scenarios. The models were evaluated using Precision, Recall, and Mean Average Precision (mAP) at IoU thresholds of 0.5 (mAP@0.5) and 0.5-0.95 (mAP@0.5:0.95). In SISBB, YOLOv8m (SGD) achieved the best results with Precision 0.956, Recall 0.951, mAP@0.5 0.978, and mAP@0.5:0.95 0.865, illustrating strong accuracy in detecting isolated flowers. With mAP@0.5 0.934 and mAP@0.5:0.95 0.752, YOLOv12n (SGD) performed best in the more complicated SIMBB scenario, proving robustness in dense, multi-object detection. Results show how annotation density, IoU thresholds, and model size interact: recall-optimized models perform better in crowded environments, whereas precision-oriented models perform best in sparse scenarios. 
In both cases, the Stochastic Gradient Descent (SGD) optimizer consistently performed better than alternatives. These density-sensitive sensors are helpful for non-destructive crop analysis, growth tracking, robotic pollination, and stress evaluation.
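
The two mAP variants reported above differ only in the IoU threshold at which a predicted box counts as a match. The following is a minimal sketch of that idea, not the paper's evaluation code: the box coordinates are hypothetical, and the 0.05 threshold step follows the common COCO-style convention for mAP@0.5:0.95, which the abstract does not spell out.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def map_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (assumed COCO-style averaging grid)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which this prediction would count as a match for gt_box."""
    score = iou(pred_box, gt_box)
    return [t for t in map_thresholds() if score >= t]


# Hypothetical boxes: a loosely fitting prediction for one flower.
gt = (0, 0, 10, 10)
pred = (1, 1, 11, 11)
print(iou(pred, gt))                # overlap quality of this single prediction
print(thresholds_passed(pred, gt))  # stricter thresholds reject looser boxes
```

A loosely localized box can still be a match at IoU 0.5 while failing the stricter thresholds, which is one reason mAP@0.5:0.95 is lower than mAP@0.5 in both annotation regimes.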

[66] Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification

Massoud Dehghan,Ramona Woitek,Amirreza Mahbod

Main category: cs.CV

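TL;DR: This paper systematically evaluates how patch size affects Vision Transformer (ViT) classification performance on 12 medical imaging datasets (seven 2D, five 3D). Smaller patch sizes (1, 2, 4) markedly improve performance, with gains of up to 23.78% on 3D data, and a simple ensemble strategy improves results further.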

Details Motivation: Although ViTs are widely used in computer vision, the effect of patch size, a key initial design parameter, has not been studied thoroughly in medical imaging, where both 2D and 3D modalities exist. Method: On 12 medical imaging datasets (seven 2D, five 3D), ViT models are fine-tuned on a single GPU and classification performance is compared systematically for patch sizes 1, 2, 4, 7, 14, and 28; a simple ensemble strategy fuses the predictions of the models trained with patch sizes 1, 2, and 4. Result: Small patch sizes (1, 2, 4) achieve the best performance on nearly all datasets: on 2D data, patch size 2 improves balanced accuracy over patch size 28 by up to 12.78%; on 3D data, patch size 1 improves over patch size 14 by up to 23.78%; the ensemble strategy further improves performance on most datasets, especially the 2D ones. Conclusion: Patch size is a critical hyperparameter for ViTs on medical imaging tasks; smaller patch sizes usually bring substantial gains, at higher computational cost, should be the preferred configuration for medical ViT modeling, and can be improved further with a lightweight ensemble. Abstract: Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on architectural improvements, the impact of patch size, a crucial initial design choice in ViTs, remains underexplored, particularly in medical domains where both two-dimensional (2D) and three-dimensional (3D) imaging modalities exist. In this study, using 12 medical imaging datasets from various imaging modalities (including seven 2D and five 3D datasets), we conduct a thorough evaluation of how different patch sizes affect ViT classification performance. Using a single graphics processing unit (GPU) and a range of patch sizes (1, 2, 4, 7, 14, 28), we fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes (1, 2, and 4), which achieve the best results across nearly all datasets. More specifically, our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets (patch size 2 vs. 28) and up to 23.78% for 3D datasets (patch size 1 vs. 14), at the cost of increased computational expense. Moreover, by applying a straightforward ensemble strategy that fuses the predictions of the models trained with patch sizes 1, 2, and 4, we demonstrate a further boost in performance in most cases, especially for the 2D datasets. Our implementation is publicly available on GitHub: https://github.com/HealMaDe/MedViT
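
The "increased computational expense" of small patches comes from the token count, which grows quickly as the patch shrinks. The sketch below illustrates this under stated assumptions: the input side length of 28 is hypothetical (chosen only because every listed patch size divides it), and averaging class probabilities is one plausible reading of "fuses the predictions", not a rule confirmed by the abstract.

```python
def num_tokens(side, patch, dims=2):
    """Number of non-overlapping patches for a cubic/square input of given side length."""
    if side % patch != 0:
        raise ValueError("patch size must divide the input side length")
    return (side // patch) ** dims


def ensemble_average(prob_lists):
    """Fuse per-model class probabilities by simple averaging (assumed fusion rule)."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]


# Hypothetical 28-pixel inputs: token counts for each patch size in 2D and 3D.
for p in (1, 2, 4, 7, 14, 28):
    print(p, num_tokens(28, p, dims=2), num_tokens(28, p, dims=3))

# Hypothetical two-class probabilities from models trained with patch sizes 1, 2, and 4.
fused = ensemble_average([[0.6, 0.4], [0.2, 0.8], [0.4, 0.6]])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

Because self-attention cost grows roughly with the square of the token count, halving the patch size is far more expensive in 3D than in 2D, which is consistent with the trade-off the abstract mentions.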

[67] Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

Aashish Chandra,Aashutosh A,Abhijit Das

Main category: cs.CV

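TL;DR: This paper proposes a new method for generating realistic talking faces from a static image, a voice profile, and a target text. A multi-entangled latent space fuses speech and video features, which are then decoded separately into audio and video.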

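Details Motivation: Existing methods struggle to generate naturally coordinated speech and facial motion simultaneously from a single static image and a voice profile, and lack cross-modal modeling of spatiotemporal person-specific features. Method: A multi-entangled latent space jointly encodes the target text, the driving image, and the individual's voice profile to produce key-value pairs and queries for audio and video generation; this space models the spatiotemporal person-specific associations between speech and video, and its features are passed to the decoder of each modality to generate the outputs. Result: Starting from a single image, a voice profile, and text, the method generates high-fidelity, temporally consistent speech and talking-face video together. Conclusion: The multi-entangled latent space effectively models cross-modal, person-specific spatiotemporal relationships and offers a new paradigm for speech-driven talking-face generation. Abstract: We present a novel approach for generating realistic speaking and talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual, then combines them and passes them to the multi-entangled latent space, which produces key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space is responsible for establishing the spatiotemporal person-specific features between the modalities. The entangled features are then passed to the respective decoder of each modality to generate the output audio and video.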

[68] Deep LoRA-Unfolding Networks for Image Restoration

Xiangming Wang,Haijin Zeng,Benteng Sun,Jiezhang Cao,Kai Zhang,Qiangqiang Shen,Yongyong Chen

Main category: cs.CV

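TL;DR: This paper proposes LoRun, which adds lightweight, stage-specific LoRA adapters to deep unfolding networks (DUNs) to adjust the denoising strength of each unfolding stage dynamically while sharing a single pretrained base denoiser, substantially reducing parameter count and memory use while matching or improving image restoration performance.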

Details Motivation: Existing deep unfolding networks (DUNs) have two key problems: the proximal mapping module (PMM) at each stage has a fixed architecture and denoising objective and cannot adapt to the noise level that changes across stages; and the repeated structure causes parameter redundancy and high memory use, which hinders deployment in large-scale or resource-constrained settings. Method: Proposes LoRun, a generalized low-rank adaptation (LoRA) deep unfolding network: a single pretrained base denoiser is shared as the PMM backbone of every stage, and each stage only inserts lightweight LoRA adapters for stage-specific adaptation to the noise level; the GDM and PMM still form the basic unfolding block, but the PMM parameters are heavily compressed. Result: LoRun is validated on three image restoration tasks (spectral imaging reconstruction, compressive sensing, and super-resolution): compared with conventional DUNs it achieves up to N-fold parameter reduction (N being the number of unfolding stages) and markedly lower memory use, with on-par or better performance. Conclusion: By decoupling general restoration capability from stage-specific adaptation, LoRun offers a new paradigm for efficient, scalable deep unfolding networks, keeping performance high while greatly improving model efficiency and deployability. Abstract: Deep unfolding networks (DUNs), combining conventional iterative optimization algorithms and deep neural networks into a multi-stage framework, have achieved remarkable accomplishments in Image Restoration (IR), such as spectral imaging reconstruction, compressive sensing and super-resolution. A DUN unfolds the iterative optimization steps into a stack of sequentially linked blocks. Each block consists of a Gradient Descent Module (GDM) and a Proximal Mapping Module (PMM), which is equivalent to a denoiser from a Bayesian perspective, operating on Gaussian noise with a known level. However, existing DUNs suffer from two critical limitations: (i) their PMMs share identical architectures and denoising objectives across stages, ignoring the need for stage-specific adaptation to varying noise levels; and (ii) their chain of structurally repetitive blocks results in severe parameter redundancy and high memory consumption, hindering deployment in large-scale or resource-constrained scenarios. To address these challenges, we introduce generalized Deep Low-rank Adaptation (LoRA) Unfolding Networks for image restoration, named LoRun, harmonizing denoising objectives and adapting different denoising levels between stages with compressed memory usage for a more efficient DUN. LoRun introduces a novel paradigm where a single pretrained base denoiser is shared across all stages, while lightweight, stage-specific LoRA adapters are injected into the PMMs to dynamically modulate denoising behavior according to the noise level at each unfolding step. This design decouples the core restoration capability from task-specific adaptation, enabling precise control over denoising intensity without duplicating full network parameters and achieving up to $N$ times parameter reduction for an $N$-stage DUN with on-par or better performance. Extensive experiments conducted on three IR tasks validate the efficiency of our method.
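
The "up to $N$ times" figure follows from simple counting: $N$ separate denoisers cost $N$ copies of the base parameters, while one shared denoiser plus $N$ small adapters costs a single copy plus the adapter overhead. The sketch below illustrates this with the standard LoRA update (a frozen weight plus a low-rank product), assuming that form carries over; LoRun's "generalized" variant may differ, and all layer sizes, ranks, and parameter counts here are hypothetical.

```python
def lora_forward(weight, a, b, x, scale=1.0):
    """Standard LoRA layer: y = W x + scale * B (A x), with W frozen and A, B trainable."""
    def matvec(m, v):
        return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]
    base = matvec(weight, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


def lora_params(d_in, d_out, rank):
    """Trainable parameters of one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def reduction_factor(base_params, adapter_params, stages):
    """Parameters of N separate denoisers divided by one shared denoiser plus N adapters."""
    unshared = stages * base_params
    shared = base_params + stages * adapter_params
    return unshared / shared


# Hypothetical sizes: a 1M-parameter denoiser, 9 stages, ten rank-4 adapters per stage.
per_stage_adapter = 10 * lora_params(512, 512, rank=4)
print(per_stage_adapter)
print(reduction_factor(1_000_000, per_stage_adapter, stages=9))

# Tiny forward pass: identity base weight plus a rank-1 stage-specific update.
print(lora_forward([[1, 0], [0, 1]], [[1, 1]], [[1], [2]], [3, 4]))
```

The reduction factor approaches the number of stages as the adapters shrink relative to the base denoiser, which is why the abstract states the gain as an upper bound.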

[69] Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding

Houlun Chen,Xin Wang,Guangyao Li,Yuwei Zhou,Yihan Chen,Jia Jia,Wenwu Zhu

Main category: cs.CV

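TL;DR: This paper proposes Video-TwG, a framework whose "Think-with-Grounding" paradigm lets video LLMs locate key video clips on demand during multimodal reasoning. Combined with a two-stage reinforced curriculum and the TwG-GRPO algorithm, it improves long video understanding without complex modules or heavy annotation.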

Details Motivation: Existing long video understanding methods based on text-only reasoning are limited by a fixed, short video context, so they tend to miss key details and hallucinate more; the temporal redundancy of long videos makes this worse. Method: Proposes the Video-TwG framework, built around the "Think-with-Grounding" paradigm; designs a two-stage reinforced curriculum strategy (first learning grounding behavior on a small annotated short-video dataset, then transferring to diverse long-video QA tasks); proposes the TwG-GRPO reinforcement learning algorithm, with a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism; and builds the new TwG-51K dataset. Result: Outperforms mainstream long video understanding baselines on the Video-MME, LongVideoBench, and MLVU benchmarks; ablations show that the two-stage curriculum and TwG-GRPO are essential for improving grounding quality and reducing redundant grounding while preserving QA performance. Conclusion: By strengthening reasoning with dynamic, on-demand visual grounding, Video-TwG alleviates the context limits and hallucination problems of long video understanding and offers a more robust, scalable multimodal reasoning paradigm for video LLMs. Abstract: Long video understanding is challenging due to rich and complicated multimodal clues over a long temporal range. Current methods adopt text-form reasoning to improve the model's ability to analyze complex clues in long videos. However, text-only reasoning under a fixed video context may exacerbate hallucinations, since crucial details are often ignored under a limited video context length due to the temporal redundancy of long videos. To address this gap, we propose Video-TwG, a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels, and then scales to diverse general QA data with videos of diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data, we propose the TwG-GRPO algorithm, which features a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism. Finally, we propose to construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines. Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows that our TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.
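
To make the reward terms above more concrete, here is a rough sketch of how a grounding-aware reward could feed a GRPO-style update. Everything specific is an assumption: the abstract does not define the rewards, so temporal IoU stands in for the grounding reward, `gated_reward` is a hypothetical reading of "accuracy-gated" (grounding credit only when the answer is correct), and the advantage uses the usual group-relative normalization with a population standard deviation.

```python
import statistics


def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def gated_reward(answer_correct, grounding_score, weight=0.5):
    """Hypothetical accuracy gate: grounding only earns reward alongside a correct answer."""
    if not answer_correct:
        return 0.0
    return 1.0 + weight * grounding_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward normalized by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled responses to one question.
gt_clip = (30.0, 40.0)
samples = [(True, (32.0, 40.0)), (True, (0.0, 10.0)), (False, (30.0, 40.0)), (False, (5.0, 9.0))]
rewards = [gated_reward(ok, temporal_iou(clip, gt_clip)) for ok, clip in samples]
print(rewards)
print(group_relative_advantages(rewards))
```

Under this gating, a well-localized clip attached to a wrong answer earns nothing, which is one way to discourage grounding that does not help the final answer.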

[70] IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping

Tingyang Xiao,Liu Liu,Wei Feng,Zhengyu Zou,Xiaolin Zhou,Wei Sui,Hao Li,Dingwen Zhang,Zhizhong Su

Main category: cs.CV

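TL;DR: IRIS-SLAM is a new RGB semantic SLAM system that extends a geometry foundation model to jointly predict dense geometry and cross-view consistent instance embeddings, enabling semantic-synergized data association and instance-guided loop closure detection and markedly improving map consistency and wide-baseline loop closure reliability.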

Details Motivation: Existing dense geometric SLAM lacks deep semantic understanding and robust loop closure; current semantic mapping methods are often limited by decoupled architectures and fragile data association. Method: Proposes IRIS-SLAM, which builds on an instance-extended geometry foundation model to produce unified geometric-instance representations, supports semantic-synergized association and instance-guided loop closure detection, and uses viewpoint-agnostic semantic anchors to connect geometric reconstruction with open-vocabulary semantic mapping. Result: Experiments show that IRIS-SLAM significantly outperforms state-of-the-art methods in map consistency and wide-baseline loop closure reliability. Conclusion: Unified geometric-instance representations can bridge the gap between geometric SLAM and open-vocabulary semantic mapping, improving overall robustness and semantic awareness. Abstract: Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.

[71] HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing

Ahmed Akl,Abdelwahed Khamis,Ali Cheraghian,Zhe Wang,Sara Khalifa,Kewen Wang

Main category: cs.CV

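TL;DR: This paper proposes HIME, a training-free model editing method that uses a layer-wise sensitivity analysis (HIS) to identify and selectively suppress the decoder layers in LVLMs that are prone to object hallucination, substantially reducing hallucination rates without harming existing knowledge.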

Details Motivation: Large vision-language models (LVLMs) suffer from object hallucination, which undermines reliability in real deployments; existing fine-tuning methods are computationally expensive, so efficient training-free mitigation is needed. Method: Proposes the Hallucination Insensitivity Score (HIS) to quantify each decoder layer's sensitivity to hallucination and, building on it, the layer-adaptive weight editing method HIME, which performs parameter-free, low-overhead feature intervention on Qwen, LLaMA, and Vicuna backbones. Result: HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, latency, or computational overhead. Conclusion: Hallucination sensitivity differs markedly between layers; HIS can guide precise, lightweight model editing, and HIME offers an effective, general paradigm for training-free hallucination suppression. Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities, yet they remain prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information, raising serious concerns for reliable real-world deployment. While fine-tuning is a commonly adopted mitigation strategy, its high computational cost and practical difficulty motivate the need for training-free alternatives, among which model editing has recently emerged as a promising direction. However, indiscriminate editing risks disrupting the rich implicit knowledge encoded in pre-trained LVLMs, leading to a fundamental question: how much intervention is necessary at each layer to suppress hallucinations while preserving pre-trained knowledge? To address this question, we present a systematic analysis of LVLM decoders built on three widely used large language model backbones (Qwen, LLaMA, and Vicuna), revealing clear layer-wise differences in susceptibility to object hallucination. Building on these insights, we introduce the Hallucination Insensitivity Score (HIS), a principled metric that quantifies each layer's sensitivity to hallucination and provides guidance for targeted intervention. Leveraging HIS, we propose Hallucination Insensitivity Model Editing (HIME), a simple yet effective layer-adaptive weight editing approach that selectively modifies latent features to suppress hallucinations while preserving pre-trained knowledge. Extensive experiments demonstrate that HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, inference-time latency, or computational overhead.

[72] NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures

Yufan Wang,Sokratis Makrogiannis,Chandra Kambhamettu

Main category: cs.CV

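TL;DR: This paper proposes NeXt2Former-CD, an end-to-end remote sensing change detection framework that combines ConvNeXt, deformable attention, and Mask2Former. It outperforms Mamba-based and other SSM baselines on several datasets while balancing accuracy and inference efficiency.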

Details Motivation: State space models (SSMs) scale well but are limited in handling residual co-registration noise, small object-level shifts, and bi-temporal semantic ambiguity in remote sensing imagery; more robust convolutional and attention-based alternatives are worth exploring. Method: Proposes NeXt2Former-CD: a Siamese ConvNeXt encoder initialized with DINOv3 weights extracts bi-temporal features; a deformable attention mechanism performs temporal feature fusion; a Mask2Former decoder produces fine-grained change segmentation. Result: On the LEVIR-CD, WHU-CD, and CDD datasets, both F1 score and IoU exceed recent Mamba-based baselines; despite a larger parameter count, inference latency is comparable to SSM-based methods. Conclusion: Modern convolutional and attention-based architectures are a strong alternative to SSMs for remote sensing change detection, offering a better balance of accuracy and practicality. Abstract: State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.

[73] Subtle Motion Blur Detection and Segmentation from Static Image Artworks

Ganesh Samarth,Sibendu Paul,Solale Tabarestani,Caren Chen

Main category: cs.CV

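TL;DR: This paper proposes SMBlurDetect, a framework for detecting subtle motion blur in static images. It pairs high-quality dataset generation with an end-to-end detector and substantially improves accuracy and segmentation performance in zero-shot generalization.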

Details Motivation: Existing motion blur detection methods and datasets focus on severe blur and lack the fine-grained pixel-level annotations needed for quality-sensitive settings such as streaming thumbnails; mainstream benchmarks such as GoPro and NFS are dominated by strong synthetic blur and contain residual blur in their sharp reference images, which leads to ambiguous supervision. Method: Builds the unified SMBlurDetect framework: 1) using SAM-segmented regions with controllable camera/object motion simulation, alpha-aware compositing, and balanced sampling, it synthesizes subtle, localized motion-blur images with precise masks from ultra-high-resolution aesthetic images; 2) it trains a U-Net with ImageNet-pretrained encoders using a hybrid mask- and image-centric strategy, with curriculum learning, hard negatives, focal loss, blur frequency channels, and resolution-aware augmentation. Result: Zero-shot accuracy reaches 89.68% on GoPro (baseline 66.50%) and mean IoU reaches 59.77% on CUHK (baseline 9.00%), a 6.6x improvement in segmentation; qualitative results show precise localization of subtle blur regions, supporting automatic filtering of low-quality frames and ROI extraction for intelligent cropping. Conclusion: SMBlurDetect addresses subtle motion blur detection in static images, an overlooked but important quality problem; its high-quality synthetic data and robust detector design offer a practical, scalable path to quality assurance for streaming visual assets. Abstract: Streaming services serve hundreds of millions of viewers worldwide, where visual assets such as thumbnails, box art, and cover images are critical for engagement. Subtle motion blur remains a pervasive quality issue, reducing visual clarity and negatively affecting user trust and click-through rates. However, motion blur detection from static images is underexplored, as existing methods and datasets focus on severe blur and lack fine-grained pixel-level annotations needed for quality-critical applications. Benchmarks such as GoPro and NFS are dominated by strong synthetic blur and often contain residual blur in their sharp references, leading to ambiguous supervision. We propose SMBlurDetect, a unified framework combining high-quality motion blur specific dataset generation with an end-to-end detector capable of zero-shot detection at multiple granularities. Our pipeline synthesizes realistic motion blur from super high resolution aesthetic images using controllable camera and object motion simulations over SAM segmented regions, enhanced with alpha-aware compositing and balanced sampling to generate subtle, spatially localized blur with precise ground truth masks. We train a U-Net based detector with ImageNet pretrained encoders using a hybrid mask and image centric strategy incorporating curriculum learning, hard negatives, focal loss, blur frequency channels, and resolution aware augmentation. Our method achieves strong zero-shot generalization, reaching 89.68% accuracy on GoPro (vs 66.50% baseline) and 59.77% Mean IoU on CUHK (vs 9.00% baseline), demonstrating 6.6x improvement in segmentation. Qualitative results show accurate localization of subtle blur artifacts, enabling automated filtering of low quality frames and precise region of interest extraction for intelligent cropping.

[74] WiCompass: Oracle-driven Data Scaling for mmWave Human Pose Estimation

Bo Liang,Chen Gong,Haobo Wang,Qirui Liu,Rungui Zhou,Fengzhi Shao,Yubo Wang,Wei Gao,Kaichen Zhou,Guolong Cui,Chenren Xu

Main category: cs.CV

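TL;DR: This paper proposes WiCompass, a framework that builds a universal pose space "oracle" to guide data collection for mmWave human pose estimation, improving out-of-distribution (OOD) robustness and emphasizing that coverage quality matters more than simply adding data.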

Details Motivation: Millimeter-wave human pose estimation (mmWave HPE) has privacy advantages but generalizes poorly under distribution shift; simply enlarging the dataset does not effectively improve OOD robustness, and the real bottlenecks are data efficiency and coverage. Method: Proposes WiCompass, a coverage-aware data-collection framework: large-scale motion-capture corpora are used to build a universal pose space "oracle" that quantifies dataset redundancy and identifies uncovered motions; guided by this oracle, a closed-loop policy prioritizes collecting informative missing samples. Result: Experiments show that WiCompass consistently improves OOD accuracy at the same data budget and scales markedly better than conventional collection strategies. Conclusion: Moving from blind data scaling to coverage-driven data acquisition offers a practical path toward robust mmWave sensing systems. Abstract: Millimeter-wave Human Pose Estimation (mmWave HPE) promises privacy but suffers from poor generalization under distribution shifts. We demonstrate that brute-force data scaling is ineffective for out-of-distribution (OOD) robustness; efficiency and coverage are the true bottlenecks. To address this, we introduce WiCompass, a coverage-aware data-collection framework. WiCompass leverages large-scale motion-capture corpora to build a universal pose space "oracle" that quantifies dataset redundancy and identifies underrepresented motions. Guided by this oracle, WiCompass employs a closed-loop policy to prioritize collecting informative missing samples. Experiments show that WiCompass consistently improves OOD accuracy at matched budgets and exhibits superior scaling behavior compared to conventional collection strategies. By shifting focus from brute-force scaling to coverage-aware data acquisition, this work offers a practical path toward robust mmWave sensing.

[75] MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Sagarika Banerjee,Tangatar Madi,Advait Swaminathan,Nguyen Dao Minh Anh,Shivank Garg,Kevin Zhu,Vasu Sharma

Main category: cs.CV

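TL;DR: This paper presents the MiSCHiEF benchmark, comprising two fine-grained image-text contrastive datasets on safety (MiS) and culture (MiC), for evaluating how well vision-language models align images and captions that differ only subtly. Experiments show that current models still have significant weaknesses in rejecting incorrect pairs and in precise cross-modal grounding.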

Details Motivation: Fine-grained image-text alignment is crucial for vision-language models, especially in socially critical settings such as identifying safety risks and distinguishing cultural contexts, where small misjudgments can have serious real-world consequences. Method: Builds two new benchmark datasets based on a contrastive-pair design, MiS (safety) and MiC (culture), in which each sample contains two highly similar images and two highly similar captions; evaluates four mainstream VLMs on fine-grained discrimination tasks including image-caption matching, caption selection, and image selection. Result: Models are generally better at confirming correct pairs than at rejecting incorrect ones; accuracy is higher when selecting the correct caption for a given image than in the converse task; all models show clear modality misalignment on both datasets, with insufficient fine-grained cross-modal grounding. Conclusion: Current vision-language models still face serious cross-modal alignment challenges in applications that require fine semantic and visual distinctions, and their fine-grained image-text grounding needs to improve. Abstract: Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.

[76] LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency

Weilong Yan,Haipeng Li,Hao Xu,Nianjin Ye,Yihao Ai,Shuaicheng Liu,Jingyu Hu

Main category: cs.CV

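TL;DR: This paper proposes LaS-Comp, a zero-shot, category-agnostic 3D shape completion method that uses the geometric priors of 3D foundation models. Its two-stage design (explicit replacement followed by implicit refinement) achieves high-quality completion without training, is compatible with multiple 3D foundation models, and is evaluated on the new comprehensive Omni-Comp benchmark.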

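Details Motivation: Existing 3D shape completion methods usually depend on specific categories or large amounts of annotated data and generalize poorly to diverse partial observations; a general method is needed that is zero-shot, category-agnostic, and able to exploit the geometric priors of 3D foundation models. Method: Proposes the LaS-Comp framework: the first stage is explicit replacement, which preserves the geometry of the original partial observation; the second stage is implicit refinement, which optimizes boundary consistency between observed and synthesized regions; the whole pipeline needs no fine-tuning or training and plugs into different 3D foundation models. The paper also builds the Omni-Comp benchmark, which combines real and synthetic data and covers a variety of complex occlusion patterns. Result: In quantitative and qualitative experiments, LaS-Comp clearly outperforms state-of-the-art methods and shows stronger robustness and generalization across partial-observation settings. Conclusion: LaS-Comp demonstrates that directly using the geometric priors of 3D foundation models for zero-shot, category-agnostic completion is feasible and effective, offering a new paradigm for unsupervised or few-shot 3D understanding. Abstract: This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Our code and data will be available at https://github.com/DavidYan2001/LaS-Comp.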

[77] Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code

Haobo Lin,Tianyi Bai,Chen Chen,Jiajun Zhang,Bohan Zeng,Wentao Zhang,Binhang Yuan

Main category: cs.CV

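TL;DR: This paper proposes a pipeline for synthesizing complex multimodal geometry problems, builds a dataset named GeoCode, and introduces a visual-symbolic alignment strategy based on code prediction, markedly improving model performance on geometry reasoning tasks.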

Details Motivation: Current vision-language models perform poorly on complex geometric constructions, mainly because of limited training data and weak visual-symbolic alignment. Method: Proposes a pipeline that synthesizes multimodal geometry problems from scratch and builds the GeoCode dataset, decoupling problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering; using the plotting code, code prediction serves as an explicit alignment objective. Result: GeoCode has higher structural complexity and reasoning difficulty, with mathematical correctness ensured by multi-stage validation; models trained on GeoCode improve consistently on multiple geometry benchmarks. Conclusion: The GeoCode dataset and its code-driven alignment strategy effectively improve multimodal geometry reasoning and provide a high-quality benchmark and a new training paradigm for the field. Abstract: Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision-language models struggle with complex geometric constructions due to limited training data and weak visual-symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named GeoCode, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.

[78] MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

Haoyu Zhang,Yuwei Wu,Pengxiang Li,Xintong Zhang,Zhi Gao,Rui Gao,Mingyang Gao,Che Sun,Yunde Jia

Main category: cs.CV

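TL;DR: This paper proposes the MIRROR framework, which strengthens multimodal reasoning and reduces visual hallucination through iterative reflection on visual regions (draft, critique, region-based verification, revision), and builds the ReflectV dataset to support training.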

Details Motivation: Existing vision-language models tend to hallucinate or make logic errors on ambiguous or complex visual inputs, and even when they "reflect" at the text level, the corrections remain unsupported by image evidence. Method: Proposes the MIRROR framework, a closed-loop iterative process of draft, critique, region-based verification, and revision; builds the ReflectV dataset, which contains reflection triggers, region-based verification actions, and answer revisions grounded in visual evidence. Result: On general vision-language benchmarks and vision-language reasoning benchmarks, MIRROR markedly improves answer correctness and reduces visual hallucination. Conclusion: Modeling reflection as an evidence-driven, region-aware visual verification process is more effective than purely textual revision and significantly strengthens the visual grounding of VLMs. Abstract: In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.

[79] Benchmarking Computational Pathology Foundation Models For Semantic Segmentation

Lavish Ramchandani,Aashay Tinaikar,Dev Kumar Das,Rohit Garg,Tijo Thomas

Main category: cs.CV

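TL;DR: This paper proposes a fast, interpretable benchmarking method that needs no fine-tuning, based on attention maps and XGBoost, and systematically compares 10 foundation models on pixel-level semantic segmentation across four histopathology datasets. The multimodal model CONCH performs best, and fusing features from several models (CONCH + PathDino + CellViT) improves performance substantially (+7.95% on average).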

Details Motivation: Existing foundation models (such as CLIP, DINO, and CONCH) lack systematic, independent evaluation on pixel-level segmentation in histopathology. Method: Attention maps from the foundation models are used as pixel-wise features and fed to an XGBoost classifier for segmentation prediction, with no fine-tuning, enabling fast, interpretable, model-agnostic evaluation. Result: CONCH is best overall, followed by PathDino; concatenating features from CONCH, PathDino, and CellViT improves segmentation performance by 7.95% on average across all datasets. Conclusion: Multimodal foundation models such as CONCH are better suited to histopathology segmentation; different models learn complementary representations, and fusing their features significantly improves generalization and performance. Abstract: In recent years, foundation models such as CLIP, DINO, and CONCH have demonstrated remarkable domain generalization and unsupervised feature extraction capabilities across diverse imaging tasks. However, systematic and independent evaluations of these models for pixel-level semantic segmentation in histopathology remain scarce. In this study, we propose a robust benchmarking approach to assess 10 foundation models on four histopathological datasets covering both morphological tissue-region and cellular/nuclear segmentation tasks. Our method leverages attention maps of foundation models as pixel-wise features, which are then classified using a machine learning algorithm, XGBoost, enabling fast, interpretable, and model-agnostic evaluation without finetuning. We show that the vision-language foundation model CONCH performed the best across datasets when compared to vision-only foundation models, with PathDino a close second. Further analysis shows that models trained on distinct histopathology cohorts capture complementary morphological representations, and concatenating their features yields superior segmentation performance. Concatenating features from CONCH, PathDino, and CellViT outperformed individual models across all the datasets by 7.95% (averaged across the datasets), suggesting that ensembles of foundation models can better generalize to diverse histopathological segmentation tasks.

[80] Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement

Yuran Dong,Hang Dai,Mang Ye

Main category: cs.CV

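TL;DR: This paper proposes EditedID, a framework that uses alignment, disentanglement, and entanglement mechanisms to address the loss of facial identity consistency when multimodal editing large models edit real portraits. It is training-free and plug-and-play, and preserves both facial identity and edited element IP with high fidelity.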

Details Motivation: Existing methods struggle to keep both facial identity (ID) and edited element IP consistent, mainly because of cross-source distribution bias and cross-source feature contamination. Method: Proposes an Alignment-Disentanglement-Entanglement framework comprising: 1) an adaptive mixing strategy that aligns cross-source latent representations; 2) a hybrid solver that disentangles source-specific identity attributes and details; 3) an attentional gating mechanism that selectively entangles visual elements. Result: Achieves state-of-the-art performance across experiments, markedly improving consistency of the original facial ID and the edited element IP; supports open-world single- and multi-person face editing as a training-free, plug-and-play solution. Conclusion: EditedID provides a new benchmark and a practical path for deploying multimodal editing large models in real-person editing scenarios. Abstract: Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks. However, a persistent limitation is the decline in facial identity (ID) consistency during realistic portrait editing. Due to the human eye's high sensitivity to facial features, such inconsistency significantly hinders the practical deployment of these models. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP due to Cross-source Distribution Bias and Cross-source Feature Contamination. To address these issues, we propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration. By systematically analyzing diffusion trajectories, sampler behaviors, and attention properties, we introduce three key components: 1) an adaptive mixing strategy that aligns cross-source latent representations throughout the diffusion process; 2) a hybrid solver that disentangles source-specific identity attributes and details; 3) an attentional gating mechanism that selectively entangles visual elements. Extensive experiments show that EditedID achieves state-of-the-art performance in preserving original facial ID and edited element IP consistency. As a training-free and plug-and-play solution, it establishes a new benchmark for practical and reliable single/multi-person facial identity restoration in open-world settings, paving the way for the deployment of multimodal editing large models in real-person editing scenarios. The code is available at https://github.com/NDYBSNDY/EditedID.

[81] Driving with A Thousand Faces: A Benchmark for Closed-Loop Personalized End-to-End Autonomous Driving

Xiaoru Dong,Ruiqin Li,Xiao Han,Zhenxuan Wu,Jiamin Wang,Jian Chen,Qi Jiang,SM Yiu,Xinge Zhu,Yuexin Ma

Main category: cs.CV

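TL;DR: This paper proposes Person2Drive, a comprehensive platform and benchmark for personalized end-to-end autonomous driving (E2E-AD). It covers a personalized data collection system, quantitative style-vector metrics (MMD and KL divergence), and a personalized E2E framework with a style reward model, addressing the missing data, evaluation, and algorithms for modeling individual driving styles.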

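Details Motivation: Human driving behavior varies markedly between individuals, but most end-to-end autonomous driving systems learn a single average style and do not support personalized modeling; three challenges remain: scarce individually annotated data, no quantitative metrics for driving style, and no effective algorithms for learning style representations. Method: Proposes the Person2Drive platform, comprising: 1) an open-source, extensible simulation-based data collection system that generates diverse personalized driving trajectory data; 2) a style-vector-based evaluation scheme that uses Maximum Mean Discrepancy (MMD) and KL divergence to quantify differences in individual driving style; 3) a personalized E2E-AD framework with a style reward model that adapts models efficiently to a user's style. Result: Experiments show that Person2Drive supports fine-grained style analysis and reproducible personalized evaluation, and markedly improves E2E models in safety and style consistency; the dataset and code will be released after the paper is accepted. Conclusion: Person2Drive is the first to build a complete, systematic pipeline for personalized end-to-end autonomous driving, providing a unified benchmark and practical tools for driving style modeling, evaluation, and adaptive control, and moving E2E-AD in a human-centered direction. Abstract: Human driving behavior is inherently diverse, yet most end-to-end autonomous driving (E2E-AD) systems learn a single average driving style, neglecting individual differences. Achieving personalized E2E-AD faces challenges across three levels: limited real-world datasets with individual-level annotations, a lack of quantitative metrics for evaluating personal driving styles, and the absence of algorithms that can learn stylized representations from users' trajectories. To address these gaps, we propose Person2Drive, a comprehensive personalized E2E-AD platform and benchmark. It includes an open-source, flexible data collection system that simulates realistic scenarios to generate scalable and diverse personalized driving datasets; style vector-based evaluation metrics with Maximum Mean Discrepancy and KL divergence to comprehensively quantify individual driving behaviors; and a personalized E2E-AD framework with a style reward model that efficiently adapts E2E models for safe and individualized driving. Extensive experiments demonstrate that Person2Drive enables fine-grained analysis, reproducible evaluation, and effective personalization in end-to-end autonomous driving. Our dataset and code will be released after acceptance.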
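
As a rough illustration of the two style metrics named above, the sketch below computes a kernel MMD between two sets of style vectors and a KL divergence between two discrete distributions. The abstract does not define the style vector, the kernel, or the binning, so the feature values, the RBF kernel with `gamma=1.0`, and the histograms are all hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys under an RBF kernel."""
    def mean_kernel(us, vs):
        return sum(rbf_kernel(u, v, gamma) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical 2-D style vectors (e.g. normalized speed and following distance) for two drivers.
driver_a = [(0.2, 0.8), (0.3, 0.7), (0.25, 0.75)]
driver_b = [(0.8, 0.3), (0.9, 0.2), (0.85, 0.25)]
print(mmd_squared(driver_a, driver_a))  # identical samples: zero up to rounding
print(mmd_squared(driver_a, driver_b))  # different styles: clearly positive

# Hypothetical histograms of acceleration over the same two bins.
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

MMD compares whole sets of samples without binning, while KL divergence needs an explicit distribution per driver and is asymmetric; using both, as the benchmark does, covers the two views.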

[82] TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

Haobo Lin,Tianyi Bai,Jiajun Zhang,Xuanhao Chang,Sheng Lu,Fangming Gu,Zengjie Hu,Wentao Zhang

Main category: cs.CV

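TL;DR: This paper proposes TAG, a framework that explicitly constrains the reasoning of vision-language models to facial Action Units (AUs), improving the interpretability, robustness, and visual faithfulness of facial expression recognition (FER).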

Details Motivation: The natural-language explanations that existing vision-language models produce for FER are often detached from visual evidence, prone to hallucination, and generalize poorly, so they cannot be verified. Method: TAG is trained in two steps: 1) supervised fine-tuning on AU-annotated reasoning traces; 2) reinforcement learning with an AU-aware reward that uses external AU detectors to keep intermediate reasoning steps grounded in AU-related facial regions. Result: On RAF-DB, FERPlus, and AffectNet, TAG outperforms strong open-source and closed-source VLM baselines while significantly improving visual faithfulness; ablation and preference studies show that the AU-grounded reward stabilizes reasoning and suppresses hallucination. Conclusion: Structured, AU-grounded intermediate representations are essential for trustworthy multimodal FER reasoning and offer a verifiable, robust approach to fine-grained visual understanding. Abstract: Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG .

[83] A high-resolution nationwide urban village mapping product for 342 Chinese cities based on foundation models

Lubin Bai,Sheng Xiao,Ziyu Yin,Haoyu Wang,Siyang Wu,Xiuyuan Zhang,Shihong Du

Main category: cs.CV

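TL;DR: This paper presents GeoLink-UV, a high-resolution nationwide map of urban villages (UVs) in 342 Chinese cities, generated from multisource geospatial data with a foundation-model-driven framework. A geographically stratified validation confirms its reliability, and the product reveals regional disparities and morphological characteristics of UVs in support of urban governance, renewal, and sustainable development goals.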

Details Motivation: Urban villages (UVs) in China are highly heterogeneous and widely distributed, and the lack of a consistent, reliable nationwide dataset holds back research on urban governance and sustainable development. Method: Optical remote sensing imagery is combined with geo-vector data in a foundation-model-driven mapping framework that extracts UV boundaries at national scale; accuracy is assessed with a geographically stratified evaluation on independent samples from 28 cities. Result: The resulting GeoLink-UV dataset covers 342 cities at high resolution; validation shows high reliability across heterogeneous urban contexts; UVs account for 8% of built-up land on average, cluster in central and south China, and are consistently low-rise and high-density with regionally distinct morphology. Conclusion: GeoLink-UV provides an open, systematically validated geospatial foundation for urban studies, informal settlement monitoring, and evidence-based urban renewal planning, and directly supports large-scale assessments for UN Sustainable Development Goal 11. Abstract: Urban Villages (UVs) represent a distinctive form of high-density informal settlement embedded within China's rapidly urbanizing cities. Accurate identification of UVs is critical for urban governance, renewal, and sustainable development. However, due to the pronounced heterogeneity and diversity of UVs across China's vast territory, a consistent and reliable nationwide dataset has been lacking. In this work, we present GeoLink-UV, a high-resolution nationwide UV mapping product that clearly delineates the locations and boundaries of UVs in 342 Chinese cities. The dataset is derived from multisource geospatial data, including optical remote sensing images and geo-vector data, and is generated through a foundation model-driven mapping framework designed to address the generalization issues and improve the product quality. A geographically stratified accuracy assessment based on independent samples from 28 cities confirms the reliability and scientific credibility of the nationwide dataset across heterogeneous urban contexts. Based on this nationwide product, we reveal substantial interregional disparities in UV prevalence and spatial configuration. On average, UV areas account for 8% of built-up land, with marked clustering in central and south China. Building-level analysis further confirms a consistent low-rise, high-density development pattern of UVs nationwide, while highlighting regionally differentiated morphological characteristics. The GeoLink-UV dataset provides an open and systematically validated geospatial foundation for urban studies, informal settlement monitoring, and evidence-based urban renewal planning, and contributes directly to large-scale assessments aligned with Sustainable Development Goal 11. The GeoLink-UV dataset introduced in this article is freely available at https://doi.org/10.5281/zenodo.18688062.

[84] Initialization matters in few-shot adaptation of vision-language models for histopathological image classification

Pablo Meseguer,Rocío del Amor,Valery Naranjo

Main category: cs.CV

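TL;DR: This paper proposes Zero-Shot Multiple-Instance Learning (ZS-MIL) for few-shot classification of histopathological whole-slide images (WSIs). It initializes the classifier weights with class-level embeddings from the text encoder of a vision-language model (VLM), improving the performance and robustness of multiple-instance learning (MIL) frameworks.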

Details Motivation: Existing MIL-based WSI classification methods randomly initialize the linear classifier weights during few-shot fine-tuning, which makes performance unstable and often worse than zero-shot prediction; a better initialization strategy is needed for efficient transfer learning (ETL). Method: ZS-MIL takes the class-level text embeddings produced by the pretrained VLM text encoder and uses them as the initial weights of the classification layer in the MIL framework, where they are used to compute bag-level (slide-level) probabilities; the classifier therefore starts from the zero-shot solution rather than from random weights, which suits few-shot settings. Result: Across multiple subtyping experiments, ZS-MIL gives higher and more stable classification performance than classical weight initialization methods (e.g., random or Kaiming initialization), with the clearest advantage over baselines in few-shot ETL scenarios. Conclusion: ZS-MIL is a simple and effective way to bring zero-shot semantic priors into the initialization of the MIL classification layer, mitigating the performance loss caused by poor weight initialization in few-shot WSI classification and offering a new route for transferring VLMs to pathology image analysis. Abstract: Vision language models (VLM) pre-trained on datasets of histopathological image-caption pairs enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door for supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require the incorporation of multiple instance learning (MIL) due to the gigapixel size of the WSI. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on Linear Probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization that underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques both in terms of performance and variability in an ETL few-shot scenario for subtyping prediction.
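
The initialization idea can be sketched as follows. This is a hedged illustration only: mean pooling as the MIL aggregator, the softmax temperature, and the toy three-dimensional embeddings are assumptions, and the function names are hypothetical rather than taken from the paper.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (leaves an all-zero vector unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]

def mean_pool(patch_features):
    """Aggregate patch-level features into one slide-level feature (assumed aggregator)."""
    dim = len(patch_features[0])
    count = len(patch_features)
    return [sum(patch[i] for patch in patch_features) / count for i in range(dim)]

def init_classifier_from_text(text_embeddings):
    """Use one normalized text embedding per class as the initial classifier weight row."""
    return [l2_normalize(embedding) for embedding in text_embeddings]

def bag_probabilities(patch_features, weights, temperature=0.07):
    """Softmax over cosine similarities between the slide feature and each class weight."""
    slide = l2_normalize(mean_pool(patch_features))
    logits = [sum(w * s for w, s in zip(row, slide)) / temperature for row in weights]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class text embeddings and patch features (values are made up).
class_text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patches = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]]

weights = init_classifier_from_text(class_text_embeddings)
probs = bag_probabilities(patches, weights)
```

Before any fine-tuning step, the classifier already reproduces the zero-shot prediction, which is the starting point that few-shot training then refines.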

[85] MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations

Changlu Guo,Anders Nymark Christensen,Anders Bjorholm Dahl,Morten Rieger Hannemose

Main category: cs.CV

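TL;DR: This paper proposes MaskDiME, a training-free, diffusion-based method for visual counterfactual explanations. Localized sampling yields semantically consistent and spatially precise counterfactual images with high fidelity, with inference over 30x faster and performance comparable to or at the state of the art.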

Details Motivation: Existing diffusion-based counterfactual generation methods are computationally expensive, slow to sample, and imprecise in localizing the modified regions, so a more efficient, precise, and practical solution is needed. Method: MaskDiME is a training-free framework whose localized sampling strategy adaptively focuses on decision-relevant regions, unifying semantic consistency and spatial precision. Result: On five benchmark datasets spanning different visual domains, MaskDiME runs inference over 30x faster than the baseline method while achieving comparable or state-of-the-art performance. Conclusion: MaskDiME is a simple, fast, effective, and general counterfactual explanation framework that offers a practical route to efficient, interpretable analysis of deep models. Abstract: Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, and effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, achieves over 30x faster inference than the baseline method and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.

[86] Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance

Zhou Jiang,Yandong Wen,Zhen Liu

Main category: cs.CV

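TL;DR: This paper proposes contrastive guidance, a method that needs no retraining of the base model: it decouples preference learning on positive and negative samples and forms a difference-based guidance vector, improving how well text-to-image diffusion models align with human preferences and generalize.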

Details Motivation: Existing direct preference optimization (DPO) methods show a generalization gap under large-scale fine-tuning; a more robust preference alignment mechanism that requires no retraining is needed. Method: Preference alignment is cast as a classifier-free guidance (CFG) problem. Two modules learn positive and negative preferences separately; at inference, the negative prediction is subtracted from the positive prediction to form a contrastive guidance vector, which is scaled by a user-chosen strength and added to the base model's prediction. Result: On Stable Diffusion 1.5 and XL, evaluated with Pick-a-Pic v2 and HPDv3, the method gives consistent quantitative and qualitative gains. Conclusion: Contrastive guidance is a simple, effective preference alignment approach that requires no fine-tuning of the base model and improves alignment accuracy, controllability, and generalization. Abstract: Aligning large-scale text-to-image diffusion models with nuanced human preferences remains challenging. While direct preference optimization (DPO) is simple and effective, large-scale finetuning often shows a generalization gap. We take inspiration from test-time guidance and cast preference alignment as classifier-free guidance (CFG): a finetuned preference model acts as an external control signal during sampling. Building on this view, we propose a simple method that improves alignment without retraining the base model. To further enhance generalization, we decouple preference learning into two modules trained on positive and negative data, respectively, and form a contrastive guidance vector at inference by subtracting their predictions (positive minus negative), scaled by a user-chosen strength and added to the base prediction at each step. This yields a sharper and controllable alignment signal. We evaluate on Stable Diffusion 1.5 and Stable Diffusion XL with Pick-a-Pic v2 and HPDv3, showing consistent quantitative and qualitative gains.
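
The inference-time combination described in the abstract (positive minus negative, scaled, added to the base prediction) reduces to one line of arithmetic per denoising step. The sketch below assumes the three predictions are already available as flat lists of floats; the function and variable names are hypothetical.

```python
def contrastive_guidance(eps_base, eps_pos, eps_neg, strength):
    """Return base + strength * (positive - negative), elementwise."""
    return [
        base + strength * (pos - neg)
        for base, pos, neg in zip(eps_base, eps_pos, eps_neg)
    ]

# Toy noise predictions for a two-element latent (values are made up).
base_prediction = [1.0, 2.0]
positive_prediction = [0.5, 0.5]
negative_prediction = [0.25, 1.0]

guided = contrastive_guidance(
    base_prediction, positive_prediction, negative_prediction, strength=2.0
)
unguided = contrastive_guidance(
    base_prediction, positive_prediction, negative_prediction, strength=0.0
)
```

Setting the strength to zero recovers the base model's prediction unchanged, which is what makes the alignment signal controllable at sampling time.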

[87] Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection

Wanqi Wang,Jingcai Guo,Yuxiang Cai,Zhi Chen

Main category: cs.CV

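TL;DR: This paper proposes LMP, a dual-branch detector that learns multi-modal prototypes by combining text prompts with visual exemplars from the target domain, improving cross-domain few-shot object detection.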

Details Motivation: Existing open-vocabulary detectors built on vision-language models rely heavily on text prompts and do not model the visual information specific to the target domain, which makes precise localization difficult under few-shot supervision. Method: LMP is a dual-branch detector. The visual-guided branch builds class-level prototypes from support-image RoIs and dynamically generates jittered boxes in query images as hard-negative prototypes; the text-guided branch preserves open-vocabulary semantics; the two branches are trained jointly and ensembled at inference. Result: On six cross-domain benchmark datasets under 1/5/10-shot settings, mAP is state-of-the-art or highly competitive. Conclusion: A dual-branch architecture that fuses text semantics with target-domain visual prototypes effectively mitigates domain shift and imprecise localization in cross-domain few-shot detection. Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The branches are trained jointly and ensembled at inference by combining semantic abstraction with domain-adaptive details. On six cross-domain benchmark datasets and standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.

[88] HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

Chongyang Xu,Shen Cheng,Haipeng Li,Haoqiang Fan,Ziliang Feng,Shuaicheng Liu

Main category: cs.CV

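TL;DR: This paper proposes HeRO, a diffusion-based policy built on hierarchical semantic fields. It fuses DINOv2 and Stable Diffusion features to couple geometry with semantics, significantly improving pose-aware robotic manipulation.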

Details Motivation: Purely geometric policies lack part-level semantic information and struggle with pose-aware manipulation (e.g., distinguishing a shoe's toe from its heel). Method: HeRO uses dense semantics lifting to fuse the discriminative, geometry-sensitive features of DINOv2 with the smooth, globally coherent correspondences of Stable Diffusion; it builds a global field and multiple local fields; a hierarchical conditioning module based on a permutation-invariant network architecture conditions the generative denoiser on these fields. Result: Success on the Place Dual Shoes task improves by 12.3%, with an average gain of 6.5% across six challenging pose-aware tasks. Conclusion: HeRO couples geometry and semantics effectively, offers a new approach to pose-aware robotic manipulation, and achieves state-of-the-art performance. Abstract: Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state-of-the-art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.

[89] Robust Self-Supervised Cross-Modal Super-Resolution against Real-World Misaligned Observations

Xiaoyu Dong,Jiahuan Li,Ziteng Cui,Naoto Yokoya

Main category: cs.CV

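TL;DR: This paper proposes RobSelf, a fully self-supervised cross-modal super-resolution model that requires no training data, ground-truth labels, or pre-alignment and handles misaligned real-world data.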

Details Motivation: In real-world cross-modal super-resolution, the low-resolution source image and the high-resolution guide image are related by complex spatial misalignments, and only a limited number of paired samples are available. Method: RobSelf combines a misalignment-aware feature translator and a content-aware reference filter: the translator recasts unsupervised cross-modal, cross-resolution alignment as a weakly supervised, misalignment-aware translation subtask; the filter uses the translated guide feature to perform reference-based discriminative self-enhancement on the source image. Result: RobSelf achieves state-of-the-art performance and higher efficiency across a variety of tasks, and a real-world dataset, RealMisSR, is released. Conclusion: RobSelf delivers efficient, robust cross-modal super-resolution without any supervision or preprocessing, advancing self-supervised SR on misaligned data. Abstract: We study cross-modal super-resolution (SR) on real-world misaligned data, where only a limited number of low-resolution (LR) source and high-resolution (HR) guide image pairs with complex spatial misalignments are available. To address this challenge, we propose RobSelf, a fully self-supervised model that is optimized online, requiring no training data, ground-truth supervision, or pre-alignment. RobSelf features two key techniques: a misalignment-aware feature translator and a content-aware reference filter. The translator reformulates unsupervised cross-modal and cross-resolution alignment as a weakly-supervised, misalignment-aware translation subtask, producing an aligned guide feature with inherent redundancy. Guided by this feature, the filter performs reference-based discriminative self-enhancement on the source, enabling SR predictions with high resolution and high fidelity. Across a variety of tasks, we demonstrate that RobSelf achieves state-of-the-art performance and superior efficiency. Additionally, we introduce a real-world dataset, RealMisSR, to advance research on this topic. Dataset and code: https://github.com/palmdong/RobSelf.

[90] Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation

Liying Yang,Jialun Liu,Jiakui Hu,Chenhao Guan,Haibin Huang,Fangqiu Yi,Chi Zhang,Yanyan Liang

Main category: cs.CV

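TL;DR: This paper proposes 4DSTAR, which combines a dynamic spatial-temporal state propagation autoregressive model (STAR) with a 4D VQ-VAE to generate high-quality, spatially and temporally consistent 4D objects.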

Details Motivation: Existing diffusion-based methods struggle to guarantee spatial-temporal consistency because they do not fully use the outputs of all previous timesteps to guide generation at the current step. Method: STAR groups the predicted tokens by timestep, uses a spatial-temporal container to dynamically update state features from historical groups, and feeds them as conditions for predicting the next token group; a 4D VQ-VAE implicitly encodes the 4D structure into discrete tokens and decodes them into temporally coherent dynamic 3D Gaussians. Result: Experiments show that 4DSTAR generates spatially and temporally consistent 4D objects with performance comparable to diffusion models. Conclusion: By introducing a spatial-temporal state propagation mechanism and a dedicated 4D encoder-decoder architecture, 4DSTAR effectively addresses the spatial-temporal consistency problem in 4D generation. Abstract: Generating high-quality 4D objects with spatial-temporal consistency is still formidable. Existing diffusion-based methods often struggle with spatial-temporal inconsistency, as they fail to leverage outputs from all previous timesteps to guide the generation at the current timestep. Therefore, we propose a Spatial-Temporal State Propagation AutoRegressive Model (4DSTAR), which generates 4D objects maintaining temporal-spatial consistency. 4DSTAR formulates the generation problem as the prediction of tokens that represent the 4D object. It consists of two key components: (1) The dynamic spatial-temporal state propagation autoregressive model (STAR) is proposed, which achieves spatial-temporal consistent generation. Unlike standard autoregressive models, STAR divides prediction tokens into groups based on timesteps. It models long-term dependencies by propagating spatial-temporal states from previous groups and utilizes these dependencies to guide generation at the next timestep. To this end, a spatial-temporal container is proposed, which dynamically updates the effective spatial-temporal state features from all historical groups; the updated features then serve as conditional features to guide the prediction of the next token group. (2) The 4D VQ-VAE is proposed, which implicitly encodes the 4D structure into discrete space and decodes the discrete tokens predicted by STAR into temporally coherent dynamic 3D Gaussians. Experiments demonstrate that 4DSTAR generates spatial-temporal consistent 4D objects, and achieves performance competitive with diffusion models.

[91] IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbation

Fadi Boutros,Eduarda Caldeira,Tahar Chettaoui,Naser Damer

Main category: cs.CV

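TL;DR: This paper proposes IDPERTURB, which perturbs identity embeddings within a constrained angular region of the unit hypersphere to increase the intra-class diversity of synthetic face images, improving the generalization of face recognition models.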

Details Motivation: Privacy and legal constraints restrict the use of real face data, while synthetic faces from existing identity-conditional diffusion models lack sufficient intra-class variation, which hurts the training of robust face recognition systems. Method: IDPERTURB is a geometry-driven sampling strategy that perturbs identity embeddings on the unit hypersphere under an angular constraint and feeds each perturbed embedding as the condition to a pre-trained diffusion model, producing face images that keep the identity but vary in appearance. Result: Face recognition models trained on IDPERTURB-generated data outperform existing synthetic-data approaches on multiple benchmarks. Conclusion: IDPERTURB is a simple and effective strategy for increasing intra-class diversity in synthetic faces without modifying the generative model, and it helps train face recognition systems that generalize better. Abstract: Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intra-class variation, an essential property for training robust and generalizable FR models. In this work, we propose IDPERTURB, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. IDPERTURB perturbs identity embeddings within a constrained angular region of the unit hyper-sphere, producing a diverse set of embeddings without modifying the underlying generative model. Each perturbed embedding serves as a conditioning vector for a pre-trained diffusion model, enabling the synthesis of visually varied yet identity-coherent face images suitable for training generalizable FR systems. Empirical results demonstrate that training FR on datasets generated using IDPERTURB yields improved performance across multiple FR benchmarks, compared to existing synthetic data generation approaches.
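
One way to perturb an embedding inside an angular cone on the unit hypersphere is sketched below. The abstract does not specify the sampling distribution, so drawing the angle uniformly from [0, max_angle] and the direction from an orthogonalized Gaussian are assumptions, and `perturb_identity` is a hypothetical name.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def perturb_identity(embedding, max_angle_rad, rng):
    """Rotate a unit embedding by a random angle of at most max_angle_rad."""
    identity = normalize(embedding)
    # Random direction, made orthogonal to the identity vector (Gram-Schmidt step).
    noise = [rng.gauss(0.0, 1.0) for _ in identity]
    projection = dot(noise, identity)
    tangent = normalize([n - projection * e for n, e in zip(noise, identity)])
    # Assumed sampling rule: angle drawn uniformly from [0, max_angle_rad].
    theta = rng.uniform(0.0, max_angle_rad)
    return [math.cos(theta) * e + math.sin(theta) * t for e, t in zip(identity, tangent)]

rng = random.Random(0)
identity_embedding = [1.0, 0.0, 0.0, 0.0]  # toy 4-D embedding; real ones are much larger
max_angle = math.radians(30.0)
perturbed = perturb_identity(identity_embedding, max_angle, rng)
cosine_to_identity = dot(normalize(identity_embedding), perturbed)
```

Because the output stays on the unit sphere and its cosine similarity to the original never drops below cos(max_angle), the angle bound acts as the knob that trades identity consistency against appearance diversity.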

[92] CLAP Convolutional Lightweight Autoencoder for Plant Disease Classification

Asish Bera,Subhajit Roy,Sudiptendu Banerjee

Main category: cs.CV

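TL;DR: This paper proposes CLAP, a lightweight convolutional autoencoder for plant disease classification that maintains high accuracy at a markedly lower computational cost.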

Details Motivation: Traditional machine learning and existing deep learning methods struggle to capture subtle differences in plant health under real field conditions, and many deep models are computationally expensive. Method: CLAP is a lightweight convolutional autoencoder built on separable convolutions and a sigmoid gating mechanism; it combines the encoder and decoder feature maps to strengthen the representation. Result: On three public plant datasets (Integrated Plant Disease, Groundnut, CCMT), CLAP reaches improved or competitive classification accuracy with only 5 million parameters, a training time of 20 ms per image, and an inference time of 1 ms per image. Conclusion: CLAP strikes a good balance between performance and efficiency and suits plant disease diagnosis in resource-constrained field settings. Abstract: Convolutional neural networks have markedly improved performance in distinguishing plant diseases, severity grading, and nutrition deficiency prediction using leaf images. However, these tasks become more challenging under realistic in-situ field conditions. Often, a traditional machine learning model may fail to capture and interpret discriminative characteristics of plant health, growth and diseases due to subtle variations within leaf subcategories. A few deep learning methods have used additional preprocessing stages or network modules to address the problem, whereas several other methods have utilized pre-trained backbone CNNs, most of which are computationally intensive. Therefore, to address the challenge, we propose a lightweight autoencoder using separable convolutional layers in its encoder-decoder blocks. Sigmoid gating is applied to refine the encoder's feature discriminability, which is improved further by the decoder. Finally, the feature maps of the encoder and decoder are combined for rich feature representation before classification. The proposed Convolutional Lightweight Autoencoder for Plant disease classification, called CLAP, has been evaluated on three public plant datasets consisting of cassava, tomato, maize, groundnut, grapes, etc. for determining plant health conditions. CLAP has attained improved or competitive accuracies on the Integrated Plant Disease, Groundnut, and CCMT datasets, balancing performance against a low computational cost of 5 million parameters. The training time is 20 milliseconds and inference time is 1 ms per image.

[93] Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification

Jiangling Zhang,Shuxuan Gao,Bofan Liu,Siqiang Feng,Jirui Huang,Yaxiong Chen,Ziyu Chen

Main category: cs.CV

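TL;DR: This paper proposes IFA-Net, which uses a frozen pretrained Masked Autoencoder (MAE) as a prior over real images and a two-stage closed-loop mechanism (coarse localization followed by refinement via task-adaptive prior injection) to localize manipulated regions in AI-generated images at the pixel level. It significantly improves IoU and F1 on diffusion-based manipulation detection and generalizes well.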

Details Motivation: Existing methods mostly learn discriminative patterns for specific forgery types and struggle with continually evolving editing techniques; a general detection approach is needed that models what is real and generalizes to unseen manipulation types. Method: The Iterative Forgery Amplifier Network (IFA-Net) works in two stages: first, a Dual-Stream Segmentation Network (DSSN) fuses the original image with the reconstruction residuals of the frozen MAE for coarse localization; second, a Task-Adaptive Prior Injection (TAPI) module converts the coarse prediction into prompts that steer the MAE decoder to amplify reconstruction failures in suspicious regions, giving precise localization. Result: On four diffusion-based inpainting benchmarks, IFA-Net improves average IoU by 6.5% and F1-score by 8.1% over the second-best method, and it also generalizes strongly to traditional manipulation types. Conclusion: Modeling a realness prior as deviation from the natural image manifold and iteratively amplifying forgery traces is a more robust and more general approach to detecting AI image manipulation. Abstract: The proliferation of highly realistic AI-generated images poses critical challenges for digital forensics, demanding precise pixel-level localization of manipulated regions. Existing methods predominantly learn discriminative patterns of specific forgeries and often struggle with novel manipulations as editing techniques continue to evolve. We propose the Iterative Forgery Amplifier Network (IFA-Net), which shifts from learning "what is fake" to modeling "what is real". Grounded in the principle that all manipulations deviate from the natural image manifold, IFA-Net leverages a frozen Masked Autoencoder (MAE) pretrained on real images as a universal realness prior. Our framework operates through a two-stage closed-loop process: an initial Dual-Stream Segmentation Network (DSSN) fuses the original image with MAE reconstruction residuals for coarse localization, followed by a Task-Adaptive Prior Injection (TAPI) module that converts this coarse prediction into guiding prompts to steer the MAE decoder and amplify reconstruction failures in suspicious regions for precise refinement. Extensive experiments on four diffusion-based inpainting benchmarks show that IFA-Net achieves an average improvement of 6.5% in IoU and 8.1% in F1-score over the second-best method, while demonstrating strong generalization to traditional manipulation types.
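
For readers less familiar with the two localization metrics reported above, the sketch below computes pixel-level IoU and F1 for a pair of flattened binary masks. The toy masks are made up, and treating two empty masks as a perfect match is a convention chosen for this example.

```python
def iou_and_f1(predicted_mask, target_mask):
    """Pixel-level IoU and F1 for two equal-length binary masks (flattened)."""
    pairs = list(zip(predicted_mask, target_mask))
    true_pos = sum(1 for p, t in pairs if p and t)
    false_pos = sum(1 for p, t in pairs if p and not t)
    false_neg = sum(1 for p, t in pairs if not p and t)
    union = true_pos + false_pos + false_neg
    if union == 0:
        # Both masks are empty: treated here as a perfect match (a convention).
        return 1.0, 1.0
    iou = true_pos / union
    f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
    return iou, f1

# Toy 5-pixel masks: 1 marks a pixel flagged as manipulated.
predicted = [1, 1, 0, 0, 1]
ground_truth = [1, 0, 0, 1, 1]
iou, f1 = iou_and_f1(predicted, ground_truth)
```

With two true positives, one false positive, and one false negative, IoU is 2/4 and F1 is 4/6, which shows why F1 is always at least as large as IoU on the same masks.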

Chengwei Xia,Fan Ma,Ruijie Quan,Yunqiu Xu,Kun Zhan,Yi Yang

Main category: cs.CV

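TL;DR: This paper proposes a framework for generating copyright triggers for multimodal large language models (MLLMs). It constructs trigger images that make derivative models output ownership-related text while other models stay silent, enabling model lineage tracing and intellectual property protection.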

Details Motivation: With the rapid deployment and wide adoption of MLLMs, disputes over model version attribution and ownership are increasingly frequent, creating an urgent need for effective intellectual property protection. Method: The trigger image is treated as a learnable tensor and optimized adversarially with dual injection: first, an auxiliary MLLM enforces consistency between its output and a predefined ownership text, and the consistency loss is backpropagated into the image; second, the distance between the image and the target text in CLIP feature space is minimized. An additional adversarial training stage uses an auxiliary model, derived from the original model and trained to resist producing the target text, to improve robustness to heavily fine-tuned models. Result: Extensive experiments show that the dual-injection approach tracks model lineage effectively under various fine-tuning strategies and domain-shift scenarios, with good robustness and practicality. Conclusion: The proposed copyright trigger framework gives MLLMs a verifiable, interference-resistant, and extensible means of embedding and tracing ownership, helping to address model intellectual property protection. Abstract: With the rapid deployment and widespread adoption of multimodal large language models (MLLMs), disputes regarding model version attribution and ownership have become increasingly frequent, raising significant concerns about intellectual property protection. In this paper, we propose a framework for generating copyright triggers for MLLMs, enabling model publishers to embed verifiable ownership information into the model. The goal is to construct trigger images that elicit ownership-related textual responses exclusively in fine-tuned derivatives of the original model, while remaining inert in other non-derivative models. Our method constructs a tracking trigger image by treating the image as a learnable tensor, performing adversarial optimization with dual-injection of ownership-relevant semantic information. The first injection is achieved by enforcing textual consistency between the output of an auxiliary MLLM and a predefined ownership-relevant target text; the consistency loss is backpropagated to inject this ownership-related information into the image. The second injection is performed at the semantic-level by minimizing the distance between the CLIP features of the image and those of the target text. Furthermore, we introduce an additional adversarial training stage involving the auxiliary model derived from the original model itself. This auxiliary model is specifically trained to resist generating ownership-relevant target text, thereby enhancing robustness in heavily fine-tuned derivative models. Extensive experiments demonstrate the effectiveness of our dual-injection approach in tracking model lineage under various fine-tuning and domain-shift scenarios.

[95] DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Aditya Kumar Singh,Hitesh Kandala,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CV

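TL;DR: This paper proposes DUET-VLM, a dual-stage visual token compression framework that sharply reduces the number of visual tokens while maintaining high accuracy, for both image and video multimodal models.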

Details Motivation: Existing vision-language models are computationally expensive because of dense visual tokenization, and existing efficiency methods often trade accuracy for speed. Method: DUET-VLM has two stages: (a) redundancy-aware compression of the vision encoder's output into information-preserving visual tokens; (b) layer-wise, text-guided dropping of less salient visual tokens inside the language backbone. Result: On LLaVA-1.5-7B, it keeps over 99% of baseline accuracy with 67% fewer tokens and over 97% at 89% reduction; with the compression integrated during training the figures improve to 99.7% and 97.6%; on Video-LLaVA-7B it even exceeds the baseline (over 100% of baseline accuracy at 53.1% reduction, and 97.6% at 93.4% reduction). Conclusion: Through end-to-end training, DUET-VLM achieves robust compression, producing compact yet semantically rich representations within the same computational budget and clearly outperforming prior state-of-the-art methods. Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline, achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
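
The second stage, dropping less salient visual tokens inside the language backbone, can be illustrated with a simple top-k filter. This is only a sketch of the general idea: how DUET-VLM actually scores saliency is not stated in the abstract, so the per-token scores below stand in for a hypothetical text-guided saliency signal, and `prune_tokens` is a made-up name.

```python
def prune_tokens(tokens, saliency, keep_ratio):
    """Keep the highest-saliency tokens, preserving their original order."""
    keep_count = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    kept_indices = sorted(ranked[:keep_count])
    return [tokens[i] for i in kept_indices]

# Toy visual tokens with assumed saliency scores (values are made up).
visual_tokens = ["sky", "dog", "grass", "ball"]
saliency_scores = [0.1, 0.9, 0.3, 0.8]

kept = prune_tokens(visual_tokens, saliency_scores, keep_ratio=0.5)
```

Applying such a filter at successive layers with shrinking keep ratios gives progressive pruning; a keep ratio of 0.33 corresponds to the 67% reduction setting quoted above.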

[96] Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

Dong Zhao,Qi Zang,Nan Pu,Wenjing Li,Nicu Sebe,Zhun Zhong

Main category: cs.CV

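TL;DR: This paper introduces a new task, Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), which tackles unseen domains and unseen categories together, and builds the first OVDG-SS benchmark for autonomous driving. To counter the distortion of text-image correlations caused by domain shift, it proposes S2-Corr, a state-space-based mechanism that improves cross-domain robustness and efficiency.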

Details Motivation: Existing domain-generalized semantic segmentation methods are limited to a fixed set of known categories, while open-vocabulary semantic segmentation models are sensitive to domain shift and lack robustness in real scenes such as urban driving; generalization to unseen domains and unseen categories needs to be addressed jointly. Method: The paper defines the OVDG-SS task and the first autonomous-driving benchmark for it, and designs S2-Corr, which uses state-space modeling to refine the text-image correlations of pre-trained vision-language models and mitigate the distortion introduced by domain shift. Result: On the constructed OVDG-SS benchmark (covering synthetic-to-real and real-to-real settings), the method clearly outperforms existing open-vocabulary semantic segmentation methods in both cross-domain performance and computational efficiency. Conclusion: OVDG-SS is a key step toward open-world deployment of semantic segmentation; S2-Corr shows that explicitly modeling and refining text-image correlations improves domain generalization, suggesting a direction for VLM-driven vision tasks. Abstract: Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text-image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S2-Corr, a state-space-driven text-image correlation refinement mechanism that mitigates domain-induced distortions and produces more consistent text-image correlations under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.

[97] Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

Shile Li,Markus Karmann,Onay Urfalioglu

Main category: cs.CV

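TL;DR: This paper proposes an end-to-end joint quantization framework for Vision Transformers that supports data-free calibration (using Stable Diffusion Turbo to generate diverse images). It achieves state-of-the-art accuracy at W4A4 and W3A3 and at the extremely low-bit W1.58A8 setting, and applies to ViT, DeiT, Swin-T, and other models.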

Details Motivation: Prior post-training and block-wise reconstruction methods struggle to model inter-layer dependencies, and ViT-family models lose substantial accuracy at low bit widths; an efficient, general joint quantization scheme that needs no labeled data has been missing. Method: The framework optimizes the whole network jointly, end to end, and adds a data-free calibration strategy based on Stable Diffusion Turbo that learns multi-mode prompts to synthesize images with diverse features. Result: It reaches state-of-the-art W4A4 and W3A3 accuracy on ImageNet, is the first to maintain strong accuracy for ViT, DeiT, and Swin-T at the extremely low-bit W1.58A8 setting, and its data-free calibration matches calibration on real ImageNet data. Conclusion: The method demonstrates the effectiveness of end-to-end joint quantization and of generative data synthesis for calibration, opening a route to efficient deployment of ViT-family models on edge devices. Abstract: We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for the purpose of image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over the entire set of all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a photo of ".
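
The "W1.58" label is commonly read as ternary weights, since three levels carry log2(3) ≈ 1.58 bits; the abstract does not spell this out, so that reading is an assumption here. The sketch below shows one simple ternary scheme (scale by the mean absolute weight, then round and clip). It illustrates the bit-width arithmetic only and is not the paper's joint optimization procedure.

```python
import math

def ternary_quantize(weights):
    """Map weights to codes in {-1, 0, 1} plus one shared scale (absmean scaling assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [code * scale for code in codes]

bits_per_ternary_weight = math.log2(3)  # about 1.585 bits

# Toy weight vector (values are made up).
example_weights = [0.8, -0.05, -1.2, 0.3]
codes, scale = ternary_quantize(example_weights)
reconstructed = dequantize(codes, scale)
```

Small weights collapse to zero and large ones saturate at plus or minus the scale, which is why calibration data, real or synthesized, matters so much at this bit width.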