Table of Contents
cs.CL
[1] Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration
Longxuan Wei,Yubo Zhang,Zijiao Zhang,Zhihu Wang,Shiwan Zhao,Tianyu Huang,Huiting Zhao,Chenfei Liu,Shenao Zhang,Junchi Yan
Main category: cs.CL
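TL;DR: This paper proposes Entropy-Tree, an entropy-based tree decoding method that improves the accuracy and uncertainty calibration of large language models on reasoning tasks.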
Details
Motivation: Existing decoding strategies, such as random sampling or independent multi-path sampling, explore either blindly or redundantly and fail to exploit the model's uncertainty information.
Method: Entropy-Tree uses the entropy of the model's output distribution as the signal for branching decisions, expanding the search tree only at positions where the model is genuinely uncertain, which yields structured and efficient exploration.
Result: Across multiple models and datasets, Entropy-Tree achieves better pass@k than Multi-chain, and its predictive entropy attains better AUROC than several traditional uncertainty measures.
Conclusion: Entropy-Tree unifies efficient structured search and reliable uncertainty estimation in a single decoding procedure, improving both reasoning performance and trustworthiness.
Abstract: Large language models achieve strong reasoning performance, yet existing decoding strategies either explore blindly (random sampling) or redundantly (independent multi-sampling). We propose Entropy-Tree, a tree-based decoding method that exploits entropy as a signal for branching decisions, expanding the search tree only at positions where the model exhibits genuine uncertainty. Entropy-Tree shows superior accuracy and calibration in reasoning tasks: it achieves better pass@k than Multi-chain across multiple models and datasets, and its predictive entropy demonstrates better AUROC compared to several traditional metrics. Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure.
[2] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports
Edward Ajayi
Main category: cs.CL
TL;DR: AfriEconQA is a new benchmark dataset for African economic question answering, built from 236 World Bank reports, requiring numerical reasoning and temporal disambiguation; it exposes severe knowledge gaps in current LLMs and RAG systems.
Details
Motivation: To address the lack of domain-specific, high-quality benchmarks for African economic analysis, especially given the scarcity of such data in LLM pretraining corpora, and to rigorously evaluate IR and RAG systems on precise, temporally grounded economic reasoning.
Method: Constructed AfriEconQA: 8,937 QA instances from 236 World Bank reports, with strict evidence-answer alignment; evaluated via an 11-experiment matrix comparing zero-shot (GPT-5 Mini) and RAG (GPT-4o, Qwen 32B) setups across five embedding/ranking strategies.
Result: Zero-shot models failed on >90% of queries; even advanced RAG pipelines achieved only limited precision, confirming AfriEconQA's difficulty and utility as a stress test for domain-specific IR/RAG.
Conclusion: AfriEconQA establishes a novel, challenging, and necessary benchmark for advancing domain-specific information retrieval and retrieval-augmented generation in underrepresented economic contexts, particularly Africa.
Abstract: We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance.
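The four-part instance described above could be represented roughly as follows. This is a minimal sketch, not the dataset's actual schema: the class name `QAInstance`, its field names, the helper `latest_evidence`, and the sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QAInstance:
    """One benchmark record with the four parts listed in the abstract."""
    question: str      # (1) question requiring reasoning over economic indicators
    evidence: str      # (2) supporting passage retrieved from the corpus
    answer: str        # (3) verified ground-truth answer
    source_url: str    # (4) source metadata: where the report lives
    published: date    # (4) source metadata: publication date


def latest_evidence(instances: list[QAInstance], question: str) -> QAInstance:
    """Return the most recently published record for a question (hypothetical helper)."""
    matches = [i for i in instances if i.question == question]
    if not matches:
        raise KeyError(question)
    return max(matches, key=lambda i: i.published)


# Placeholder records; the values are invented for illustration only.
records = [
    QAInstance("What was GDP growth?", "Growth was 2.1 percent.", "2.1%",
               "https://example.org/report-a", date(2019, 6, 1)),
    QAInstance("What was GDP growth?", "Growth was 3.4 percent.", "3.4%",
               "https://example.org/report-b", date(2022, 6, 1)),
]
print(latest_evidence(records, "What was GDP growth?").answer)
```

Keeping the publication date on every record is what makes temporal disambiguation possible: the same indicator can carry different values in reports from different years.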
AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
[3] Embedding Retrofitting: Data Engineering for better RAG
Anantha Sharma
Main category: cs.CL
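TL;DR: This paper proposes a data engineering framework that addresses the degradation of knowledge graph quality caused by annotation artifacts (such as hashtags) in real-world corpora, and finds that preprocessing affects the outcome of embedding retrofitting far more than the choice of algorithm.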
Details
Motivation: The effectiveness of existing embedding retrofitting methods depends heavily on knowledge graph quality, which is in turn vulnerable to noise from annotation artifacts (such as hashtags) left in the text during preprocessing, so a systematic treatment of data quality is needed.
Method: Proposes a data engineering framework focused on identifying and mitigating the effect of annotation artifacts, hashtags in particular, on knowledge graph density and edge quality; compares the retrieval performance of several retrofitting methods (including EWMA) before and after cleaning.
Result: On noisy graphs, all retrofitting methods degrade significantly (-3.5% to -5.2%, p<0.05); after preprocessing, EWMA retrofitting improves by +6.2% (p=0.0348), with gains on quantitative synthesis questions reaching +33.8%; the performance swing due to preprocessing quality (over 10%) far exceeds the difference between algorithms (about 3%).
Conclusion: Preprocessing quality is the primary determinant of retrofitting success and matters far more than algorithm choice; data engineering should sit at the core of the embedding enhancement pipeline.
Abstract: Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation (-3.5% to -5.2%, p<0.05). After preprocessing, EWMA retrofitting achieves +6.2% improvement (p=0.0348) with benefits concentrated in quantitative synthesis questions (+33.8% average). The gap between clean and noisy preprocessing (10%+ swing) exceeds the gap between algorithms (3%), establishing preprocessing quality as the primary determinant of retrofitting success.
[4] MALTopic: Multi-Agent LLM Topic Modeling Framework
Yash Sharma
Main category: cs.CL
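TL;DR: This paper proposes MALTopic, a multi-agent LLM topic modeling framework that decomposes topic modeling into specialized tasks (enrichment, topic modeling, deduplication) carried out by LLM agents and combines structured and unstructured survey data, significantly improving topic coherence, diversity, and interpretability.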
Details
Motivation: Traditional topic modeling methods process only free text, ignore structured or categorical survey data, and produce abstract topics that require extensive human interpretation.
Method: Proposes the MALTopic framework with three LLM agents: an enrichment agent (uses structured data to enrich the text), a topic modeling agent (extracts latent topics), and a deduplication agent (refines the results).
Result: Comparative experiments on a survey dataset show that MALTopic significantly outperforms LDA and BERTopic in topic coherence, diversity, and interpretability.
Conclusion: By combining structured data with a multi-agent architecture, MALTopic produces more readable and more contextually relevant topics, offering a more effective solution for analyzing complex survey data.
Abstract: Topic modeling is a crucial technique for extracting latent themes from unstructured text data, particularly valuable in analyzing survey responses. However, traditional methods often consider only free-text responses and do not natively incorporate structured or categorical survey responses for topic modeling. They also produce abstract topics that require extensive human interpretation. To address these limitations, we propose the Multi-Agent LLM Topic Modeling Framework (MALTopic). This framework decomposes topic modeling into specialized tasks executed by individual LLM agents: an enrichment agent leverages structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the results. Comparative analysis on a survey dataset demonstrates that MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic. By integrating structured data and employing a multi-agent approach, MALTopic generates human-readable topics with enhanced contextual relevance, offering a more effective solution for analyzing complex survey data.
[5] Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis
Weiwei Wang,Jiyong Min,Weijie Zou
Main category: cs.CL
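TL;DR: This paper systematically studies "intelligence degradation" in large language models (LLMs) under long contexts, i.e., a sudden performance drop of more than 30% as the context approaches a critical length. It empirically identifies, for the first time on the open-source Qwen2.5-7B model, a critical threshold at 40-50% of the maximum context length, and proposes the concept of "shallow long-context adaptation" together with a unified analysis framework.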
Details
Motivation: LLMs suffer a catastrophic performance drop (>30%) when processing long contexts near a critical length, which severely limits practical use, yet the causes and quantitative behavior of this drop remain unclear.
Method: Uses natural length distribution analysis (no truncation or padding), experiments on a mixed-length dataset (1,000 samples covering 5%-95% of the context length), and five-method cross-validation to determine the critical threshold for Qwen2.5-7B, and builds a unified explanatory framework.
Result: The critical threshold for Qwen2.5-7B lies at 40-50% of the maximum context length, where F1 falls from 0.55-0.56 to 0.3 (a 45.5% drop); the degradation is shown to be driven directly by context length itself and follows the typical "shallow long-context adaptation" pattern.
Conclusion: Intelligence degradation is a key bottleneck for the long-context capability of LLMs; this work gives the first systematic characterization for Qwen-series models, and the proposed critical threshold and unified framework provide a theoretical basis and practical guidance for mitigation strategies and deployment.
Abstract: Large Language Models (LLMs) exhibit catastrophic performance degradation when processing contexts approaching certain critical thresholds, even when information remains relevant. This intelligence degradation, defined as a drop of over 30% in task performance, severely limits long-context applications. This degradation shows a common pattern: models maintain strong performance up to a critical threshold, then collapse catastrophically. We term this shallow long-context adaptation: models adapt to short and medium contexts but fail beyond critical thresholds. This paper presents three contributions: (1) Natural Length Distribution Analysis: We use each sample's natural token length without truncation or padding, providing stronger causal evidence that degradation results from context length itself. (2) Critical Threshold Determination: Through experiments on a mixed dataset (1,000 samples covering 5%-95% of context length), we identify the critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), using five-method cross-validation. (3) Unified Framework: We consolidate shallow adaptation, explaining degradation patterns and providing a foundation for mitigation strategies. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models, offering practical guidance for deploying LLMs in long-context scenarios.
[6] Can We Trust LLM Detectors?
Jivnesh Sandhan,Harshit Jaiswal,Fei Cheng,Yugo Murawaki
Main category: cs.CL
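TL;DR: This paper systematically evaluates today's dominant AI text detection methods, finds them brittle under distribution shift, unseen generators, and stylistic perturbations, and proposes a new framework based on supervised contrastive learning (SCL) that learns more discriminative style embeddings.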
Details
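Motivation: Existing AI text detectors generalize poorly in real-world settings and fail in particular under distribution shift, new generator models, and simple stylistic perturbations, so more robust detection methods are needed.
Method: Proposes a supervised contrastive learning (SCL) framework that learns discriminative style embeddings to make detectors more robust to distribution changes and stylistic perturbations.
Result: Supervised detectors perform well in-domain but degrade sharply out-of-domain; training-free methods are highly sensitive to the choice of proxy model; the SCL framework improves style robustness but still falls short of fully domain-agnostic detection.
Conclusion: Building a truly domain-agnostic AI text detector faces fundamental challenges and requires moving beyond current paradigms, with more attention to style modeling and generalization.
Abstract: The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
[7] ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation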
Zhebo Wang,Xiaohu Mu,Zijie Zhou,Mohan Li,Wenpeng Xing,Dezhang Kong,Meng Han
Main category: cs.CL
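TL;DR: This paper proposes the Illocution-Calibrated Policy Optimization (ICPO) framework, which introduces ambiguous instructions during training and designs rewards around the user's illocutionary intent. This mitigates the "lost-in-conversation" problem, in which LLMs fail to recover from early wrong assumptions in multi-turn dialogue, and significantly improves multi-turn performance while preserving single-turn performance.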
Details
Motivation: In multi-turn conversations, LLMs tend to make wrong assumptions when the user's initial instruction is ambiguous, and standard post-training methods such as RLVR make models more overconfident and suppress clarification behavior.
Method: Proposes the ICPO training framework: (1) augment the training data with underspecified prompts; (2) tie the reward signal to the user's illocutionary intent, encouraging the model to express uncertainty or ask questions when faced with ambiguity.
Result: ICPO leads the model to show more appropriate humility, with an average improvement of 75% in multi-turn conversation and robust performance retained on single-turn benchmarks.
Conclusion: ICPO offers a practical path toward more robust and collaborative conversational AI that copes better with the complexity and nuance of human interaction.
Abstract: Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.
[8] RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models
Rishit Chugh
Main category: cs.CL
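TL;DR: This paper proposes a resource-efficient adversarial prompting method that replaces compute-intensive online optimization (such as GCG) with retrieval from a database of pre-trained adversarial prompts, sharply reducing computational cost while maintaining attack success rate; it is suited to red-teaming and to the security evaluation of aligned LLMs.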
Details
Motivation: Existing gradient-based automated jailbreaking methods (such as GCG, PEZ, and GBDA) are effective but computationally expensive, which makes them impractical in resource-constrained settings; a lightweight alternative that needs no retraining is required.
Method: Builds a dataset of 1,000 prompts across seven categories of harmful intent and evaluates GCG, PEZ, and GBDA on Llama 3 8B to measure each method's effectiveness per category; retrieves previously successful adversarial prompts by semantic similarity, enabling training-free prompt matching and reuse.
Result: Finds a correlation between prompt type and the effectiveness of each attack algorithm; the proposed retrieval-based method matches GCG and similar methods in attack success rate at significantly lower computational cost.
Conclusion: The method provides a scalable, low-cost, black-box-friendly security evaluation framework for aligned LLMs, particularly in real deployments where model internals are inaccessible.
Abstract: The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient-based search. Although effective, these methods, particularly GCG, are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource-efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre-trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm-related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red-teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.
[9] No Reliable Evidence of Self-Reported Sentience in Small Large Language Models
Caspar Kaiser,Sean Enderby
Main category: cs.CL
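TL;DR: This paper queries several open-weights language models about their own consciousness and verifies the answers with classifiers trained on internal activations. The models generally deny being conscious, and the classifiers find no sign that these denials are untruthful; within the Qwen family, larger models deny consciousness more confidently than smaller ones.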
Details
Motivation: To examine whether language models believe themselves to be conscious, rather than whether they actually are, since the latter cannot currently be tested empirically.
Method: Uses three model families (Qwen, Llama, GPT-OSS; 0.6B-70B parameters), poses about 50 questions on consciousness and subjective experience, and applies three classification methods from the interpretability literature, training classifiers on internal activations to detect the models' underlying beliefs.
Result: (1) Models consistently deny being conscious while attributing consciousness to humans; (2) classifiers based on internal activations find no evidence that the denials are false; (3) within the Qwen family, larger models deny consciousness with higher confidence.
Conclusion: There is no evidence that current mainstream open-weights language models "believe" they are conscious; their denials reflect their actual internal state rather than strategic lying.
Abstract: Whether language models possess sentience has no empirical answer. But whether they believe themselves to be sentient can, in principle, be tested. We do so by querying several open-weights models about their own consciousness, and then verifying their responses using classifiers trained on internal activations. We draw upon three model families (Qwen, Llama, GPT-OSS) ranging from 0.6 billion to 70 billion parameters, approximately 50 questions about consciousness and subjective experience, and three classification methods from the interpretability literature. First, we find that models consistently deny being sentient: they attribute consciousness to humans but not to themselves. Second, classifiers trained to detect underlying beliefs, rather than mere outputs, provide no clear evidence that these denials are untruthful. Third, within the Qwen family, larger models deny sentience more confidently than smaller ones. These findings contrast with recent work suggesting that models harbour latent beliefs in their own consciousness.
[10] From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs
Angelina Parfenova,David Graus,Juergen Pfeffer
Main category: cs.CL
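TL;DR: This paper proposes a new method for performing axial coding with large language models (LLMs), in which open codes are either clustered or grouped directly by an LLM into higher-order categories. The method is applied to Dutch parliamentary debate transcripts, and the trade-offs between the two strategies are evaluated along several dimensions, including coverage, alignment, and brevity.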
Details
Motivation: Traditional axial coding is manual, slow, and labor-intensive; this work aims to automate it with LLMs to make qualitative analysis more efficient and scalable, especially for the structured understanding of long debate transcripts.
Method: Proposes two axial coding strategies: (i) cluster embeddings of code-utterance pairs with density-based and partitioning algorithms, then have an LLM label the categories; (ii) have an LLM group open codes and utterances directly. Both are applied to Dutch parliamentary debate data and evaluated with extrinsic metrics (ROUGE-L, cosine, BERTScore) and intrinsic metrics (coverage, brevity, coherence, novelty, JSD).
Result: Density-based clustering gives high coverage and strong separation between clusters; direct LLM grouping is more concise and gives finer-grained semantic alignment, but with 20% lower coverage; the two approaches show a clear trade-off between coverage and interpretability.
Conclusion: LLMs can implement axial coding effectively; each strategy has its own strengths and should be chosen according to the goal of the task (coverage versus interpretability). The authors release all data to support reproducible research.
Abstract: Axial coding is a commonly used qualitative analysis method that enhances document understanding by organizing sentence-level open codes into broader categories. In this paper, we operationalize axial coding with large language models (LLMs). Extending an ensemble-based open coding approach with an LLM moderator, we add an axial coding step that groups open codes into higher-order categories, transforming raw debate transcripts into concise, hierarchical representations. We compare two strategies: (i) clustering embeddings of code-utterance pairs using density-based and partitioning algorithms followed by LLM labeling, and (ii) direct LLM-based grouping of codes and utterances into categories. We apply our method to Dutch parliamentary debates, converting lengthy transcripts into compact, hierarchically structured codes and categories. We evaluate our method using extrinsic metrics aligned with human-assigned topic labels (ROUGE-L, cosine, BERTScore), and intrinsic metrics describing code groups (coverage, brevity, coherence, novelty, JSD divergence). Our results reveal a trade-off: density-based clustering achieves high coverage and strong cluster alignment, while direct LLM grouping results in higher fine-grained alignment, but 20% lower coverage. Overall, clustering maximizes coverage and structural separation, whereas LLM grouping produces more concise, interpretable, and semantically aligned categories.
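Two of the intrinsic metrics named above, coverage and Jensen-Shannon divergence (JSD), have simple standard definitions, sketched below. How the paper applies them is not stated here, so the inputs are assumptions: JSD is computed between two hypothetical category-size distributions, and coverage is taken as the share of open codes that received any category. All function names and sample counts are illustrative.

```python
import math
from collections import Counter


def jensen_shannon_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """JSD in bits between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict[str, float]) -> float:
        # KL(a || m); terms with zero probability under a contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def normalize(counts: Counter) -> dict[str, float]:
    """Turn raw category counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def coverage(assignments: list) -> float:
    """Fraction of open codes placed in some category (None means unassigned)."""
    return sum(a is not None for a in assignments) / len(assignments)


# Invented category sizes for two grouping strategies.
by_clustering = normalize(Counter({"economy": 6, "health": 3, "housing": 1}))
by_llm = normalize(Counter({"economy": 4, "health": 4, "climate": 2}))
print(round(jensen_shannon_divergence(by_clustering, by_llm), 3))
print(coverage(["economy", None, "health", "economy"]))  # 0.75
```

Using base-2 logarithms bounds the divergence between 0 and 1, which makes values comparable across category sets of different sizes.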
To support future research, we publicly release the full dataset of utterances and codes, enabling reproducibility and comparative studies.
[11] Memorization Dynamics in Knowledge Distillation for Language Models
Jaydeep Borkar,Karan Chadha,Niloofar Mireshghallah,Yuchen Zhang,Irina-Elena Veliche,Archi Mitra,David A. Smith,Zheng Xu,Diego Garcia-Olano
Main category: cs.CL
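TL;DR: This paper studies training data memorization under knowledge distillation (KD) for large language models. Distilled models memorize significantly less than standard fine-tuning (a reduction of more than 50%), a subset of examples accounts for most of the memorization, and memorization can be predicted from features such as zlib entropy, KL divergence, and perplexity. Hard distillation inherits more teacher-specific examples than soft distillation and therefore carries higher risk.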
Details
Motivation: Training data memorization has been studied extensively for pre-training and fine-tuning, but its dynamics in the knowledge distillation pipeline remain unclear and need systematic analysis, especially from a privacy perspective.
Method: Using three LLM families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2), quantifies training data memorization across the KD pipeline, predicts student memorization from features such as zlib entropy, KL divergence, and perplexity, and compares memorization under soft and hard distillation.
Result: (1) Distilled models memorize over 50% less than standard fine-tuning; (2) about 95% of memorization is accounted for by a small set of easily memorized examples; (3) memorization can be reliably predicted before distillation from unsupervised features; (4) hard distillation inherits 2.7 times as many teacher-specific examples as soft distillation.
Conclusion: Knowledge distillation not only improves generalization but also reduces the risk of training data memorization, making it a model compression paradigm with both efficiency and privacy benefits; however, the distillation strategy (soft or hard) significantly affects privacy risk and should be chosen with care.
Abstract: Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over ~95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits 2.7× more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
[12] Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind
Tamunotonye Harry,Ivoline Ngong,Chima Nweke,Yuanyuan Feng,Joseph Near
Main category: cs.CL
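TL;DR: This paper introduces the Chameleon dataset, the first to systematically capture the joint influence of "state" and "trait" in user interactions with language models, and reveals the limitations of current LLMs and reward models in modeling user state.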
Details
Motivation: Existing persona datasets (such as PersonaChat and PANDORA) model only static user traits and ignore the dynamic psychological state induced by the interaction context, which leaves dialogue systems insensitive to context.
Method: Builds the Chameleon dataset of 5,001 cross-context psychological profiles from 1,667 Reddit users; performs a variance decomposition based on Latent State-Trait theory; empirically evaluates how well mainstream LLMs and reward models perceive user state.
Result: 74% of psychological variance comes from within-person state changes and only 26% from between-person trait differences; LLMs are insensitive to state; different reward models react to the same user state in inconsistent directions.
Conclusion: User state is a key factor in human-AI interaction, and current models are broadly state-blind; Chameleon provides a new benchmark and resource for research on affective computing, personalized dialogue, and RLHF alignment.
Abstract: User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (such as PersonaChat and PANDORA) capture only trait, and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person (state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
[13] Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs
Sydney Anuyah,Mehedi Mahmud Kaushik,Hao Dai,Rakesh Shiradkar,Arjan Durresi,Sunandan Chakraborty
Main category: cs.CL
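TL;DR: This paper studies how domain knowledge graphs (KGs) can improve Retrieval-Augmented Generation (RAG) in healthcare. The authors build three disease-related knowledge graphs from PubMed, design two probe tasks, and systematically evaluate how graph combination, model size, and decoding temperature affect LLM reasoning accuracy. Results: a knowledge graph works best when its scope precisely matches the task; indiscriminately merging graphs introduces distractors; large models have strong priors of their own, while small and mid-sized models depend more on precise retrieval; temperature has little effect.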
Details
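Motivation: Large Language Models (LLMs) generate fluent text but still fall short on trustworthy, domain-specific reasoning; existing RAG methods often merge knowledge from multiple sources indiscriminately without considering how well the knowledge scope matches the task, and reasoning reliability needs to improve, particularly in high-stakes healthcare.
Method: Builds three PubMed-derived disease-specific knowledge graphs (T2DM, Alzheimer's disease, and the two combined) and designs two probe tasks with clearly defined semantic scope (Probe 1: merged knowledge; Probe 2: intersection knowledge); compares six retrieval sources (including a No-RAG baseline) and three temperature settings on seven instruction-tuned LLMs in a controlled quantitative evaluation.
Result: Scope-matched KG-RAG (e.g., G2 for Probe 2) gives the most consistent accuracy gains; indiscriminate graph unions (e.g., G1+G2) introduce distractors and reduce performance; on Probe 1, large models often match or beat KG-RAG without retrieval, reflecting strong parametric priors; small and mid-sized models benefit clearly from precise graph retrieval; temperature has only a minor effect.
Conclusion: KG-RAG should follow a precision-first, scope-matched approach and avoid breadth-first stacking of graphs; in practice, graphs should be selected according to the semantic boundary of the task, model size should be weighed against retrieval gains, and the retrieval and reranking pipeline should be tuned.
Abstract: Large Language Models (LLMs) generate fluent answers but can struggle with trustworthy, domain-specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval-Augmented Generation (RAG) for healthcare by constructing three PubMed-derived graphs: $\mathbb{G}_1$ (T2DM), $\mathbb{G}_2$ (Alzheimer's disease), and $\mathbb{G}_3$ (AD+T2DM). We design two probes: Probe 1 targets merged AD+T2DM knowledge, while Probe 2 targets the intersection of $\mathbb{G}_1$ and $\mathbb{G}_2$. Seven instruction-tuned LLMs are tested across retrieval sources {No-RAG, $\mathbb{G}_1$, $\mathbb{G}_2$, $\mathbb{G}_1$+$\mathbb{G}_2$, $\mathbb{G}_3$, $\mathbb{G}_1$+$\mathbb{G}_2$+$\mathbb{G}_3$} and three decoding temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope-matched retrieval (notably $\mathbb{G}_2$) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG-RAG with a No-RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid-sized models benefit more from well-scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision-first, scope-matched KG-RAG is preferable to breadth-first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking.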
Code and data are available at https://github.com/sydneyanuyah/RAGComparison
[14] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering
Anuj Maharjan,Umesh Yadav
Main category: cs.CL
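TL;DR: This paper evaluates how well Retrieval-Augmented Generation (RAG) architectures mitigate LLM hallucination in the public health policy domain. Advanced RAG (with cross-encoder re-ranking) significantly improves output faithfulness (0.797) over Basic RAG (0.621) and a plain LLM baseline (0.347), but the document chunking strategy remains a bottleneck for multi-step reasoning.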
Details
Motivation: In high-stakes public health policy settings, LLMs are prone to factual errors (hallucinations) that threaten information reliability, so dependable techniques are needed to make their use safe.
Method: Using Mistral-7B-Instruct-v0.2 and the all-MiniLM-L6-v2 embedding model, compares three architectures: Vanilla LLM, Basic RAG, and Advanced RAG (with cross-encoder re-ranking); tests two chunking strategies (recursive character-based splitting versus token-based semantic splitting) and evaluates question answering over CDC policy documents by faithfulness and relevance.
Result: Advanced RAG achieves the highest mean faithfulness, 0.797, well above Basic RAG (0.621) and Vanilla LLM (0.347); semantic chunking and two-stage retrieval are key to the precision gains, but current chunking approaches still limit multi-step reasoning.
Conclusion: Advanced RAG is an effective way to improve the factual accuracy of LLMs in policy question answering, but document structuring needs further work to support complex reasoning tasks.
Abstract: The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797.
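The two-stage "retrieve, then re-rank" pattern compared above can be illustrated with a toy sketch. This is not the study's pipeline: the paper uses all-MiniLM-L6-v2 embeddings and a cross-encoder, whereas the code below substitutes bag-of-words cosine similarity for the first stage and a query-bigram overlap score for the re-ranker, purely so that it runs with the standard library. All function names and sample chunks are assumptions.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def first_stage(query: str, chunks: list[str], k: int) -> list[str]:
    """Cheap recall step: rank every chunk by bag-of-words cosine (stand-in for a bi-encoder)."""
    q = Counter(tokens(query))
    return sorted(chunks, key=lambda c: cosine(q, Counter(tokens(c))), reverse=True)[:k]


def rerank_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: share of query bigrams that appear in the chunk."""
    def bigrams(ts: list[str]) -> set:
        return set(zip(ts, ts[1:]))
    qb = bigrams(tokens(query))
    return len(qb & bigrams(tokens(chunk))) / len(qb) if qb else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3, top_n: int = 1) -> list[str]:
    """Stage 1 keeps k candidates; stage 2 re-scores only those and keeps top_n."""
    candidates = first_stage(query, chunks, k)
    return sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)[:top_n]


# Invented policy-style chunks for illustration.
chunks = [
    "After exposure, isolation rules set a period.",
    "The isolation period after exposure is set by the local health guidance.",
    "Vaccination schedules are reviewed annually.",
]
query = "isolation period after exposure"
print(first_stage(query, chunks, k=2)[0])   # first stage prefers the short, word-dense chunk
print(retrieve(query, chunks, k=2)[0])      # re-ranking promotes the chunk with matching phrasing
```

The design point is that the expensive scorer only sees the k candidates that survive the cheap first stage, so its cost does not grow with corpus size.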
These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
[15] Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts
Sydney Anuyah,Sneha Shajee-Mohan,Ankit-Singh Chauhan,Sunandan Chakraborty
Main category: cs.CL
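TL;DR: This paper evaluates 13 open-source large language models on pairwise causal discovery (PCD) from text, covering two tasks: causal detection and causal extraction. Existing models perform poorly, especially in complex cases such as implicit relations, links spanning multiple sentences, and multiple causal pairs; the authors release their data, code, and prompts to support follow-up research.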
Details
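Motivation: Deploying large language models (LLMs) safely in high-stakes fields such as biomedicine requires reliable causal reasoning, yet there is no benchmark or framework that systematically evaluates LLMs on causal discovery from text.
Method: Builds a pairwise causal discovery (PCD) benchmark from 12 diverse datasets, and defines and evaluates two core skills: causal detection (deciding whether a text contains a causal relation) and causal extraction (identifying the exact cause and effect phrases); tests several prompting strategies, including zero-shot, Chain-of-Thought (CoT), and Few-shot In-Context Learning (FICL), within a unified evaluation framework built on annotations with high inter-annotator agreement (κ≥0.758).
Result: All 13 open-source LLMs perform poorly on both tasks: the best detection model, DeepSeek-R1-Distill-Llama-70B, reaches a mean score of only 49.57%, and the best extraction model, Qwen2.5-Coder-32B-Instruct, only 47.12%; models do acceptably on explicit single-sentence relations, but performance drops sharply on realistic, complex cases such as implicit relations, cross-sentence links, and multiple causal pairs.
Conclusion: Current open-source LLMs have serious deficiencies in causal discovery from text, and their causal reasoning needs targeted improvement; this work provides the first unified, reproducible, high-quality PCD evaluation benchmark with open resources, laying groundwork for future research on causal language models.
Abstract: The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying if a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research.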
Code available here: https://github.com/sydneyanuyah/CausalDiscovery
[16] Multi-Persona Thinking for Bias Mitigation in Large Language Models
Yuxing Chen,Guoqing Luo,Zijun Wu,Lili Mou
Main category: cs.CL
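TL;DR: This paper proposes the Multi-Persona Thinking (MPT) framework, which mitigates social bias in large language models at inference time through dialectical reasoning from multiple perspectives.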
Details
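Motivation: Large language models exhibit significant social biases that can reinforce stereotypes and unfair outcomes, so effective inference-time debiasing methods that need no fine-tuning are required.
Method: Designs the Multi-Persona Thinking (MPT) framework, which guides the model to adopt contrasting social identities (e.g., male and female) together with a neutral viewpoint, and exposes and corrects biases through iterative dialectical reasoning.
Result: Evaluated on two widely used bias benchmarks across open-source and closed-source models of varying scales, MPT substantially outperforms existing prompting-based debiasing methods, reducing bias while preserving core reasoning ability.
Conclusion: MPT turns the potential weakness of persona assignment into a strength for debiasing, and is an efficient, general, training-free approach to inference-time bias mitigation.
Abstract: Large Language Models (LLMs) exhibit significant social biases that can perpetuate harmful stereotypes and unfair outcomes. In this paper, we propose Multi-Persona Thinking (MPT), a novel inference-time framework that leverages dialectical reasoning from multiple perspectives to reduce bias. MPT guides models to adopt contrasting social identities (e.g., male and female) along with a neutral viewpoint, and then engages these personas iteratively to expose and correct biases. Through a dialectical reasoning process, the framework transforms the potential weakness of persona assignment into a strength for bias mitigation. We evaluate MPT on two widely used bias benchmarks across both open-source and closed-source models of varying scales. Our results demonstrate substantial improvements over existing prompting-based strategies: MPT achieves the lowest bias while maintaining core reasoning ability.
[17] ViT Registers and Fractal ViT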
Jason Chuan-Chih Chou,Abhinav Kumar,Shivank Garg
Main category: cs.CL
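TL;DR: This paper proposes a ViT variant called fractal ViT, which breaks permutation invariance among tokens by applying an attention mask between regular tokens and "summary tokens", and tests it with different positional encodings. It does not outperform ViT with registers, suggesting that the findings that inspired it may be specific to scale, domain, or task.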
Details
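Motivation: Inspired by recent findings that transformers without positional encoding (NoPE) still perform reasonably in language modeling and that registers (extra tokens not tied to the input) can improve vision transformers (ViTs), the authors explore how to break token permutation invariance effectively in ViT.
Method: Designs fractal ViT, which introduces register-like "summary tokens" and applies an attention mask governing their interaction with regular tokens; experiments are run with the mask alone and in combination with various positional encodings.
Result: Fractal ViT does not outperform the standard ViT with registers, indicating that the effects of NoPE or registers may depend on model scale, task domain, or application.
Conclusion: New structures that break token permutation invariance, such as fractal ViT, do not necessarily bring performance gains; such strategies should be evaluated carefully against the specific task and model scale rather than transferred directly.
Abstract: Drawing inspiration from recent findings including surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and how registers (additional throwaway tokens not tied to input) may improve the performance of large vision transformers (ViTs), we invent and test a variant of ViT called fractal ViT that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and "summary tokens" similar to registers, in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, highlighting the fact that these findings may be scale, domain, or application-specific.
[18] Computational Representations of Character Significance in Novels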
Haaris Mian,Melanie Subbiah,Sharon Marcus,Nora Shaalan,Kathleen McKeown
Main category: cs.CL
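TL;DR: This paper adopts a six-component structural model of character derived from a new literary theory, emphasizing elements neglected by earlier methods such as the narrator-character distinction and discussion by other characters. Using general-purpose LLMs and task-specific transformers on 19th-century British realist novels, it produces component-level and graph representations of character discussion and uses them to examine literary questions at scale, including character centrality and gendered discussion.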
Details
Motivation: Prior character modeling in novels relies heavily on presence in scenes, which favors the main character and overlooks key dimensions such as the narrator-character distinction and discussion by other characters; the paper aims for a more comprehensive, theory-grounded representation of character. Method: A six-component structural model of character from a new literary theory is operationalized on major 19th-century British realist novels, comparing general-purpose LLMs with task-specific transformers; the outputs are component-level annotations and graphs of character discussion. Result: The work yields operational component-level and graph representations of character discussion and shows that they support large-scale analysis of Woloch's "the one vs the many" theory of character centrality and of gendered patterns of discussion. Conclusion: The six-component model and its computational implementation broaden the theoretical and methodological scope of character analysis in digital humanities and highlight the importance of discourse among characters. Abstract: Characters in novels have typically been modeled based on their presence in scenes in narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods, discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens. Specifically, we explore Woloch's classic "the one vs the many" theory of character centrality and the gendered dynamics of character discussion.
[19] AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains
Adam Szelestey,Sofie van Engelen,Tianhao Huang,Justin Snelders,Qintao Zeng,Songgaojun Deng
Main category: cs.CL
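TL;DR: This paper introduces AdversaRiskQA, a benchmark for systematically evaluating the factual robustness of LLMs in health, finance, and law when prompts contain deliberately injected misinformation framed with varying levels of confidence. It also proposes automated methods for evaluating attack success and long-form factuality. Experiments show that performance scales non-linearly with model size and varies across domains, and that injected misinformation has no significant correlation with long-form factuality.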
Details
Motivation: Existing work lacks high-quality, domain-specific resources for assessing LLM robustness to adversarial misinformation (deliberately misleading prompts expressed with varying levels of confidence), and no prior study has examined how such injected misinformation affects long-form factuality. Method: The authors build AdversaRiskQA, the first verified and reliable adversarial factuality benchmark, covering three high-risk domains (Health, Finance, and Law) at two difficulty levels; they propose two automated methods, one for adversarial attack success and one for long-form factuality; they evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates, and assess long-form factuality on Qwen3 (30B) under baseline and adversarial conditions. Result: After excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy; performance scales non-linearly with model size, varies by domain, and the gap between difficulty levels narrows as models grow; long-form evaluation shows no significant correlation between injected misinformation and the factuality of model output. Conclusion: AdversaRiskQA provides a reliable tool for pinpointing factuality weaknesses of LLMs in high-risk settings and supports the development and deployment of more trustworthy models. Abstract: Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating the adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions.
Results show that after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domain, and gaps between difficulty levels narrow as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model's factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.
[20] Common to Whom? Regional Cultural Commonsense and LLM Bias in India
Sangmitra Madhusudan,Trush Shashank More,Steph Buongiorno,Renata Dividino,Jad Kabbara,Ali Emami
Main category: cs.CL
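TL;DR: This paper introduces Indica, the first benchmark for evaluating LLMs' understanding of sub-national cultural commonsense in India. It finds that cultural commonsense is predominantly regional rather than national, and that current LLMs show low accuracy and marked geographic bias.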
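Note on the geographic-bias figure: this entry reports that some regions are selected 30-40% more often than expected. A minimal sketch of how such an over-selection ratio could be computed is shown here. The uniform baseline over the five regions and the function name `over_selection` are assumptions made for illustration; the paper's actual expected-frequency baseline is not stated in this digest.

```python
from collections import Counter
from typing import Dict, Iterable

# The five regions named in the entry.
REGIONS = ("North", "South", "East", "West", "Central")


def over_selection(picks: Iterable[str]) -> Dict[str, float]:
    """Relative deviation of each region's pick rate from a uniform baseline.

    A value of +0.30 means the region was chosen 30% more often than expected.
    The uniform baseline is an assumption of this sketch, not the paper's.
    """
    counts = Counter(picks)
    total = sum(counts[r] for r in REGIONS)
    if total == 0:
        raise ValueError("no picks for known regions")
    expected = total / len(REGIONS)  # expected picks per region if unbiased
    return {r: counts[r] / expected - 1.0 for r in REGIONS}


# Made-up pick counts, chosen only to illustrate the arithmetic.
demo_picks = (["Central"] * 26 + ["North"] * 26 + ["South"] * 20
              + ["East"] * 14 + ["West"] * 14)
demo_bias = over_selection(demo_picks)
```

With these invented counts, Central and North come out near +0.30 and East and West near -0.30, which is the shape of bias the entry describes.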
Details
Motivation: Existing cultural commonsense benchmarks treat nations as monolithic and overlook sub-national cultural diversity; this work asks whether cultural commonsense varies across regions within a nation and evaluates how well LLMs handle such variation. Method: The authors build Indica, the first cultural commonsense benchmark focused on India's sub-national regions (North, South, East, West, and Central), with 515 questions spanning 8 domains of everyday life and 1,630 human-annotated region-specific question-answer pairs; they evaluate eight state-of-the-art LLMs on region-specific accuracy and quantify geographic selection bias. Result: Only 39.4% of questions elicit agreement across all five regions; LLMs reach only 13.4%-20.9% accuracy on region-specific questions and over-select Central and North India (30-40% more often than expected) while under-representing East and West; the methodology generalizes to other culturally heterogeneous nations. Conclusion: Cultural commonsense is strongly regional, and current LLMs both lack regional cultural knowledge and exhibit systematic geographic bias; Indica offers a benchmark and an extensible methodology for evaluating and improving cultural adaptability. Abstract: Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test LLMs' ability to address this question, focusing on India - a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps: models achieve only 13.4%-20.9% accuracy on region-specific questions, and they exhibit geographic bias, over-selecting Central and North India as the "default" (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, from question design grounded in anthropological taxonomy, to regional data collection, to bias measurement.
[21] From Generation to Collaboration: Using LLMs to Edit for Empathy in Healthcare
Man Luo,Bahareh Harandizadeh,Amara Tariq,Halim Abbas,Umar Ghaffar,Christopher J Warren,Segun O. Kolade,Haidar M. Abdul-Muhsin
Main category: cs.CL
TL;DR: This paper explores using large language models (LLMs) as 'empathy editors' to enhance the empathetic tone of physicians' written responses without compromising medical accuracy, introducing new metrics—Empathy Ranking Score and MedFactChecking Score—to evaluate emotional and factual quality. Results show LLM editing improves perceived empathy while maintaining factual fidelity, supporting a human-in-the-loop approach for AI-assisted healthcare communication.
Details
Motivation: Physicians must balance emotional warmth and factual precision under cognitive and emotional constraints; there is a need for AI tools that augment, rather than replace, human empathy in clinical communication. Method: The study proposes using LLMs as editors of physician-written responses, introduces two novel quantitative metrics, the Empathy Ranking Score (for emotional quality) and the MedFactChecking Score (for factual accuracy), and evaluates LLM-edited vs. fully LLM-generated outputs. Result: LLM-edited responses significantly increase perceived empathy while preserving factual accuracy compared to fully LLM-generated outputs. Conclusion: Using LLMs as editorial assistants rather than autonomous generators provides a safer and more effective approach to empathetic and trustworthy AI-assisted healthcare communication. Abstract: Clinical empathy is essential for patient care, but physicians must continually balance emotional warmth with factual precision under the cognitive and emotional constraints of clinical practice. This study investigates how large language models (LLMs) can function as empathy editors, refining physicians' written responses to enhance empathetic tone while preserving underlying medical information. More importantly, we introduce novel quantitative metrics, an Empathy Ranking Score and a MedFactChecking Score, to systematically assess both the emotional and factual quality of the responses. Experimental results show that LLM-edited responses significantly increase perceived empathy while preserving factual accuracy compared with fully LLM-generated outputs. These findings suggest that using LLMs as editorial assistants, rather than autonomous generators, offers a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare communication.
[22] YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models
Junyu Lin,Meizhen Liu,Xiufeng Huang,Jinfeng Li,Haiwen Hong,Xiaohan Yuan,Yuefeng Chen,Longtao Huang,Hui Xue,Ranjie Duan,Zhikai Chen,Yuchuan Fu,Defeng Li,Lingyao Gao,Yitong Yang
Main category: cs.CL
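TL;DR: YuFeng-XGuard is a reasoning-centric family of guardrail models that provides fine-grained, interpretable, and configurable LLM safety protection through structured risk prediction, natural language explanations, and tiered inference.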
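Note on policy decoupling: this entry describes structured risk predictions (category, confidence, explanation) and a policy mechanism kept separate from risk perception. The sketch below illustrates that separation under stated assumptions: the `RiskPrediction` fields, the `enforce` function, the per-category threshold policy, and the category name are all hypothetical and do not come from the YuFeng-XGuard release.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RiskPrediction:
    """Hypothetical structured guardrail output (not the released schema)."""
    category: str          # risk category label; names here are made up
    confidence: float      # score in [0, 1]
    explanation: str = ""  # filled in only when an explanation is requested


def enforce(pred: RiskPrediction, policy: Dict[str, float],
            default_threshold: float = 0.5) -> str:
    """Apply a policy (per-category blocking thresholds) to a prediction.

    The prediction is produced once; only the policy changes between calls,
    so adjusting a threshold requires no change to the model.
    """
    threshold = policy.get(pred.category, default_threshold)
    return "block" if pred.confidence >= threshold else "allow"


pred = RiskPrediction(category="medical_advice", confidence=0.6)
lenient = enforce(pred, {"medical_advice": 0.8})  # higher bar for blocking
strict = enforce(pred, {"medical_advice": 0.4})   # lower bar for blocking
```

The same prediction yields different decisions under the two policies, which is the property the entry attributes to decoupling perception from enforcement.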
Details
Motivation: Existing LLM guardrail approaches mostly rely on coarse-grained filtering or post-hoc rules, and struggle to balance transparency, flexibility, and efficiency. Method: The reasoning-centric YuFeng-XGuard model family produces structured risk output (category, confidence score, and natural language explanation), uses a tiered inference paradigm (a fast decision from the first decoded token, with explanation on demand), and adds a dynamic policy mechanism that decouples risk perception from policy enforcement. Result: The models achieve state-of-the-art performance on multiple public safety benchmarks with strong efficiency-efficacy trade-offs; both a full-capacity and a lightweight variant are released as open models. Conclusion: YuFeng-XGuard offers a more fine-grained, interpretable, and adaptable approach to LLM safety, moving guardrails from black-box judgments toward transparent, controllable, reasoning-driven decisions. Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs.
We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
[23] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow
Yangyang Zhong,Yanmei Gu,Zhengqing Zang,Xiaomeng Li,Yuqi Ding,Xibei Jia,Yuting Shen,Zhenzhong Lan,Liwang Zhu,Weiping Liu,Junlin Zhou,Haisheng Liu,Zhong Xin Yu,Pengxin Luo,Donglian Qi,Yunfeng Yan,Junbo Zhao
Main category: cs.CL
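TL;DR: This paper studies the parallel generation and decoding-order behavior of masked diffusion language models (MDLMs). Current MDLMs still lag behind autoregressive models, mainly because parallel modeling weakens inter-token dependencies, but they exhibit task-adaptive decoding behavior; the paper proposes a Generate-then-Edit paradigm to balance efficiency and dependency modeling.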
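Note on measuring generation order: the abstract of this entry names Kendall's tau as the statistic used to characterize decoding order. The sketch below shows one way such a measurement could look, comparing the step at which each position is finalized against plain left-to-right order. How the paper actually applies the statistic is not described in this digest; the input encoding and the handling of ties (tau-a, where tied pairs count as neither concordant nor discordant) are assumptions.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(finalize_step: Sequence[int]) -> float:
    """Kendall's tau-a between a decoding order and left-to-right order.

    finalize_step[i] is the decoding step at which position i was finalized.
    Positions finalized in the same step (parallel decoding) form tied pairs,
    which this sketch counts as neither concordant nor discordant.
    """
    n = len(finalize_step)
    if n < 2:
        raise ValueError("need at least two positions")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):  # i is always left of j
        if finalize_step[i] < finalize_step[j]:
            concordant += 1   # left position decoded first
        elif finalize_step[i] > finalize_step[j]:
            discordant += 1   # right position decoded first
    return (concordant - discordant) / (n * (n - 1) / 2)


left_to_right = kendall_tau([0, 1, 2, 3])   # strictly autoregressive order
right_to_left = kendall_tau([3, 2, 1, 0])   # fully reversed order
one_swap = kendall_tau([0, 2, 1, 3])        # one adjacent pair out of order
```

A value near 1 indicates left-to-right decoding, a value near -1 indicates reversed decoding, and values near 0 indicate an order unrelated to position.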
Details
Motivation: To determine whether current masked diffusion language models (MDLMs) truly realize the parallel generation and arbitrary-order decoding they promise, and to understand their actual behavior and performance bottlenecks. Method: MDLM behavior is characterized along two dimensions, parallelism strength and generation order, using Average Finalization Parallelism (AFP) and Kendall's tau; eight mainstream MDLMs (up to 100B parameters) are evaluated on 58 benchmarks spanning knowledge, reasoning, and programming; empirical analysis and theoretical motivation support a Generate-then-Edit paradigm. Result: MDLMs still underperform comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies; their parallelism and generation order vary with task domain, reasoning stage, and output correctness; on tasks that need "backward information" (e.g., Sudoku), they tend to fill easier blanks first. Conclusion: MDLMs have not yet realized their architectural potential but show task-adaptive decoding; the Generate-then-Edit paradigm mitigates dependency loss while retaining the efficiency of parallel decoding. Abstract: Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions -- parallelism strength and generation order -- using Average Finalization Parallelism (AFP) and Kendall's tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require "backward information" (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.
[24] ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms
Baktash Ansari,Shiza Ali,Elias Martin,Maryna Sivachenko,Afra Mashhadi
Main category: cs.CL
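TL;DR: This paper presents ToxiTwitch, a hybrid toxicity detection model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers (such as Random Forest and SVM). In Twitch chat it reaches up to 80% accuracy and a 76% F1-score under channel-specific training, a 13% improvement over BERT, indicating that emote information helps toxicity detection.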
Details
Motivation: Chat on live-streaming platforms such as Twitch is fast-paced, high-volume, and context-rich; human moderation and keyword-based filtering are hard to scale, and moderators themselves face harassment, so more robust and scalable toxicity detection is needed. Method: ToxiTwitch is a hybrid model that uses LLMs such as DeepSeek-R1-Distill and Llama-3-8B-Instruct to generate embeddings of text and emotes and combines them with traditional classifiers, including Random Forest and SVM; it is trained and evaluated on channel-specific data. Result: Under channel-specific training, ToxiTwitch reaches up to 80% accuracy (a 13% improvement over BERT) and a 76% F1-score; incorporating emotes improves the detection of toxic behavior. Conclusion: Emote-aware hybrid modeling is a promising route to better toxicity detection on Twitch, but the work is exploratory and is intended to surface the challenges and limits of the approach. Abstract: The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.
[25] Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation
Zhiyao Ren,Yibing Zhan,Siyuan Liang,Guozheng Ma,Baosheng Yu,Dacheng Tao
Main category: cs.CL
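TL;DR: This paper proposes the first benchmark for assessing LLM confidence in multi-turn realistic medical consultations and, building on it, the MedConf framework. MedConf constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them through weighted integration into an interpretable confidence estimate that outperforms existing methods across metrics and settings.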
Details
Motivation: Existing studies mostly evaluate the confidence of LLM clinical judgments in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates, which limits support for reliable decision-making. Method: The authors build the first benchmark for confidence assessment in multi-turn medical consultation, unifying three types of medical data and introducing an information sufficiency gradient; they propose MedConf, which constructs symptom profiles via retrieval-augmented generation, identifies supporting, missing, and contradictory relations, and aggregates them by weighted integration into a confidence estimate. Result: Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on AUROC and Pearson correlation coefficient, and remains stable under information insufficiency and multimorbidity. Conclusion: Information adequacy is a key determinant of credible medical confidence modeling, and MedConf offers a path toward more reliable and interpretable large medical models. Abstract: Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity.
These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.
[26] What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking
Raymond Xiong,Furong Jia,Lionel Wong,Monica Agrawal
Main category: cs.CL
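TL;DR: This paper builds a medical question-answering dataset from questions people commonly ask and shows that current LLMs struggle to identify incorrect assumptions in everyday questions.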
Details
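Motivation: Benchmarks for medical LLMs mostly rely on medical exam questions, which differ significantly in style and content from the questions patients actually ask, so benchmarks reflecting real-world use are lacking. Method: Questions are sourced from Google's People Also Ask feature by querying the top 200 prescribed medications in the United States, curated into a dataset, and analyzed for when incorrect assumptions and dangerous intentions appear. Result: A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions; their emergence is not uniformly random and depends heavily on the degree of incorrectness in the preceding question history; LLMs that perform strongly on other benchmarks struggle to identify such incorrect assumptions. Conclusion: Current LLMs have a key capability gap in answering real patient questions, calling for benchmarks and model improvements closer to real-world practice. Abstract: Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life. To bridge this gap, we sourced data from Google's People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in everyday questions.
[27] Persona Switch: Mixing Distinct Perspectives in Decoding Time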
Junseok Kim,Nakyeong Yang,Kyomin Jung
Main category: cs.CL
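TL;DR: This paper proposes Persona Switch, a decoding method that at each step compares the output confidence of zero-shot prompting and role-play prompting, measured by the logit gap, and selects the better output, improving zero-shot reasoning performance.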
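Note on the selection rule: this entry describes choosing, at each step, between the zero-shot and role-play outputs by comparing their logit gaps. A minimal sketch of that comparison follows. It assumes the logit gap is the difference between the largest and second-largest logit and that a step is a single next-token decision; neither detail is confirmed by this digest, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def logit_gap(logits: Sequence[float]) -> float:
    """Difference between the top two logits (assumed confidence proxy)."""
    top_two = sorted(logits, reverse=True)[:2]
    return top_two[0] - top_two[1]


def choose_step(zero_shot_logits: Sequence[float],
                role_play_logits: Sequence[float]) -> Tuple[str, int]:
    """Pick the prompt whose distribution has the larger logit gap.

    Returns the chosen source label and the index of its top-scoring token.
    Ties go to the zero-shot prompt, an arbitrary choice for this sketch.
    """
    if logit_gap(role_play_logits) > logit_gap(zero_shot_logits):
        source, logits = "role_play", role_play_logits
    else:
        source, logits = "zero_shot", zero_shot_logits
    best_token = max(range(len(logits)), key=lambda i: logits[i])
    return source, best_token


# Made-up logits over a three-token vocabulary.
step_a = choose_step([2.0, 1.9, 0.0], [0.1, 3.0, 0.2])  # role-play is surer
step_b = choose_step([5.0, 1.0], [2.0, 1.0])            # zero-shot is surer
```

In a full decoder the chosen token would be appended to both contexts before the next step; that loop is omitted here because it depends on the model interface.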
Details
Motivation: Role-play prompting can improve zero-shot reasoning, but the gains are inconsistent across tasks and instances, suggesting that zero-shot and role-play prompting offer complementary strengths rather than one being universally superior. Method: Persona Switch proceeds step by step: at each step it runs both zero-shot and role-play prompting, measures each output's confidence by its logit gap, and selects the output with higher confidence. Result: Experiments with widely used LLMs show that Persona Switch consistently outperforms competitive baselines, with up to 5.13% accuracy improvement, and that output confidence (logit gap) is an informative measure for selecting the more reliable output. Conclusion: Zero-shot and role-play prompting are complementary, and Persona Switch improves zero-shot reasoning by dynamically combining the two. Abstract: Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
[28] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He,Zongwei Lyu,Yi R Fung
Main category: cs.CL
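TL;DR: This paper proposes RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM). Its TSR pipeline models the reviewer's mental state, formulates a persuasion strategy, and generates a strategy-grounded response. The authors build the RebuttalBench dataset and the Rebuttal-RM evaluator and combine supervised fine-tuning with self-reward reinforcement learning; the agent significantly outperforms the base model and advanced proprietary models in automated and human evaluations.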
Details
Motivation: Academic rebuttal is strategic communication under severe information asymmetry; current approaches largely imitate surface-level language and lack the perspective-taking needed for effective persuasion. Method: The work combines a three-stage ToM-Strategy-Response (TSR) pipeline; RebuttalBench, a large-scale dataset synthesized via a critique-and-refine approach; two-stage training, with supervised fine-tuning for ToM-based analysis and strategic planning followed by reinforcement learning with a self-reward mechanism for scalable self-improvement; and Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data whose scoring consistency with human preferences surpasses that of GPT-4.1. Result: RebuttalAgent outperforms the base model by an average of 18.3% on automated metrics and outperforms advanced proprietary models in both automated and human evaluations. Conclusion: Systematically incorporating Theory of Mind into academic rebuttal is feasible and effective; RebuttalAgent combines strategy, mental-state modeling, and reliable evaluation for AI-assisted research communication. Abstract: Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing that of the powerful judge GPT-4.1.
Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author's own critical analysis and response.
[29] Hallucination Mitigating for Medical Report Generation
Ruoqing Zhao,Runze Xia,Piji Li
Main category: cs.CL
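TL;DR: This paper proposes KERM, a framework that uses knowledge enhancement and fine-grained reinforced rewards to reduce hallucinations of large vision language models in medical report generation and improve report quality.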
Details
Motivation: Large vision language models (LVLMs) are prone to hallucination in medical report generation, producing plausible yet inaccurate content, which is especially dangerous in medicine. Method: MedCLIP is used for knowledge retrieval, pulling relevant lesion fact sentences from a curated knowledge corpus; a purification module ensures the retrieved knowledge is relevant to the patient's clinical context; fine-grained rewards guide the model toward highly supportive and clinically relevant descriptions. Result: Experiments on the IU-Xray and MIMIC-CXR datasets show that the approach mitigates hallucinations and improves report quality. Conclusion: Through knowledge enhancement and fine-grained reinforced rewards, KERM improves the accuracy and reliability of LVLMs in medical report generation. Abstract: In the realm of medical report generation (MRG), the integration of natural language processing has emerged as a vital tool to alleviate the workload of radiologists. Despite the impressive capabilities demonstrated by large vision language models (LVLMs) in understanding natural language, their susceptibility to generating plausible yet inaccurate claims, known as "hallucinations", raises concerns, especially in the nuanced and critical field of medicine. In this work, we introduce a framework, Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM), to tackle the issue. Our approach refines the input to the LVLM by first utilizing MedCLIP for knowledge retrieval, incorporating relevant lesion fact sentences from a curated knowledge corpus. We then introduce a novel purification module to ensure the retrieved knowledge is contextually relevant to the patient's clinical context. Subsequently, we employ fine-grained rewards to guide these models in generating highly supportive and clinically relevant descriptions, ensuring the alignment of the model's outputs with desired behaviors. Experimental results on IU-Xray and MIMIC-CXR datasets validate the effectiveness of our approach in mitigating hallucinations and enhancing report quality.
[30] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs
Tristan Williams,Franziska Weeber,Sebastian Padó,Alan Akbik
Main category: cs.CL
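TL;DR: This paper proposes a framework that evaluates the representativeness of value-aligned LLMs through multivariate correlation patterns in addition to marginal distributions, and finds that existing steering methods can approximate per-item response distributions yet fail to capture the correlation structure of real populations.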
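Note on marginals versus correlation structure: the distinction this entry draws can be seen in a toy case where two sets of survey answers have identical per-item answer counts but different relationships between items. The sketch below uses invented answers to two hypothetical survey items and plain Pearson correlation; it is not the paper's evaluation framework, only an illustration of why matching marginals does not imply matching structure.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equally long answer columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented answers of four respondents to two items on a 1-2 scale.
human_item_a = [1, 1, 2, 2]
human_item_b = [1, 1, 2, 2]   # moves together with item A
model_item_a = [1, 1, 2, 2]
model_item_b = [1, 2, 1, 2]   # same answer counts, unrelated to item A

# Marginal check: per-item answer counts are identical.
same_marginals = (Counter(human_item_a) == Counter(model_item_a)
                  and Counter(human_item_b) == Counter(model_item_b))

# Structural check: the item-to-item correlation differs.
human_r = pearson(human_item_a, human_item_b)
model_r = pearson(model_item_a, model_item_b)
```

Here the marginal check passes while the correlations are 1.0 for the human-style answers and 0.0 for the model-style answers, so an evaluation based only on marginals would miss the difference.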
Details
Motivation: Existing value-alignment work focuses predominantly on aligning marginal response distributions, overlooking the latent multivariate structure that underpins cultural values theories, which can lead to misjudging model capabilities. Method: A representativeness evaluation framework based on multivariate correlation patterns, with human responses from the World Values Survey (WVS) as the gold standard, is used to compare two steering techniques: persona prompting and demographic fine-tuning. Result: Demographic fine-tuning approximates marginal distributions better than persona prompting, but both fail to fully capture the gold-standard correlation patterns; evaluation on marginals alone masks structural failures. Conclusion: Representativeness is a distinct aspect of value alignment, and evaluations should include multivariate structure to avoid overly optimistic conclusions. Abstract: Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We show the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model better approximates marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.
[31] HumanLLM: Towards Personalized Understanding and Simulation of Human Nature
Yuxuan Lei,Tianfu Wang,Jianxun Lian,Zhengyu Hu,Defu Lian,Xing Xie
Main category: cs.CL
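TL;DR: This paper presents HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. Built on the Cognitive Genome Dataset of over 5.5 million user logs and trained with supervised fine-tuning, it improves prediction and simulation of user behavior, thoughts, and writing style.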
Details
Motivation: LLMs are limited in simulating human behavior because they lack a nuanced understanding of human cognition and behavior: standard pretraining data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors. Method: The authors construct the large-scale Cognitive Genome Dataset (over 5.5 million user logs from platforms such as Reddit and Twitter) through a multi-stage pipeline of data filtering, synthesis, and quality control, formulate diverse learning tasks, and perform supervised fine-tuning. Result: HumanLLM outperforms base models in predicting user actions and inner thoughts, mimicking writing styles and preferences, and generating authentic user profiles, and shows stronger generalization on out-of-domain social intelligence benchmarks. Conclusion: HumanLLM supports the value of individual-centered, contextualized modeling for social science research and customer insights. Abstract: Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior--a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models.
Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.
[32] SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics
Silvia Casola,Ryan Soh-Eun Shim,Felicia Körner,Yuchen Mao,Barbara Plank
Main category: cs.CL
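TL;DR: This paper examines why multilingual neural evaluation metrics correlate less well with human judgments in many languages, proposes steering their activations toward English as an internal pivot language, and shows that the approach works for both encoder- and decoder-based metrics.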
Details
Motivation: Multilingual language models often use English as an internal pivot language, and misalignment with this pivot can degrade downstream performance; the paper hypothesizes that the same mismatch affects how well multilingual neural metrics correlate with human judgments. Method: Test-time interventions are applied to encoder- and decoder-based multilingual neural metrics, steering their activations toward an English pivot. Result: The interventions are effective across the board, increasing metric effectiveness for diverse languages. Conclusion: Steering the internal representations of multilingual neural metrics toward an English pivot is a simple and effective way to improve cross-lingual evaluation. Abstract: An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
[33] ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection
Guoxuan Ding,Yuqing Li,Ziyan Zhou,Zheng Lin,Daren Zha,Jiangnan Li
Main category: cs.CL
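TL;DR: This paper proposes ExDR, an explanation-driven dynamic retrieval-augmented generation framework that improves the accuracy and efficiency of multimodal fake news detection.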
Details
Motivation: Existing detection methods struggle with the evolving nature of multimodal fake news and its reliance on timely factual details, and dynamic retrieval-augmented generation suffers from redundant retrieval, coarse similarity, and irrelevant evidence. Method: ExDR systematically uses model-generated explanations in both the retrieval triggering and evidence retrieval modules: it assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features. Result: On two benchmark datasets, AMG and MR2, ExDR outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance. Conclusion: ExDR improves multimodal fake news detection and demonstrates the effectiveness and generalization capability of explanation-driven dynamic retrieval. Abstract: The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
[34] Can professional translators identify machine-generated text?
Michael Farrell
Main category: cs.CL
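TL;DR: This study examines whether professional translators without specialized training can reliably identify AI-generated Italian short stories. In the experiment, about 16.2% of the translators accurately identified the AI texts, relying mainly on objective markers such as low burstiness and narrative contradiction; a comparable share misclassified them, often relying on subjective impressions and possibly even preferring the AI texts. Features such as grammatical accuracy and emotional tone tended to mislead.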
Details
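Motivation: To examine whether professional translators without specialized training can identify AI-generated text, and thereby gauge how detectable such text is, and what it implies, in professional translation and editing settings. Method: Sixty-nine professional translators took part in an in-person experiment in which they rated the likelihood of AI authorship for three anonymized short stories (two generated by ChatGPT-4o, one written by a human) and justified their ratings; judgment accuracy and the features cited as evidence were then analyzed statistically. Result: A statistically significant subset (16.2%) correctly distinguished the AI texts from the human text, relying mainly on low burstiness and narrative contradiction; a nearly equal number misclassified the texts in the opposite direction. Low burstiness and narrative contradiction were the most reliable indicators of AI authorship, whereas grammatical accuracy and emotional tone frequently led to misclassification. Conclusion: Some professional translators can identify AI text without training, but overall reliability is limited; success depends heavily on analytical skill and specific linguistic features, which calls for a fresh look at the role of AI-generated text in professional editing workflows and at when intervention is needed. Abstract: This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.
[35] Determinants of Training Corpus Size for Clinical Text Classification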
Jaya Chaturvedi,Saniya Deshpande,Chenkai Ma,Robert Cobb,Angus Roberts,Robert Stewart,Daniel Stahl,Diana Shamsutdinova
Main category: cs.CL
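TL;DR: This paper studies how training-set size and vocabulary properties affect model performance in clinical text classification. It finds that 600 documents reach 95% of the performance obtained with 10,000 documents, and it quantifies how the numbers of strong and noisy predictive words affect learning curves and accuracy.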
Details
Motivation: Clinical NLP text classification typically relies on 200-500 annotated documents, but this number lacks theoretical or empirical justification in terms of sample size requirements and their relationship to vocabulary properties. Method: Using the MIMIC-III discharge notes dataset, pre-trained BERT embeddings followed by Random Forest classifiers are applied to binary classification tasks for 10 ICD-9 diagnoses while the training set size is varied systematically from 100 to 10,000 documents; Lasso logistic regression on bag-of-words embeddings identifies strong and noisy predictive words, linking vocabulary properties to the learning curves. Result: Learning curves differ markedly across diagnosis tasks; 600 documents suffice to reach 95% of the performance attainable with 10,000 documents; every 100 additional noisy words decrease accuracy by about 0.02, and every 100 additional strong predictors increase maximum accuracy by about 0.04. Conclusion: Training sample size requirements depend strongly on task-specific vocabulary properties (the balance of strong and noisy words) rather than on a fixed rule-of-thumb threshold; this finding can guide more efficient, lower-cost clinical text annotation. Abstract: Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. For that, 200-500 documents are typically annotated. The number is constrained by time and costs and lacks justification of the sample size requirements and their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient to achieve 95% of the performance attainable with 10,000 documents for all tasks. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves, where every 100 additional noisy words decreased accuracy by approximately 0.02 while 100 additional strong predictors increased maximum accuracy by approximately 0.04.
[36] Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers
Francisco Portillo López
Main category: cs.CL
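TL;DR: This study uses the McGurk effect to test the perceptual bio-fidelity of AV-HuBERT's audiovisual perception. Its auditory dominance rate closely matches that of humans, but its tendency toward phonetic fusion is overly deterministic, lacking the stochasticity and diversity of human perception.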
Details
Motivation: To assess how faithfully AV-HuBERT models human perceptual behavior in audiovisual integration (specifically the McGurk effect) and whether it exhibits human-like multisensory processing. Method: Quantitatively compare AV-HuBERT's responses to incongruent audiovisual (McGurk) stimuli with behavioral data from 44 human participants, focusing on auditory dominance rate, phonetic fusion rate, and the distribution of error patterns. Result: AV-HuBERT nearly matches humans in auditory dominance rate (32.0% vs. 31.8%) but shows a significantly higher phonetic fusion rate (68.0% vs. 47.7%) and lacks the perceptual stochasticity and diverse error types observed in humans. Conclusion: Current self-supervised audiovisual models reproduce the aggregate outcomes of human multisensory integration but do not model neural-level variability, suggesting that their perceptual mechanism is deterministic rather than probabilistic. Abstract: This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
[37] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model
Chenghao Fan,Wen Heng,Bo Li,Sichen Liu,Yuxuan Song,Jing Su,Xiaoye Qu,Kai Shen,Wei Wei
Main category: cs.CL
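TL;DR: This paper proposes Stable-DiffCoder, a block diffusion code generation model that outperforms its autoregressive baseline under the same architecture and data, achieves stable and efficient training through continual pretraining and a tailored noise schedule, and improves code editing, reasoning, and low-resource language modeling.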
Details
Motivation: Existing diffusion-based code language models (DLLMs) still lag behind strong autoregressive (AR) baselines under comparable compute budgets; the authors revisit this setting systematically and aim to improve DLLM performance. Method: Proposes Stable-DiffCoder, which reuses the Seed-Coder architecture, data, and training pipeline; introduces a block diffusion continual pretraining (CPT) stage supported by a tailored warmup and a block-wise clipped noise schedule; training consists of only the CPT and supervised fine-tuning stages. Result: Overall outperforms the same-architecture AR model on a broad set of code benchmarks; with only CPT and fine-tuning, exceeds a range of ~8B AR and DLLM models; performs better on structured code editing, reasoning, and low-resource programming languages. Conclusion: Diffusion-based training can by itself improve code modeling quality without relying on more parameters or extra data; block diffusion and any-order modeling offer distinct advantages for understanding code structure and for data augmentation. Abstract: Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. In addition, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.
[38] Transfer Learning from ImageNet for MEG-Based Decoding of Imagined Speech
Soufiane Jhilal,Stéphanie Martin,Anne-Lise Giraud
Main category: cs.CL
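TL;DR: This paper proposes a new method that converts magnetoencephalography (MEG) signals into time-frequency images and decodes imagined speech with ImageNet-pretrained vision models, achieving strong performance on several tasks and revealing neural representations shared across subjects along with their temporal characteristics.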
Details
Motivation: Non-invasive decoding of imagined speech is challenging because the signals are weak and widely distributed and labeled data are scarce. Method: MEG signals from 21 participants are projected through a learnable sensor-space convolution into three spatial scalogram mixtures, producing image-like inputs that are fed to ImageNet-pretrained vision models for decoding. Result: Balanced accuracy reaches up to 90.4% for imagery vs. silence, 81.0% for imagery vs. silent reading, and 60.6% for vowel decoding; cross-subject evaluation confirms that the pretrained models capture shared neural representations; temporal analysis localizes discriminative information to imagery-locked time windows. Conclusion: Applying pretrained vision models to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals. Abstract: Non-invasive decoding of imagined speech remains challenging due to weak, distributed signals and limited labeled data. Our paper introduces an image-based approach that transforms magnetoencephalography (MEG) signals into time-frequency representations compatible with pretrained vision models. MEG data from 21 participants performing imagined speech tasks were projected into three spatial scalogram mixtures via a learnable sensor-space convolution, producing compact image-like inputs for ImageNet-pretrained vision architectures. These models outperformed classical and non-pretrained models, achieving up to 90.4% balanced accuracy for imagery vs. silence, 81.0% vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation confirmed that pretrained models capture shared neural representations, and temporal analyses localized discriminative information to imagery-locked intervals. These findings show that pretrained vision models applied to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals.
[39] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
Özgür Uğur,Mahmut Göksu,Mahmut Çimen,Musa Yılmaz,Esra Şavirdi,Alp Talha Demir,Rumeysa Güllüce,İclal Çetin,Ömer Can Sağbaş
Main category: cs.CL
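TL;DR: This paper presents the Mecellem model framework, which uses domain adaptation strategies to develop specialized language models for the Turkish legal domain. It includes efficient encoders pre-trained from scratch and decoders built via continual pre-training (CPT), substantially improving retrieval performance and domain adaptation while reducing computational cost.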
Details
Motivation: The Turkish legal domain lacks high-performing, efficient specialized language models, and existing SOTA models depend on multi-stage, computationally expensive training pipelines; this motivates a lighter, single-stage approach that still adapts well to the domain. Method: Two parts: (1) ModernBERT encoders pre-trained from scratch on a Turkish-dominant corpus (112.7B tokens), with checkpoint selection based on downstream retrieval performance; (2) continual pre-training (CPT) of Qwen3-1.7B/4B decoders with a four-phase controlled curriculum that gradually introduces legal terminology and long-context reasoning. Result: The encoders rank in the top 3 on the Turkish retrieval leaderboard; the small 155M model performs comparably to 307M-567M models; production efficiency reaches 92.36% (behind only three SOTA models); the decoders achieve a 36.2% perplexity reduction on Turkish legal text. Conclusion: The Mecellem framework shows that single-stage efficient pre-training combined with curriculum-based continual adaptation is effective, offering a reproducible, low-cost, high-performing approach to building language models for resource-constrained vertical domains. Abstract: This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
[40] Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
Tony Cristofano
Main category: cs.CL
TL;DR: 本文提出了一种跨模型迁移拒绝干预的方法,证明了对齐大语言模型中的拒绝行为源于一种通用的、低维的语义回路。
Details
Motivation: Refusal behavior in aligned large language models is often viewed as model-specific, but the author hypothesizes that it stems from a universal, low-dimensional semantic circuit shared across models. Method: Proposes the Trajectory Replay via Concept-Basis Reconstruction framework, which aligns layers via concept fingerprints, reconstructs refusal directions from shared "concept atoms", and adds a weight-SVD stability guard to avoid harming model capabilities. Result: Across 8 model pairs (including GPT-OSS-20B and GLM-4), the method consistently attenuates refusal behavior while maintaining model performance. Conclusion: The results provide strong evidence for the semantic universality of safety alignment, indicating that refusal mechanisms generalize across architectures and training regimes. Abstract: Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
[41] Adapter Fusion for Multilingual Text2Cypher with Linear and Learned Gating
Makbule Gulcin Ozsoy
Main category: cs.CL
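TL;DR: This paper proposes a scalable multilingual Text2Cypher method that trains language-specific LoRA adapters and combines them with a learned fusion MLP with dynamic gating. Using only a small amount of data and no full joint fine-tuning, it recovers about 75% of the accuracy gains from joint multilingual fine-tuning and supports efficient incremental addition of languages.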
Details
Motivation: Existing Text2SQL, Text2SPARQL, and Text2Cypher systems focus mostly on English and lack scalable, low-overhead multilingual support; the goal is to avoid repeated full fine-tuning and manual hyperparameter tuning while staying close to the performance of joint multilingual fine-tuning. Method: Train separate LoRA adapters for English, Spanish, and Turkish, and combine them via uniform linear merging or a learned fusion MLP with dynamic gating; a new language is added by training one additional LoRA adapter and retraining the lightweight MLP. Result: The learned fusion MLP outperforms linear merging on all three languages and recovers about 75% of the accuracy gains from joint multilingual fine-tuning while using only a subset of the data, demonstrating its advantages in performance, data efficiency, and scalability. Conclusion: Learned adapter fusion is a practical alternative that balances performance, data efficiency, and scalability for multilingual Text2Cypher and supports low-cost incremental language expansion. Abstract: Large Language Models enable users to access databases through natural language interfaces, using tools such as Text2SQL, Text2SPARQL, and Text2Cypher, which translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
[42] SynthOCR-Gen: A Synthetic OCR Dataset Generator for Low-Resource Languages - Breaking the Data Barrier
Haq Nawaz Malik,Kh Mohmad Shafi,Tanveer Ahmad Reshi
Main category: cs.CL
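TL;DR: This paper presents SynthOCR-Gen, an open-source synthetic OCR dataset generator for low-resource languages that converts Unicode text into images with realistic degradations, addressing the scarcity of annotated data, and releases a 600,000-sample Kashmiri OCR dataset.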
Details
Motivation: Low-resource languages (such as Kashmiri, written in a Perso-Arabic script) lack large-scale annotated OCR training data and are unsupported by existing OCR systems, while manual annotation is expensive, slow, and error-prone. Method: Develops SynthOCR-Gen, a complete synthesis pipeline comprising text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering, and 25+ document degradation augmentations (such as rotation, blur, noise, and scanner artifacts). Result: Generates and publicly releases on HuggingFace a synthetic Kashmiri OCR dataset of 600,000 word samples, demonstrating the efficacy of the approach. Conclusion: SynthOCR-Gen offers a reusable, low-cost, high-quality data generation solution for low-resource-language OCR, helping vision-language AI reach more marginalized writing systems. Abstract: Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
[43] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging
Alphaeus Dmonte,Vidhi Gupta,Daniel J Perry,Mark Arehart
Main category: cs.CL
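TL;DR: This paper presents the first efficiency-focused analysis of merging strategies for multilingual multitask models, showing that merging significantly reduces training and maintenance costs while maintaining quality.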
Details
Motivation: Existing approaches to fine-tuning multilingual large language models require retraining the entire model, which is computationally inefficient and hard to maintain, so a more efficient update strategy is needed. Method: A systematic efficiency analysis of merging strategies for multilingual multitask models, covering three independent tasks and validated on public and proprietary industry datasets. Result: Merging reduces initial training time by up to 50%; updating a language and re-merging reduces maintenance training costs by more than 60%; the approach is effective in both academic and industrial settings. Conclusion: Merging multilingual multitask models is an efficient, practical alternative that substantially eases the training and maintenance bottlenecks of multilingual models. Abstract: Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.
[44] Automatic Classification of Arabic Literature into Historical Eras
Zainab Alhathloul,Irfan Ahmad
Main category: cs.CL
TL;DR: This paper uses neural networks and deep learning to automatically classify Arabic texts by historical period, addressing a gap in the research.
Details
Motivation: Arabic has changed considerably over time, yet little research has addressed automatic classification of texts by period, especially beyond poetry. Method: Neural network and deep learning methods are applied to datasets built from two public corpora (OpenITI and APCD), with period classification setups ranging from binary to multi-class (up to 15 classes). Result: F1-scores reach 0.83 (OpenITI) and 0.79 (APCD) on the binary task, while on fine-grained classification (15 and 12 classes) they drop to 0.20 and 0.18, respectively. Conclusion: Deep learning models are effective for coarse-grained periodization of Arabic texts, but fine-grained classification remains challenging and calls for further work on features and modeling. Abstract: The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of others, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated using two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
[45] LLM-in-Sandbox Elicits General Agentic Intelligence
Daixuan Cheng,Shaohan Huang,Yuxian Gu,Huatong Song,Guoxin Chen,Li Dong,Wayne Xin Zhao,Ji-Rong Wen,Furu Wei
Main category: cs.CL
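TL;DR: This paper proposes LLM-in-Sandbox, a framework that lets large language models explore autonomously within a code sandbox and thereby exhibit general intelligence on non-code tasks. Without additional training, models gain capabilities such as knowledge acquisition and long-context handling; reinforcement learning using only non-agentic data improves performance further, with strong generalization across multidisciplinary tasks.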
Details
Motivation: Existing large language models lack active exploration and tool use on non-code tasks, which limits their general intelligence; what is needed is a general approach that requires no additional training, reuses existing models, and supports real-world deployment. Method: Proposes the LLM-in-Sandbox framework, in which an LLM performs file operations, runs scripts, and accesses external resources inside an isolated code sandbox; also designs LLM-in-Sandbox-RL, which uses only non-agentic data for reinforcement learning to improve autonomous exploration in the sandbox. Result: LLM-in-Sandbox generalizes robustly across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following; both the training-free and post-trained settings clearly outperform baselines; a systems analysis indicates good computational and deployment efficiency. Conclusion: A code sandbox can serve as a low-cost "cognitive extension interface" for general intelligence; LLM-in-Sandbox shows that agentic reasoning and tool use can be elicited from LLMs without dedicated training and then strengthened, offering a new path toward general AI systems. Abstract: We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
cs.CV [Back]
[46] AI-Based Culvert-Sewer Inspection
Christina Thrainer
Main category: cs.CV
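TL;DR: For automated defect segmentation in culverts and sewer pipes, this thesis proposes three approaches to data scarcity: improved data preprocessing strategies, a novel lightweight network called FORTRESS, and few-shot semantic segmentation based on a bidirectional prototypical network. Each improves performance with limited annotated data.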
Details
Motivation: Defect detection in culverts and sewer pipes suffers from annotated data that is hard and costly to obtain and requires domain knowledge, so large annotated datasets are impractical and efficient segmentation methods for small-sample settings are needed. Method: Three approaches: (1) a preprocessing strategy combining traditional data augmentation with dynamic label injection; (2) FORTRESS, a novel lightweight network that combines depthwise separable convolutions, adaptive KAN modules, and multi-scale attention; (3) few-shot semantic segmentation using a bidirectional prototypical network with attention mechanisms. Result: All approaches are validated on the culvert and sewer pipe defect dataset: the preprocessing strategy significantly improves IoU and F1 score; FORTRESS achieves state-of-the-art performance while reducing parameter count and computational cost; the few-shot method achieves satisfactory results across metrics. Conclusion: The work systematically addresses pipe defect segmentation under small-sample conditions through data augmentation, architectural innovation, and few-shot learning, offering a feasible, efficient, and lightweight path to practical engineering use. Abstract: Culverts and sewer pipes are critical components of drainage systems, and their failure can lead to serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge. Having a large dataset for structural defect detection is therefore not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real-world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity. Data scarcity can be addressed either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms. FORTRESS achieves state-of-the-art performance on the culvert sewer pipe defect dataset, while significantly reducing the number of trainable parameters, as well as its computational cost. Finally, we investigate few-shot semantic segmentation and its applicability to defect detection. Few-shot learning aims to train models with only limited data available. By employing a bidirectional prototypical network with attention mechanisms, the model obtains richer feature representations and achieves satisfactory results across evaluation metrics.
[47] Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition
Hatef Otroshi Shahreza,Anjith George,Sébastien Marcel
Main category: cs.CV
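TL;DR: This paper systematically evaluates multimodal large language models (MLLMs) on heterogeneous face recognition (HFR) across cross-modality scenarios including VIS-NIR, VIS-SWIR, and VIS-THERMAL. It finds that they still lag well behind classical face recognition systems, especially under cross-spectral conditions, which highlights the current limitations of MLLMs for biometric applications.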
Details
Motivation: To explore the suitability and potential of multimodal large language models (MLLMs) for heterogeneous face recognition (HFR), a representative cross-modality biometric task. Method: Systematically benchmark multiple open-source MLLMs on cross-modality face verification scenarios including VIS-NIR, VIS-SWIR, and VIS-THERMAL, using standard biometric protocols and metrics (such as Acquire Rate, EER, and TAR). Result: MLLMs perform markedly worse than classical dedicated face recognition systems across the cross-modality tasks, with an especially large gap under challenging cross-spectral conditions; current MLLMs are not yet reliable enough for deployment in real HFR systems. Conclusion: Current MLLMs have clear limitations for heterogeneous face recognition and cannot replace dedicated biometric models; applying MLLMs to biometrics requires more rigorous evaluation and targeted improvements. Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images are from different sensing modalities, including visual (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal camera. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and based on different metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, in spite of recent advances in MLLMs. Our findings highlight the limitations of current MLLMs for HFR and also the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
[48] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Pablo Messina,Andrés Villa,Juan León Alcázar,Karen Sánchez,Carlos Hinojosa,Denis Parra,Álvaro Soto,Bernard Ghanem
Main category: cs.CV
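TL;DR: CURE is an error-aware curriculum learning framework that needs no additional data; it dynamically adjusts its sampling strategy to improve the visual grounding accuracy and factual consistency of medical vision-language models in radiology report generation.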
Details
Motivation: Existing medical vision-language models suffer from inaccurate visual grounding and factual inconsistency in radiology report generation, so textual findings are misaligned with the visual evidence. Method: CURE fine-tunes a multimodal instruction model on three tasks (phrase grounding, grounded report generation, and anatomy-grounded report generation) and uses a dynamic sampling strategy that adjusts difficulty based on model performance. Result: CURE improves grounding accuracy by +0.37 IoU, improves report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. Conclusion: CURE is a data-efficient framework that significantly improves the grounding accuracy and reliability of medical report generation. Abstract: Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
[49] DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction
Cuong Tran Van,Trong-Thang Pham,Ngoc-Son Nguyen,Duy Minh Ho Nguyen,Ngan Le
Main category: cs.CV
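TL;DR: This paper proposes DuFal, a framework that uses a dual-path (frequency-domain plus spatial-domain) architecture to better recover high-frequency anatomical detail in sparse-view cone-beam CT reconstruction. It is built around a High-Local Factorized Fourier Neural Operator together with Spectral-Channel Factorization and Cross-Attention Frequency Fusion modules, and it significantly outperforms existing methods.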
Details
Motivation: With too few X-ray projections, sparse-view CBCT reconstruction struggles to recover the fine anatomical structures that correspond to high-frequency components; conventional CNNs are biased toward low-frequency information, which limits their performance. Method: Proposes the DuFal (Dual-Frequency-Aware Learning) framework, comprising a Fourier neural operator with two branches (global and local high-frequency enhanced), a spectral-channel factorization scheme that reduces parameters, a cross-attention frequency fusion module, and a feature decoder followed by an intensity field decoding pipeline. Result: On the LUNA16 and ToothFairy datasets, DuFal significantly outperforms SOTA methods under extremely sparse-view settings, especially in the fidelity of high-frequency anatomical features. Conclusion: Jointly modeling the frequency and spatial domains can mitigate the loss of high-frequency information in sparse-view CT reconstruction; DuFal offers a new approach to medical image reconstruction. Abstract: Sparse-view Cone-Beam Computed Tomography reconstruction from limited X-ray projections remains a challenging problem in medical imaging due to the inherent undersampling of fine-grained anatomical details, which correspond to high-frequency components. Conventional CNN-based methods often struggle to recover these fine structures, as they are typically biased toward learning low-frequency information. To address this challenge, this paper presents DuFal (Dual-Frequency-Aware Learning), a novel framework that integrates frequency-domain and spatial-domain processing via a dual-path architecture. The core innovation lies in our High-Local Factorized Fourier Neural Operator, which comprises two complementary branches: a Global High-Frequency Enhanced Fourier Neural Operator that captures global frequency patterns and a Local High-Frequency Enhanced Fourier Neural Operator that processes spatially partitioned patches to preserve spatial locality that might be lost in global frequency analysis. To improve efficiency, we design a Spectral-Channel Factorization scheme that reduces the Fourier Neural Operator parameter count. We also design a Cross-Attention Frequency Fusion module to integrate spatial and frequency features effectively. The fused features are then decoded through a Feature Decoder to produce projection representations, which are subsequently processed through an Intensity Field Decoding pipeline to reconstruct a final Computed Tomography volume. Experimental results on the LUNA16 and ToothFairy datasets demonstrate that DuFal significantly outperforms existing state-of-the-art methods in preserving high-frequency anatomical features, particularly under extremely sparse-view settings.
[50] DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection
Morteza Poudineh,Marc Lalonde
Main category: cs.CV
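TL;DR: This paper proposes a deviation-guided prompt learning framework for few-shot anomaly detection, using learnable prompts and a deviation-based scoring mechanism to improve the detection and localization of anomalous regions in images.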
Details
Motivation: Existing few-shot anomaly detection methods show weak discriminability between normal and abnormal prompts and lack a principled patch-level anomaly scoring mechanism. Method: Replaces fixed prompt prefixes with learnable context vectors and adds anomaly-specific suffixes; introduces a deviation loss with Top-K Multiple Instance Learning that models patch features as Gaussian deviations from the normal distribution. Result: On the MVTecAD and VISA datasets, the method achieves better pixel-level detection performance than PromptAD and other baselines, and ablation studies validate each component. Conclusion: The framework combines the semantic power of vision-language models with the statistical reliability of deviation-based scoring, improving discriminability, localization accuracy, and interpretability of anomaly detection in the few-shot setting. Abstract: Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
[51] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
Yunshan Qi,Lin Zhu,Nan Bao,Yifan Zhao,Jia Li
Main category: cs.CV
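TL;DR: This paper proposes a unified, sensor-physics grounded NeRF framework that uses single-exposure blurry LDR images and the corresponding event data to achieve sharp, high dynamic range (HDR) novel view synthesis.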
Details
Motivation: Existing methods that use event data for novel view synthesis from blurry LDR images ignore the sensor-physics mismatch between camera output and real scene radiance, leading to suboptimal HDR reconstruction and deblurring.
Method: A sensor-physics grounded NeRF framework: NeRF directly represents the actual scene radiance in the HDR domain; the physical response of HDR rays at the sensor pixels is modeled; a pixel-wise RGB mapping field aligns rendered values with the LDR inputs; an event mapping field links scene dynamics to the event sensor output; the NeRF and both mapping fields are jointly optimized.
Result: Experiments on collected and public datasets validate the method, which achieves state-of-the-art deblurring HDR novel view synthesis from single-exposure blurry LDR images and events.
Conclusion: Sensor-physics modeling is key to improving event-assisted HDR novel view synthesis; the unified framework effectively fuses the spatial and temporal information in LDR images and events, improving the sharpness and dynamic range of the 3D representation.
Abstract: Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.
[52] Hybrid Vision Transformer_GAN Attribute Neutralizer for Mitigating Bias in Chest X_Ray Diagnosis
Jobeal Solomon,Ali Mohammed Mansoor Alsahag,Seyed Sahand Mohammadi Ziabari
Main category: cs.CV
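TL;DR: This study replaces the U-Net encoder with a Vision Transformer (ViT) to build an attribute-neutralization framework, substantially reducing sex/age bias leakage in chest X-ray classifiers while preserving diagnostic performance.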
Details
Motivation: Existing pixel-space attribute neutralizers based on convolutional encoders cannot fully remove the systematic bias caused by sex- and age-related shortcuts at clinically usable edit strengths.
Method: The U-Net encoder in the Attribute-Neutral Framework is replaced with a DeiT-S Vision Transformer trained on the ChestX-ray14 dataset; edited images are generated at 11 edit intensities (alpha), an independent AI judge evaluates attribute leakage (e.g., sex-recognition AUC), and a ConvNet evaluates prediction performance (ROC AUC) on 15 findings.
Result: At alpha = 0.5, the ViT neutralizer reduces sex-recognition AUC to about 0.80 (10 percentage points below the original U-Net version) with only half the training epochs; macro ROC AUC drops by no more than 5 percentage points, and the worst-case subgroup AUC stays at about 0.70.
Conclusion: Vision models with global self-attention (such as ViT) can further suppress demographic attribute leakage without compromising clinical utility, offering a practical path toward fairer chest X-ray AI.
Abstract: Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework's convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70. These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.
[53] Controllable Layered Image Generation for Real-World Editing
Jinrui Yang,Qing Liu,Yijun Li,Mengwei Ren,Letian Zhang,Zhe Lin,Cihang Xie,Yuyin Zhou
Main category: cs.CV
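TL;DR: This paper proposes LASAGNA, a framework for generating layered images (a background plus a transparent foreground) with realistic visual effects such as shadows and reflections. It supports controllable editing from multiple conditioning inputs (text, masks, etc.) and comes with the LASAGNA-48K dataset and the LASAGNABENCH benchmark.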
Details
Motivation: Existing image generation models lack controllability and consistency when editing specific elements; layered representations are flexible, but it is hard to generate object layers with coherent compositing relationships and realistic visual effects such as shadows and reflections.
Method: LASAGNA is a unified framework that jointly generates a background and a transparent foreground layer with physically grounded visual effects; the LASAGNA-48K dataset (clean backgrounds and RGBA foregrounds) is built; LASAGNABENCH, the first benchmark for layer editing, is designed; conditioning inputs include text and foreground, background, and location masks.
Result: LASAGNA generates highly consistent and coherent multi-layer images simultaneously, markedly improving identity preservation and visual-effect fidelity and supporting diverse post-editing applications; LASAGNA-48K and LASAGNABENCH will be open-sourced.
Conclusion: LASAGNA offers a new paradigm for controllable, high-quality layered image generation and editing, advancing image synthesis research aimed at real-world applications.
Abstract: Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers--a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs--text prompts, foreground, background, and location masks--offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community. The project page is https://rayjryang.github.io/LASAGNA-Page/.
[54] DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views
William Huang,Siyou Pei,Leyi Zou,Eric J. Gonzalez,Ishan Chatterjee,Yang Zhang
Main category: cs.CV
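TL;DR: This paper proposes a dual-stream delta encoder that exploits dorsal hand skin deformation, substantially improving hand pose estimation accuracy in self-occluded scenarios and enabling new interaction paradigms.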
Details
Motivation: The spread of XR devices makes egocentric hand pose estimation important, but frequent finger occlusions from this viewpoint are a challenge.
Method: A dual-stream delta encoder learns pose by contrasting features of the dynamic hand with those of a baseline relaxed pose, using only cropped dorsal hand images.
Result: In self-occluded scenarios (fingers >=50% occluded), MPJAE drops by 18% compared with SOTA methods that rely on whole-hand geometry and large model backbones; the reliability of downstream tasks such as pinch and tap estimation improves, and isometric force can be detected without visible movement.
Conclusion: The method improves the robustness of hand pose estimation under occlusion and advances lightweight models and new interaction styles.
Abstract: The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >=50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" without visible movement while minimizing model size.
[55] VIOLA: Towards Video In-Context Learning with Minimal Annotations
Ryo Fujii,Hideo Saito,Ryo Hachiuma
Main category: cs.CV
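TL;DR: This paper proposes VIOLA, a framework that uses density-uncertainty-weighted sampling and confidence-aware retrieval and prompting to let multimodal large language models adapt efficiently to new video domains via in-context learning with very few expert annotations.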
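A minimal sketch of what confidence-aware retrieval and prompting could look like for this entry. The digest only says that demonstrations are ranked by a composite of similarity and confidence and that the prompt distinguishes verified labels from pseudo-labels; the convex combination with weight `lam`, the `Demo` record, and the `render` wording are all assumptions made here for illustration, not the authors' formulation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demo:
    embedding: Sequence[float]
    label: str
    confidence: float  # assumed: 1.0 for expert-verified labels, model confidence otherwise
    verified: bool


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float], pool: List[Demo], k: int, lam: float = 0.5) -> List[Demo]:
    """Rank demonstrations by an assumed composite score:
    lam * similarity + (1 - lam) * label confidence."""
    def score(d: Demo) -> float:
        return lam * cosine(query, d.embedding) + (1.0 - lam) * d.confidence

    return sorted(pool, key=score, reverse=True)[:k]


def render(demo: Demo) -> str:
    """Tell the model how much to trust each demonstration's label."""
    if demo.verified:
        tag = "verified label"
    else:
        tag = f"pseudo-label, confidence {demo.confidence:.2f}"
    return f"Answer: {demo.label} ({tag})"
```

With `lam = 0.5`, a very similar but low-confidence pseudo-labeled demonstration can rank below a slightly less similar expert-verified one, which is the behavior the composite score is meant to encourage.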
Details
Motivation: Existing methods for adapting to new video domains rely on large amounts of annotated data, but in specialized settings such as industrial or surgical environments expert annotation is costly and hard to obtain, so a training-free adaptation method with low annotation cost is needed.
Method: The VIOLA framework: (1) density-uncertainty-weighted sampling that balances diversity, representativeness, and informativeness; (2) a hybrid demonstration pool with confidence-aware retrieval (scoring by both similarity and confidence) and confidence-aware prompting (distinguishing verified labels from noisy pseudo-labels).
Result: Across 9 video benchmarks and 4 MLLMs, VIOLA significantly outperforms various baselines in low-resource settings, achieving robust domain adaptation at very low annotation cost.
Conclusion: VIOLA shows that very few expert annotations suffice to improve MLLM generalization to new video domains, offering a practical path to rapid model deployment in real-world scenarios.
Abstract: Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require expert annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
[56] Relative Classification Accuracy: A Calibrated Metric for Identity Consistency in Fine-Grained K-pop Face Generation
Sylvey Lin,Eranki Vasistha
Main category: cs.CV
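TL;DR: This paper proposes a new evaluation metric, RCA, for measuring the semantic controllability of class-conditional DDPMs in K-pop idol face generation, and finds that the model achieves high visual quality but suffers from severe semantic mode collapse.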
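A minimal sketch of how a metric in the spirit of RCA might be computed. This entry states only that RCA normalizes generative performance against an oracle classifier's baseline; the specific ratio used here (classifier accuracy on generated images divided by the same classifier's accuracy on real held-out images) and all function names are assumptions for illustration.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the predicted class equals the target class."""
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("predicted and target must be non-empty and the same length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def relative_classification_accuracy(
    gen_predicted: Sequence[int],   # classifier outputs on generated images
    gen_condition: Sequence[int],   # class labels the generator was conditioned on
    real_predicted: Sequence[int],  # classifier outputs on real held-out images
    real_true: Sequence[int],       # ground-truth labels of those real images
) -> float:
    """Assumed form: accuracy on generated samples divided by the oracle
    classifier's accuracy on real data, so classifier error is factored out."""
    oracle_acc = accuracy(real_predicted, real_true)
    if oracle_acc == 0.0:
        raise ValueError("oracle classifier accuracy is zero; the ratio is undefined")
    return accuracy(gen_predicted, gen_condition) / oracle_acc
```

Under this reading, a value near 1 would mean generated faces are recognized as their conditioning identity about as often as real faces are, while a low value points to identity misalignment even when FID looks good.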
Details
Motivation: Standard metrics such as FID and IS struggle to detect identity misalignment in fine-grained, single-domain tasks, so a more suitable way to evaluate semantic controllability is needed.
Method: For K-pop idol face generation (32x32), a task with high inter-class similarity, a class-conditional DDPM is built and Relative Classification Accuracy (RCA), a metric normalized against an oracle classifier baseline, is proposed; failure modes are analyzed with confusion matrices.
Result: The model reaches an FID of 8.93 (high visual quality) but an RCA of only 0.27 (severe semantic mode collapse), especially for visually ambiguous identities; the failures are attributed to resolution constraints and intra-gender ambiguity.
Conclusion: RCA provides a rigorous new standard for verifying identity consistency in conditional generative models and reveals an important trade-off between high-fidelity generation and semantic controllability.
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in high-fidelity image generation. However, evaluating their semantic controllability, specifically for fine-grained, single-domain tasks, remains challenging. Standard metrics like FID and Inception Score (IS) often fail to detect identity misalignment in such specialized contexts. In this work, we investigate Class-Conditional DDPMs for K-pop idol face generation (32x32), a domain characterized by high inter-class similarity. We propose a calibrated metric, Relative Classification Accuracy (RCA), which normalizes generative performance against an oracle classifier's baseline. Our evaluation reveals a critical trade-off: while the model achieves high visual quality (FID 8.93), it suffers from severe semantic mode collapse (RCA 0.27), particularly for visually ambiguous identities. We analyze these failure modes through confusion matrices and attribute them to resolution constraints and intra-gender ambiguity. Our framework provides a rigorous standard for verifying identity consistency in conditional generative models.
[57] Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization for Cross-Subject EEG Emotion Recognition
Weiwei Wu,Yueyang Li,Yuhu Shi,Weiming Zeng,Lang Qin,Yang Yang,Ke Zhou,Zhiguo Zhang,Wai Ting Siok,Nizhuan Wang
Main category: cs.CV
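TL;DR: This paper proposes the RSM-CoDG framework, which combines brain-region priors, multi-scale temporal modeling, and a collaborative domain generalization strategy to improve the robustness and generalization of cross-subject EEG emotion recognition.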
Details
Motivation: Cross-subject EEG emotion recognition faces large inter-subject variability, severe distribution shift, and the high spatiotemporal complexity of emotion-related neural representations; existing methods optimize spatial modeling, temporal modeling, or generalization strategies in isolation and struggle to model multi-scale dynamics and suppress subject-specific bias in a unified way.
Method: Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization (RSM-CoDG): (1) region-level spatial representations built from functional brain region partitioning; (2) multi-scale temporal modeling of the dynamic evolution of emotion-related neural activity; (3) a collaborative domain generalization strategy that uses multidimensional constraints to suppress individual bias for unseen subjects.
Result: On the SEED series datasets, RSM-CoDG consistently outperforms existing methods, markedly improving cross-subject emotion recognition accuracy and model robustness.
Conclusion: By combining neuroscience priors with collaborative domain generalization, RSM-CoDG aligns cross-subject representations, models multi-scale spatiotemporal dynamics, and suppresses individual bias within a unified framework, offering a more generalizable solution for EEG emotion recognition.
Abstract: Cross-subject EEG-based emotion recognition (EER) remains challenging due to strong inter-subject variability, which induces substantial distribution shifts in EEG signals, as well as the high complexity of emotion-related neural representations in both spatial organization and temporal evolution. Existing approaches typically improve spatial modeling, temporal modeling, or generalization strategies in isolation, which limits their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework. To address these gaps, we propose a Region-aware Spatiotemporal Modeling framework with Collaborative Domain Generalization (RSM-CoDG) for cross-subject EEG emotion recognition. RSM-CoDG incorporates neuroscience priors derived from functional brain region partitioning to construct region-level spatial representations, thereby improving cross-subject comparability. It also employs multi-scale temporal modeling to characterize the dynamic evolution of emotion-evoked neural activity. In addition, the framework employs a collaborative domain generalization strategy, incorporating multidimensional constraints to reduce subject-specific bias in a fully unseen target subject setting, which enhances the generalization to unknown individuals. Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness. The source code is available at https://github.com/RyanLi-X/RSM-CoDG.
[58] Explainable Deepfake Detection with RL Enhanced Self-Blended Images
Ning Jiang,Dingheng Zeng,Yanhong Liu,Haiyang Yi,Shijie Yu,Minghe Weng,Haifeng Shen,Ying Li
Main category: cs.CV
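TL;DR: This paper proposes an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images and a reinforcement learning (RL)-enhanced deepfake detection framework, which reduce the need for high-quality annotated data, improve cross-domain generalization, and reach SOTA-level performance on multiple cross-dataset benchmarks.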
Details
Motivation: Existing deepfake detection methods lack explainability; multimodal large language models (MLLMs) show promise but are limited by costly, hard-to-obtain fine-grained forgery attribution text annotations; meanwhile, reinforcement learning has shown advantages in visual tasks, especially for cross-domain generalization, and its use in this task is worth exploring.
Method: An automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images is combined with an RL-enhanced detection framework that includes a tailored reward mechanism and feedback-driven synthetic data generation.
Result: Experiments validate the CoT data construction pipeline, the reward mechanism, and the synthetic data generation approach; performance is competitive with state-of-the-art (SOTA) methods on multiple cross-dataset benchmarks.
Conclusion: This work offers a practical way to reduce MLLMs' reliance on manual annotation in deepfake detection and confirms the effectiveness of reinforcement learning in improving interpretability and generalization.
Abstract: Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging, particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at https://github.com/deon1219/rlsbi.
[59] Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception
Bo Yuan,Danpei Zhao,Wentao Li,Tian Li,Zhiguo Jiang
Main category: cs.CV
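TL;DR: This paper proposes a continual panoptic perception (CPP) framework that combines multimodal and multi-task continual learning. Through a collaborative cross-modal encoder, a malleable knowledge inheritance module, a cross-modal consistency constraint, and an asymmetric pseudo-labeling strategy, it mitigates catastrophic forgetting and semantic obfuscation and improves joint pixel-level, instance-level, and image-level perception.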
Details
Motivation: Existing continual learning research focuses mainly on single-task scenarios and struggles with semantic obfuscation and catastrophic forgetting in multi-task and multimodal scenarios, limiting the overall capability of intelligent perception systems.
Method: A continual panoptic perception (CPP) model comprising a collaborative cross-modal encoder (CCE), a malleable knowledge inheritance module based on contrastive feature distillation and instance distillation, and a cross-modal consistency constraint, extended to CPP+; an asymmetric pseudo-labeling mechanism avoids exemplar replay.
Result: Experiments on multimodal datasets and diverse continual learning tasks show that the model performs especially well on fine-grained continual learning tasks.
Conclusion: The CPP framework effectively integrates multimodal and multi-task continual learning, markedly improves panoptic perception, and offers a new paradigm for intelligent perception AI systems.
Abstract: Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. However, pioneering research has predominantly focused on single-task CL, which restricts the potential in multi-task and multimodal scenarios. Beyond the well-known issue of catastrophic forgetting, multi-task CL also brings semantic obfuscation across multimodal alignment, leading to severe model degradation during incremental training steps. In this paper, we extend CL to continual panoptic perception (CPP), integrating multimodal and multi-task CL to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. We formalize the CL task in multimodal scenarios and propose an end-to-end continual panoptic perception model. Concretely, the CPP model features a collaborative cross-modal encoder (CCE) for multimodal embedding. We also propose a malleable knowledge inheritance module via contrastive feature distillation and instance distillation, addressing catastrophic forgetting in a task-interactive boosting manner. Furthermore, we propose a cross-modal consistency constraint and develop CPP+, ensuring multimodal semantic alignment for model updating under multi-task incremental scenarios. Additionally, our proposed model incorporates an asymmetric pseudo-labeling scheme, enabling the model to evolve without exemplar replay. Extensive experiments on multimodal datasets and diverse CL tasks demonstrate the superiority of the proposed model, particularly in fine-grained CL tasks.
[60] SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based Occupancy Prediction
Zichen Yu,Quanli Liu,Wei Wang,Liyong Zhang,Xiaoguang Zhao
Main category: cs.CV
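TL;DR: This paper proposes SuperOcc, a new superquadric-based framework for 3D occupancy prediction. With cohesive temporal modeling, multi-superquadric decoding, and efficient voxel splatting, it addresses insufficient temporal modeling, the difficult trade-off between sparsity and geometric expressiveness, and low computational efficiency in existing methods, reaching SOTA performance on SurroundOcc and Occ3D with higher efficiency.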
Details
Motivation: Most existing 3D occupancy prediction methods use dense scene representations and overlook the sparsity of real driving scenes; the emerging 3D superquadric representation offers strong geometric expressiveness and sparsity, but still suffers from insufficient temporal modeling, a hard trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting.
Method: The SuperOcc framework has three core designs: (1) a cohesive temporal modeling mechanism that exploits both view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy that enhances geometric expressiveness without sacrificing query sparsity; (3) an efficient superquadric-to-voxel splatting scheme that improves computational efficiency.
Result: SuperOcc achieves SOTA performance on the SurroundOcc and Occ3D benchmarks while maintaining higher computational efficiency.
Conclusion: SuperOcc overcomes key limitations of current superquadric-based occupancy prediction methods, strikes a better balance between accuracy and efficiency, and demonstrates the potential of sparse geometric representations for 3D occupancy prediction.
Abstract: 3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at https://github.com/Yzichen/SuperOcc.
[61] Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams
Zhenghui Guo,Yuanbin Man,Junyuan Sheng,Bowen Lin,Ahmed Ahmed,Bo Jiang,Boyuan Zhang,Miao Yin,Sian Jin,Omprakash Gnawal,Chengming Zhang
Main category: cs.CV
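TL;DR: Event-VStream is an event-aware framework for video stream understanding that triggers generation at the boundaries of semantically coherent discrete events and builds a persistent memory bank to support long-horizon reasoning, markedly improving performance on several real-time long-video benchmarks.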
Details
Motivation: Existing VLMs face redundant frame processing and rapid forgetting of past context in real-time long-video understanding; fixed-interval decoding and cache pruning struggle to balance efficiency with information completeness.
Method: The Event-VStream framework jointly uses motion, semantic, and predictive cues to detect semantic state transitions (event boundaries) in the video and triggers language generation only at those boundaries; each event is encoded as an embedding and stored in a persistent memory bank, supporting long-horizon reasoning at low latency.
Result: Event-VStream gains +10.4 points over VideoLLM-Online-8B on OVOBench-Realtime, performs close to the specialized Flash-VStream-7B while using only a general-purpose LLaMA-3-8B text backbone, and keeps a GPT-5 win rate of around 70% on 2-hour Ego4D streams.
Conclusion: Event-driven sparse processing and a persistent memory mechanism can balance real-time operation with long-horizon understanding, offering a new paradigm for streaming multimodal large models.
Abstract: Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.
[62] Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling
Hongyang Wei,Hongbo Liu,Zidong Wang,Yi Peng,Baixin Xu,Size Wu,Xuying Zhang,Xianglong He,Zexiang Liu,Peiyu Wang,Xuchen Song,Yangguang Li,Yang Liu,Yahui Zhou
Main category: cs.CV
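TL;DR: This paper presents Skywork UniPic 3.0, a unified multimodal framework for single-image editing and multi-image composition (with a focus on human-object interaction, HOI, tasks). It accepts 1~6 input images at arbitrary resolutions and produces outputs at arbitrary resolutions within a total pixel budget of 1024x1024. With a new data pipeline, a sequence-modeling paradigm, and a post-training acceleration strategy (8-step sampling, 12.5x speedup), it reaches SOTA performance on both single-image editing and multi-image composition.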
Details
Motivation: Community demand for multi-image composition (especially HOI) is high, but existing models have not disclosed methodological details for high-quality fusion, so a systematic solution is needed.
Method: The unified multimodal framework Skywork UniPic 3.0; a high-quality HOI-oriented data collection, filtering, and synthesis pipeline (only 700K samples); multi-image composition formulated as a sequence generation problem; post-training acceleration via trajectory mapping and distribution matching.
Result: The model is SOTA on the single-image editing benchmark and clearly ahead of Nano-Banana and Seedream 4.0 on the multi-image composition benchmark; it generates high-fidelity samples in 8 steps with a 12.5x inference speedup.
Conclusion: The unified framework, data strategy, and sequence-based training paradigm effectively address the consistency and quality challenges of multi-image composition, validating their generality and efficiency.
Abstract: The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary (1~6) number and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on the single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on the multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models and dataset are publicly available.
[63] Consistency-Regularized GAN for Few-Shot SAR Target Recognition
Yikui Zhai,Shikuang Liu,Wenlve Zhou,Hongsheng Zhang,Zhiheng Zhou,Xiaolin Tian,C. L. Philip Chen
Main category: cs.CV
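TL;DR: This paper proposes a consistency-regularized generative adversarial network (Cr-GAN) that synthesizes high-quality SAR images from extremely limited samples to ease few-shot recognition. Its dual-branch discriminator, channel-wise feature interpolation, and dual-domain cycle consistency mechanism markedly improve generalization with few samples and achieve leading performance on the MSTAR and SRSDD datasets.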
Details
Motivation: Few-shot recognition in SAR imagery is limited by extreme data scarcity, while conventional GANs need abundant training data, which contradicts the few-shot premise; a method that can stably generate high-quality images from very few samples is needed.
Method: The Cr-GAN framework: a dual-branch discriminator decouples adversarial training from representation learning; channel-wise feature interpolation creates new latent features; a dual-domain (image domain and feature domain) cycle consistency mechanism preserves semantic consistency; the framework supports multiple GAN backbones and is compatible with multiple self-supervised learning algorithms.
Result: In the 8-shot setting, accuracy reaches 71.21% on MSTAR and 51.64% on SRSDD, significantly outperforming existing baselines, with about 1/5 of the parameters of state-of-the-art diffusion models.
Conclusion: Cr-GAN resolves the conflict between data scarcity and model data requirements in few-shot SAR image generation and recognition, offering a scalable, efficient paradigm for radar image understanding in low-data settings.
Abstract: Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5 of the parameters of state-of-the-art diffusion models. Code is available at: https://github.com/yikuizhai/Cr-GAN.
[64] Performance-guided Reinforced Active Learning for Object Detection
Zhixuan Liang,Xingyu Zeng,Rui Zhao,Ping Luo
Main category: cs.CV
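TL;DR: This paper proposes MGRAL, a performance-guided reinforced active learning method for object detection that uses mAP improvement as the reward signal, employs a reinforcement learning agent to select the most informative sample batches, and reduces the cost of mAP estimation with fast look-up tables.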
Details
Motivation: Existing active learning methods assess data informativeness without directly tying it to downstream task performance (such as mAP in object detection), so annotation efficiency is disconnected from model performance gains.
Method: The MGRAL framework uses expected model output change as the informativeness measure; a policy-gradient-based reinforcement learning sampling agent handles the combinatorial explosion of batch selection and the non-differentiability of mAP; an unsupervised fast look-up table approach approximates mAP to reduce computational overhead.
Result: On active learning for object detection on PASCAL VOC and COCO, MGRAL achieves the highest AL curve and provides convincing visualizations.
Conclusion: MGRAL establishes a new paradigm of reinforcement learning-driven active learning for object detection, better aligning annotation efficiency with downstream performance gains.
Abstract: Active learning (AL) strategies aim to train high-performance models with minimal labeling efforts, only selecting the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data's distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e., mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness. To address the combinatorial explosion challenge of batch sample selection and the non-differentiable correlation between model performance and selected batches, MGRAL employs a reinforcement learning-based sampling agent that optimizes selection using policy gradient with mAP improvement as reward. Moreover, to reduce the computational overhead of mAP estimation with unlabeled samples, MGRAL uses an unsupervised approach with fast look-up tables, ensuring feasible deployment. We evaluate MGRAL's active learning performance on detection tasks over PASCAL VOC and COCO benchmarks. Our approach demonstrates the highest AL curve with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.
[65] Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs
Mingyu Yu,Lana Liu,Zhehao Zhao,Wei Wang,Sujuan Qin
Main category: cs.CV
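TL;DR: This paper proposes the Beyond Visual Safety (BVS) framework, which uses a "reconstruction-then-generation" strategy to jailbreak the visual safety boundaries of multimodal large language models (MLLMs), with a success rate of 98.21%.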
Details
Motivation: Existing research pays considerable attention to the textual safety of MLLMs but has not sufficiently explored their visual safety boundaries; the robustness of their visual content alignment needs systematic evaluation.
Method: The BVS framework adopts a "reconstruction-then-generation" strategy that combines neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, inducing MLLMs to generate harmful images.
Result: BVS achieves a 98.21% jailbreak success rate against GPT-5 (January 2026 release), significantly higher than baseline methods.
Conclusion: Current MLLMs have serious deficiencies in visual safety alignment; BVS exposes key vulnerabilities in their cross-modal safety mechanisms and provides an important basis for future defense research.
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby inducing MLLMs to generate harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
[66] Enhanced LULC Segmentation via Lightweight Model Refinements on ALOS-2 SAR Data
Ali Caglayan,Nevrez Imamoglu,Toru Kouyama
Main category: cs.CV
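TL;DR: This paper proposes a lightweight semantic segmentation method for national-scale ALOS-2 SAR data over Japan that addresses blurred boundaries, missed thin structures, and poor long-tail class performance, improving both LULC classification and water detection.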
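A minimal sketch of one way an α scale factor could temper class reweighting, as mentioned in this entry's Method. The entry does not give the formula, so the exponent form below (balanced inverse-frequency weights raised to the power α) and the function name are assumptions, not the authors' definition. Under this assumption α = 0 turns reweighting off and α = 1 gives full inverse-frequency weights.

```python
from typing import List, Sequence


def class_weights(pixel_counts: Sequence[int], alpha: float) -> List[float]:
    """Tempered inverse-frequency class weights (assumed form).

    The balanced weight for class c is total / (num_classes * count_c);
    raising it to the power alpha interpolates between uniform weights
    (alpha = 0) and full inverse-frequency weights (alpha = 1).
    """
    if not pixel_counts or any(c <= 0 for c in pixel_counts):
        raise ValueError("every class needs a positive pixel count")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    total = sum(pixel_counts)
    num_classes = len(pixel_counts)
    return [(total / (num_classes * count)) ** alpha for count in pixel_counts]
```

Weights of this kind would typically multiply the per-class terms of a focal or dice loss; an intermediate α keeps rare classes from being ignored without letting them dominate the gradient.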
Details
Motivation: Address common problems in SAR dense prediction (boundary over-smoothing, missed thin structures, and degraded rare-class performance under long-tailed labels) without increasing pipeline complexity.
Method: Building on SAR-W-MixMAE self-supervised pretraining, three lightweight refinements are introduced: (i) injecting high-resolution features into multi-scale decoding; (ii) a progressive refine-up decoding head that alternates convolutional refinement and stepwise upsampling; (iii) an α scale factor in the focal+dice loss that tempers class reweighting.
Result: The model improves consistently on the Japan-wide ALOS-2 LULC benchmark, especially for rare classes, and improves water detection across standard metrics.
Conclusion: The lightweight refinements mitigate key weaknesses of SAR semantic segmentation, improving recognition of long-tail classes and fine-grained structures while keeping the model simple.
Abstract: This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes (boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels) without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an $\alpha$-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
[67] Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework
Shubham Shukla,Kunal Sonalkar
Main category: cs.CV
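TL;DR: This paper proposes a three-tier evaluation framework for systematically assessing vision-language models (VLMs) on fine-grained fashion attribute prediction, with a focus on decoupling attribute applicability detection (e.g., "outer fabric" is not applicable when no outer garment is present) from fine-grained classification. Benchmarking nine VLMs on DeepFashion-MultiModal shows that zero-shot VLMs substantially outperform the conventional embedding-plus-classifier approach, but applicability detection (only 34.1% NA-F1) is the main bottleneck, while efficient models reach over 90% of flagship performance.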
Details
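Motivation: VLMs lack systematic evaluation on multi-attribute fashion prediction, and fashion attributes are conditional (some are meaningful only when a particular garment is present). A model must first decide whether an attribute is applicable (applicability detection) and only then classify it, yet this key step has not been adequately modeled or evaluated.
Method: Proposes a three-tier evaluation framework: (1) overall task performance (including the NA class); (2) attribute applicability detection (NA or not); (3) fine-grained classification when the attribute is applicable. On DeepFashion-MultiModal, which explicitly annotates NA, the study evaluates nine VLMs (spanning flagship, efficient, and ultra-efficient tiers) against baseline classifiers built on Fashion-CLIP embeddings.
Result: (1) Zero-shot VLMs reach 64.0% macro-F1, three times that of the Fashion-CLIP plus logistic regression baseline; (2) VLMs reach 70.8% F1 on fine-grained classification (Tier 3) but only 34.1% NA-F1 on applicability detection (Tier 2), exposing the core bottleneck; (3) efficient models (e.g., GPT-5 Mini) reach over 90% of flagship performance and are more cost-effective.
Conclusion: Attribute applicability detection is the key weakness of current VLMs in fine-grained fashion prediction. The three-tier framework pinpoints the source of errors (visibility judgment vs. classification) and gives a diagnostic basis for optimizing production systems; efficient VLMs are already practical to deploy.
Abstract: Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including the NA class, which indicates that an attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning the attribute does not exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes.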
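As an illustrative sketch only (not the authors' evaluation code), the three tiers described above could be computed along the following lines. The `NA` label string, the helper names, and the choice of macro-F1 for Tier 3 are assumptions made for this example; the paper's exact metric definitions may differ.

```python
NA = "NA"  # hypothetical label meaning "attribute not applicable / not visible"


def f1_for_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over the given classes."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def three_tier_report(y_true, y_pred):
    """Compute the three tiers for one attribute from ground-truth and predicted labels."""
    # Tier 1: overall macro-F1 over every ground-truth class, NA included.
    tier1 = macro_f1(y_true, y_pred, sorted(set(y_true)))
    # Tier 2: applicability detection, scored as F1 of the NA class.
    tier2 = f1_for_class(y_true, y_pred, NA)
    # Tier 3: fine-grained macro-F1 restricted to samples whose true label is not NA.
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != NA]
    kept_true = [t for t, _ in kept]
    kept_pred = [p for _, p in kept]
    fine_classes = sorted(set(kept_true))
    tier3 = macro_f1(kept_true, kept_pred, fine_classes) if fine_classes else 0.0
    return {"tier1_macro_f1": tier1, "tier2_na_f1": tier2, "tier3_macro_f1": tier3}


report = three_tier_report(
    ["cotton", "NA", "denim", "NA"],
    ["cotton", "cotton", "denim", "NA"],
)
```

In this toy input the model classifies every applicable sample correctly but predicts a fabric for one sample where the attribute is not applicable, so Tier 3 is perfect while Tier 2 drops. That is the kind of gap the framework is meant to surface.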
Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
[68] VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
Chenglin Li,Qianglong Chen,Feng Han,Yikun Wang,Xingxi Yin,Yan Gong,Ruilin Li,Yin Zhang,Jiaqi Wang
Main category: cs.CV
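TL;DR: This paper proposes VideoThinker, an agentic video large language model trained on synthetic tool-interaction trajectories. Multi-step tool-use sequences are generated in caption space and mapped back to video frames to build a large-scale video and tool reasoning dataset, enabling dynamic reasoning and adaptive temporal exploration over long videos.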
Details
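Motivation: Existing video LLMs rely on static reasoning over uniformly sampled frames, which weakens temporal localization and causes severe information loss in long videos. Building agentic video understanding data, in turn, requires models that already have strong long-video understanding, creating a circular dependency.
Method: Proposes VideoThinker: videos are converted into rich captions, a strong agentic language model generates multi-step tool-use sequences in caption space (e.g., temporal retrieval, spatial and temporal zoom), and the captions are then replaced with the corresponding video frames. This yields a synthetic video and tool reasoning dataset that can be generated without real long-video understanding, on which the model is trained.
Result: VideoThinker significantly outperforms caption-only language model agents and strong video model baselines on long-video benchmarks, demonstrating the effectiveness of tool-augmented synthetic data and adaptive retrieval and zoom reasoning.
Conclusion: Training on synthetic tool-interaction trajectories breaks the data-construction bottleneck of agentic video understanding, equipping the model with dynamic reasoning, adaptive temporal exploration, and multi-step tool use, and offering a new path for long-video understanding.
Abstract: Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use.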
Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool-augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.
[69] FAIR-ESI: Feature Adaptive Importance Refinement for Electrophysiological Source Imaging
Linyong Zou,Liang Zhang,Xiongfei Wang,Jia-Hong Gao,Yi Sun,Shurong Sheng,Kuntao Xiao,Wanli Yang,Pengfei Teng,Guoming Luan,Zhao Lv,Zikang Xu
Main category: cs.CV
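TL;DR: This paper proposes FAIR-ESI, a framework that improves electrophysiological source imaging accuracy through adaptive multi-view feature importance refinement (spectral, temporal, and attention-based), validated on simulated and clinical data.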
Details
Motivation: Accurately selecting and refining features is a key challenge for precise electrophysiological source imaging (ESI), and existing model-based optimization and deep learning methods remain limited by it.
Method: Proposes the FAIR-ESI framework, comprising three modules (FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement) that adaptively optimize feature importance across views.
Result: Experiments on two simulation datasets and two real-world clinical datasets show that FAIR-ESI significantly improves ESI localization accuracy and robustness.
Conclusion: FAIR-ESI offers a more accurate source imaging tool for diagnosing brain disorders and supports a deeper understanding of brain function.
Abstract: An essential technique for diagnosing brain disorders is electrophysiological source imaging (ESI). While model-based optimization and deep learning methods have achieved promising results in this field, the accurate selection and refinement of features remains a central challenge for precise ESI. This paper proposes FAIR-ESI, a novel framework that adaptively refines feature importance across different views, including FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement. Extensive experiments on two simulation datasets with diverse configurations and two real-world clinical datasets validate our framework's efficacy, highlighting its potential to advance brain disorder diagnosis and offer new insights into brain function.
[70] Sub-Region-Aware Modality Fusion and Adaptive Prompting for Multi-Modal Brain Tumor Segmentation
Shadi Alijani,Fereshteh Aghaee Meibodi,Homayoun Najjaran
Main category: cs.CV
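TL;DR: This paper proposes a new framework for adapting foundation models to multi-modal medical imaging, combining sub-region-aware modality attention with adaptive prompt engineering, and significantly improves brain tumor segmentation, especially for the necrotic core.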
Details
Motivation: Existing foundation models struggle to fuse multi-source information effectively in multi-modal medical imaging and to adapt to the heterogeneity of pathological tissue.
Method: Proposes a sub-region-aware modality attention mechanism and an adaptive prompt engineering strategy, allowing the model to dynamically select the best modality combination for each tumor sub-region and to exploit the inherent capabilities of foundation models to refine segmentation accuracy.
Result: On the BraTS 2020 dataset, the method significantly outperforms baseline methods, most notably on the necrotic core sub-region.
Conclusion: The work provides a principled and practical approach to foundation-model fusion and prompt design in multi-modal medical imaging.
Abstract: The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.
[71] Breaking the Resolution Barrier: Arbitrary-resolution Deep Image Steganography Framework
Xinjue Hu,Chi Wang,Boyu Wang,Xiang Zhang,Zhenshan Tan,Zhangjie Fu
Main category: cs.CV
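TL;DR: This paper proposes ARDIS, the first arbitrary-resolution deep image steganography framework. A frequency decoupling architecture and an implicit reconstructor address the detail loss that resolution mismatch causes in conventional methods, and the framework supports blind recovery.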
Details
Motivation: Existing deep image steganography methods require the secret image and the cover image to share the same resolution. Secret images of other resolutions must be resampled beforehand, which loses detail, and they cannot be restored to their original resolution when that resolution is unknown.
Method: Proposes the ARDIS framework: (1) in the hiding stage, a frequency decoupling architecture splits the secret image into a resolution-aligned global basis and a resolution-agnostic high-frequency latent; (2) in the recovery stage, a latent-guided implicit reconstructor renders high-frequency residuals through a continuous implicit function; (3) an implicit resolution coding strategy enables blind recovery.
Result: ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity.
Conclusion: ARDIS shifts deep image steganography from discrete mapping to reference-guided continuous signal reconstruction, resolving the core challenges of resolution mismatch and achieving high-fidelity blind recovery at arbitrary resolutions.
Abstract: Deep image steganography (DIS) has achieved significant results in capacity and invisibility. However, current paradigms require the secret image to maintain the same resolution as the cover image during hiding and revealing. This leads to two challenges: secret images with inconsistent resolutions must undergo resampling beforehand, which results in detail loss during recovery, and the secret image cannot be recovered to its original resolution when the resolution value is unknown. To address these, we propose ARDIS, the first Arbitrary Resolution DIS framework, which shifts the paradigm from discrete mapping to reference-guided continuous signal reconstruction. Specifically, to minimize the detail loss caused by resolution mismatch, we first design a Frequency Decoupling Architecture in the hiding stage. It disentangles the secret into a resolution-aligned global basis and a resolution-agnostic high-frequency latent to hide in a fixed-resolution cover. Second, for recovery, we propose a Latent-Guided Implicit Reconstructor to perform deterministic restoration. The recovered detail latent code modulates a continuous implicit function to accurately query and render high-frequency residuals onto the recovered global basis, ensuring faithful restoration of original details. Furthermore, to achieve blind recovery, we introduce an Implicit Resolution Coding strategy. By transforming discrete resolution values into dense feature maps and hiding them in the redundant space of the feature domain, the reconstructor can correctly decode the secret's resolution directly from the steganographic representation.
Experimental results demonstrate that ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity.
[72] White-Box mHC: Electromagnetic Spectrum-Aware and Interpretable Stream Interactions for Hyperspectral Image Classification
Yimin Zhu,Lincoln Linlin Xu,Zhengsen Xu,Zack Dewis,Mabel Heffring,Saeid Taleghanidoozdoozan,Motasem Alkayid,Quinn Ledingham,Megan Greenwood
Main category: cs.CV
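TL;DR: This paper proposes ES-mHC, a physical spectrum-aware white-box hyper-connection framework for hyperspectral image classification. It uses structured, directional matrices to explicitly model interactions among electromagnetic spectrum groupings, improving interpretability and insight into the model's internal mechanisms.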
Details
Motivation: Existing deep learning models for hyperspectral image classification rely on opaque spectral-spatial feature mixing, which limits interpretability and makes internal decision mechanisms hard to understand.
Method: Proposes the ES-mHC framework, which separates feature representation from interaction structure and uses hyper-connection matrices over the residual streams to explicitly model directional interactions among electromagnetic spectrum groupings, with support for visualization and spatial analysis.
Result: Experiments show that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, and that a higher expansion rate accelerates the emergence of structured interaction patterns.
Conclusion: ES-mHC turns hyperspectral image classification from a purely black-box prediction task into a structurally transparent, partially white-box learning process.
Abstract: In hyperspectral image classification (HSIC), most deep learning models rely on opaque spectral-spatial feature mixing, limiting their interpretability and hindering understanding of internal decision mechanisms. We present physical spectrum-aware white-box mHC, named ES-mHC, a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (the residual streams in mHC) using structured, directional matrices. By separating feature representation from interaction structure, ES-mHC promotes electromagnetic spectrum grouping specialization, reduces redundancy, and exposes internal information flow that can be directly visualized and spatially analyzed. Using hyperspectral image classification as a representative testbed, we demonstrate that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model's internal dynamics. Furthermore, we find that increasing the expansion rate accelerates the emergence of structured interaction patterns. These results suggest that ES-mHC transforms HSIC from a purely black-box prediction task into a structurally transparent, partially white-box learning process.
[73] Atlas-Assisted Segment Anything Model for Fetal Brain MRI (FeTal-SAM)
Qi Zeng,Weide Liu,Bo Li,Ryne Didier,P. Ellen Grant,Davood Karimi
Main category: cs.CV
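TL;DR: This paper proposes FeTal-SAM, a new variant of the Segment Anything Model (SAM) tailored to fetal brain MRI segmentation. By combining atlas-guided prompts with foundation-model principles, it improves flexibility and interpretability across different label definitions.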
Details
Motivation: Conventional deep learning methods need large annotated datasets with a fixed label set, so they adapt poorly when clinical or research needs change. Fetal brain MRI segmentation has two further problems: models must be retrained repeatedly, and it is unclear whether segmentations are driven by image contrast or by spatial priors.
Method: Uses multi-atlas registration to generate spatially aligned label templates that serve as dense prompts and feeds them, together with bounding-box prompts, into SAM's segmentation decoder. This yields a binary segmentation per structure; the results are then fused to reconstruct the full 3D segmentation volume.
Result: Validated on dHCP and an in-house dataset. For well-contrasted structures such as the cortical plate and cerebellum, Dice scores are comparable to state-of-the-art baselines trained for each dataset and label definition; user-specified anatomical structures can be segmented; accuracy is slightly lower for low-contrast structures such as the hippocampus and amygdala.
Conclusion: FeTal-SAM is a general-purpose fetal brain MRI segmentation model that does not require frequent retraining, which significantly improves clinical adaptability and marks an important step toward deployable clinical analysis tools.
Abstract: This paper presents FeTal-SAM, a novel adaptation of the Segment Anything Model (SAM) tailored for fetal brain MRI segmentation. Traditional deep learning methods often require large annotated datasets for a fixed set of labels, making them inflexible when clinical or research needs change. By integrating atlas-based prompts and foundation-model principles, FeTal-SAM addresses two key limitations in fetal brain MRI segmentation: (1) the need to retrain models for varying label definitions, and (2) the lack of insight into whether segmentations are driven by genuine image contrast or by learned spatial priors. We leverage multi-atlas registration to generate spatially aligned label templates that serve as dense prompts, alongside a bounding-box prompt, for SAM's segmentation decoder. This strategy enables binary segmentation on a per-structure basis, which is subsequently fused to reconstruct the full 3D segmentation volumes. Evaluations on two datasets, the dHCP dataset and an in-house dataset, demonstrate FeTal-SAM's robust performance across gestational ages. Notably, for well-contrasted structures like the cortical plate and cerebellum, it achieves Dice scores comparable to state-of-the-art baselines that were trained for each dataset and label definition, while maintaining the flexibility to segment any user-specified anatomy. Although slightly lower accuracy is observed for subtle, low-contrast structures (e.g., hippocampus, amygdala), our results highlight FeTal-SAM's potential to serve as a general-purpose segmentation model without exhaustive retraining.
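The abstract does not say how the per-structure binary segmentations are fused. One simple possibility, shown purely as an assumption and not as the authors' procedure, is a per-voxel argmax over the structure-wise foreground probabilities with a background threshold. The function name, the flat-list voxel layout, and the threshold value are all hypothetical.

```python
def fuse_binary_masks(prob_maps, threshold=0.5):
    """Fuse per-structure foreground probabilities into one multi-label map.

    prob_maps: dict mapping a positive integer label id to a list of
               per-voxel foreground probabilities (all lists the same length).
    threshold: a voxel stays background (label 0) unless some structure's
               probability exceeds this value.
    Returns a list with one label id per voxel.
    """
    labels = sorted(prob_maps)
    n_voxels = len(prob_maps[labels[0]])
    fused = []
    for i in range(n_voxels):
        best_label, best_prob = 0, threshold
        for label in labels:
            p = prob_maps[label][i]
            if p > best_prob:  # keep the most confident structure so far
                best_label, best_prob = label, p
        fused.append(best_label)
    return fused


# Two hypothetical structures (ids 1 and 2) over three voxels.
fused = fuse_binary_masks({1: [0.9, 0.2, 0.6], 2: [0.1, 0.3, 0.8]})
```

A real pipeline would operate on 3D arrays and might resolve overlaps differently (for example, with atlas-derived priors); the sketch only shows the general shape of a fusion step.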
This method thus constitutes a promising step toward clinically adaptable fetal brain MRI analysis tools.
[74] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps
Yuhan Chen,Ying Fang,Guofa Li,Wenxuan Yu,Yicui Shi,Jingrui Zhang,Kefei Qian,Wenbo Chu,Keqiang Li
Main category: cs.CV
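TL;DR: This paper proposes LL-GaussianMap, the first unsupervised framework to bring 2D Gaussian Splatting (2DGS) into low-light image enhancement. It generates gain maps through explicit structural modeling, preserving edges and suppressing artifacts without relying on paired data.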
Details
Motivation: Most existing low-light enhancement methods operate in the pixel domain or in implicit feature spaces and neglect the intrinsic geometric structural priors of images. 2DGS offers excellent structural fitting and rendering efficiency but has not been explored for low-level vision tasks.
Method: Proposes a two-stage unsupervised framework. The first stage performs high-fidelity structural reconstruction with 2DGS. The second stage renders data-driven enhancement dictionary coefficients through the Gaussian rasterization mechanism within a unified enhancement module, producing structure-aware gain maps.
Result: LL-GaussianMap outperforms existing methods in enhancement quality with an extremely low storage footprint, confirming the effectiveness of explicit Gaussian representations for image enhancement.
Conclusion: Bringing 2DGS into low-light enhancement is feasible and effective. Explicit structural modeling markedly improves enhancement quality and structural fidelity and offers a new paradigm for unsupervised low-level vision tasks.
Abstract: Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning.
Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.
[75] LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting
Yuhan Chen,Wenxuan Yu,Guofa Li,Yijun Xu,Ying Fang,Yicui Shi,Long Cao,Wenbo Chu,Keqiang Li
Main category: cs.CV
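TL;DR: This paper proposes LL-GaussianImage, the first zero-shot unsupervised framework that performs low-light enhancement directly in the 2D Gaussian Splatting (2DGS) compressed representation domain, avoiding the efficiency loss and secondary degradation of the conventional decompression-enhancement-recompression pipeline.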
Details
Motivation: Existing low-light enhancement algorithms operate mainly in the pixel domain. Processing 2DGS-compressed images therefore requires a cumbersome decompression-enhancement-recompression pipeline, which is inefficient and introduces secondary degradation.
Method: Proposes a semantic-guided Mixture-of-Experts enhancement framework, a multi-objective collaborative loss function system, and a two-stage optimization process, achieving compression-as-enhancement and reconstruction-as-enhancement in the sparse attribute space of 2DGS.
Result: Achieves high-quality low-light enhancement while maintaining high compression ratios; experiments confirm the feasibility and advantages of processing directly in the compressed representation domain.
Conclusion: LL-GaussianImage establishes a new paradigm of low-light enhancement performed directly in an explicit compressed scene representation, balancing efficiency, quality, and compression ratio.
Abstract: 2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image compression due to high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly within the pixel domain. Processing 2DGS-compressed images necessitates a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework designed for low-light enhancement directly within the 2DGS compressed representation domain. Three primary advantages are offered by this framework. First, a semantic-guided Mixture-of-Experts enhancement framework is designed. Dynamic adaptive transformations are applied to the sparse attribute space of 2DGS using rendered images as guidance to enable compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system is established to strictly constrain smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process is utilized to achieve reconstruction-as-enhancement. The accuracy of the base representation is ensured through single-scale reconstruction and network robustness is enhanced. High-quality enhancement of low-light images is achieved while high compression ratios are maintained.
The feasibility and superiority of the paradigm for direct processing within the compressed representation domain are validated through experimental results.
[76] Diffusion Model-Based Data Augmentation for Enhanced Neuron Segmentation
Liuyun Jiang,Yanchao Zhang,Jinyue Guo,Yizhuo Lu,Ruining Zhou,Hua Han
Main category: cs.CV
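TL;DR: This paper proposes a diffusion-based data augmentation framework for neuron segmentation in electron microscopy. A resolution-aware conditional diffusion model and a biology-guided mask remodeling module generate structurally diverse and realistic image-label pairs, significantly improving segmentation in low-annotation settings.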
Details
Motivation: Existing deep learning methods depend on large amounts of manually annotated data, while conventional data augmentation produces samples with too little structural diversity to meet the structural realism that neuron segmentation demands.
Method: Proposes a diffusion-driven data augmentation framework: (1) a resolution-aware, multi-scale conditional diffusion model that incorporates EM resolution priors to synthesize voxel-level images from 3D masks; (2) a biology-guided mask remodeling module that improves the structural realism of augmented masks.
Result: On the AC3 and AC4 datasets under low-annotation regimes, the ARAND metric improves by 32.1% and 30.7%, respectively, when combined with two different post-processing methods.
Conclusion: The diffusion-based augmentation framework effectively alleviates annotation scarcity and improves neuron segmentation accuracy, offering a new direction for small-sample biological image analysis.
Abstract: Neuron segmentation in electron microscopy (EM) aims to reconstruct the complete neuronal connectome; however, current deep learning-based methods are limited by their reliance on large-scale training data and extensive, time-consuming manual annotations. Traditional methods augment the training set through geometric and photometric transformations; however, the generated samples remain highly correlated with the original images and lack structural diversity. To address this limitation, we propose a diffusion-based data augmentation framework capable of generating diverse and structurally plausible image-label pairs for neuron segmentation. Specifically, the framework employs a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors to enable voxel-level image synthesis from 3D masks. It further incorporates a biology-guided mask remodeling module that produces augmented masks with enhanced structural realism. Together, these components effectively enrich the training set and improve segmentation performance. On the AC3 and AC4 datasets under low-annotation regimes, our method improves the ARAND metric by 32.1% and 30.7%, respectively, when combined with two different post-processing methods. Our code is available at https://github.com/HeadLiuYun/NeuroDiff.
[77] Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
Pascal Benschop,Justin Dauwels,Jan van Gemert
Main category: cs.CV
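TL;DR: This paper proposes a synthetic video benchmark for evaluating the spatial reasoning of vision language models (VLMs), focusing on situational awareness (e.g., telling violent from benign behavior) and spatial awareness (e.g., role binding, trajectory alignment). Current VLMs perform only slightly above chance; stable color cues partly reduce role confusion but do not fix the underlying weakness in spatial reasoning.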
Details
Motivation: Existing VLMs are fragile on spatial reasoning tasks that depend on subtle temporal or geometric cues, and systematic, controllable evaluation tools are lacking.
Method: Builds a synthetic benchmark of minimal video pairs covering three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. Mainstream VLMs are evaluated in a training-free setting, with stable color cues introduced as an auxiliary analysis.
Result: All tested VLMs perform only slightly above chance on every task. Adding stable color cues partly reduces assailant role confusion but does not significantly improve overall spatial reasoning.
Conclusion: The spatial reasoning of current VLMs is seriously inadequate, and robustness will likely require combining lightweight spatial priors with large-scale pretraining. The data and code are released to support reproducible diagnostics and further method development.
Abstract: Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
[78] A Mobile Application for Flower Recognition System Based on Convolutional Neural Networks
Mustafa Yurdakul,Enes Ayan,Fahrettin Horasan,Sakir Tasdemir
Main category: cs.CV
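TL;DR: This paper presents a CNN-based mobile application that lets non-specialists quickly identify flower types. Comparing three models (MobileNet, DenseNet121, and Xception) trained with seven optimization algorithms, it finds DenseNet121 with SGD to be the best, at 95.84% accuracy.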
Details
Motivation: Identifying flowers usually requires expert knowledge, but experts are not always available, so a convenient and efficient automatic recognition tool is needed.
Method: Builds a mobile application based on three CNN models (MobileNet, DenseNet121, and Xception) and trains and evaluates each with seven optimization algorithms.
Result: DenseNet121 with the SGD optimizer performs best, with accuracy, precision, recall, and F1-score all at about 96%.
Conclusion: CNN models, DenseNet121 in particular, are well suited to flower classification on mobile devices and are practical to deploy.
Abstract: A convolutional neural network (CNN) is a deep learning algorithm that has been specifically designed for computer vision applications. CNNs have proved successful in handling the increasing amount of data in many computer vision problems where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decorating to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge. However, accessing experts at any time and in any location may not always be feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers and give non-specialists quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet121, and Xception, to determine the most suitable model for the mobile application. The classification performances of the models were evaluated by training them with seven different optimization algorithms. The DenseNet121 architecture trained with the stochastic gradient descent (SGD) optimization algorithm was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.
[79] Beyond Off-the-Shelf Models: A Lightweight and Accessible Machine Learning Pipeline for Ecologists Working with Image Data
Clare Chemery,Hendrik Edelhoff,Ludwig Bothmann
Main category: cs.CV
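TL;DR: This paper presents a lightweight machine learning experimentation pipeline that lets ecologists without deep ML expertise build image classification models tailored to local data and specific tasks, and validates it on age and sex classification of red deer.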
Details
Motivation: Lower the barrier for ecologists to apply machine learning to image classification, so that they no longer depend on off-the-shelf models and can build models customized to local data and specific research questions.
Method: Develops a lightweight ML experimentation pipeline that combines a command-line interface (preprocessing, training, evaluation) with a graphical interface (annotation, error analysis, model comparison), and tests multiple backbone networks, hyperparameters, and data augmentation strategies on a red deer camera trap dataset.
Result: With 3392 original images and 4352 expert-labeled cropped images, the best models reach 90.77% accuracy for age classification and 96.15% for sex classification.
Conclusion: Even with limited data, highly reliable task-specific classifiers can be built for well-defined ecological problems. The framework provides an accessible, extensible ML tool for wildlife monitoring and population demographic analysis and supports broader adoption of ML in ecology.
Abstract: We introduce a lightweight experimentation pipeline designed to lower the barrier for applying machine learning (ML) methods for classifying images in ecological research. We enable ecologists to experiment with ML models independently, so that they can move beyond off-the-shelf models and generate insights tailored to local datasets, specific classification tasks, and target variables. Our tool combines a simple command-line interface for preprocessing, training, and evaluation with a graphical interface for annotation, error analysis, and model comparison. This design enables ecologists to build and iterate on compact, task-specific classifiers without requiring advanced ML expertise. As a proof of concept, we apply the pipeline to classify red deer (Cervus elaphus) by age and sex from 3392 camera trap images collected in the Veldenstein Forest, Germany. Using 4352 cropped images containing individual deer labeled by experts, we trained and evaluated multiple backbone architectures with a wide variety of parameters and data augmentation strategies. Our best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification. These results demonstrate that reliable demographic classification is feasible even with limited data when the goal is to answer narrow, well-defined ecological problems.
More broadly, the framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, paving the way for broader adoption of ML in wildlife monitoring and demographic analysis.
[80] Towards Realistic Remote Sensing Dataset Distillation with Discriminative Prototype-guided Diffusion
Yonghao Xu,Pedram Ghamisi,Qihao Weng
Main category: cs.CV
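TL;DR: This paper is the first to bring dataset distillation to remote sensing image interpretation. It uses a text-to-image diffusion model to condense large-scale remote sensing datasets and proposes classifier-driven guidance and latent space clustering to improve the discriminability and diversity of the synthesized samples.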
Details
Motivation: Address the high storage and computation costs and the risk of sensitive data leakage that come with deep learning's reliance on large-scale annotated remote sensing data.
Method: Performs dataset distillation with a text-to-image diffusion model; adds a classification consistency loss from a pre-trained classifier to provide classifier-driven guidance; clusters training samples in latent space to select representative prototypes as visual style guidance; and uses a visual language model to generate aggregated text descriptions.
Result: On three high-resolution remote sensing scene classification benchmarks, the method distills realistic and diverse samples that effectively support downstream model training.
Conclusion: Dataset distillation is an effective new paradigm for reducing the data dependence of deep learning on remote sensing imagery, and the proposed method balances realism, discriminability, and semantic diversity.
Abstract: Recent years have witnessed the remarkable success of deep learning in remote sensing image interpretation, driven by the availability of large-scale benchmark datasets. However, this reliance on massive training data also brings two major challenges: (1) high storage and computational costs, and (2) the risk of data leakage, especially when sensitive categories are involved. To address these challenges, this study introduces the concept of dataset distillation into the field of remote sensing image interpretation for the first time. Specifically, we train a text-to-image diffusion model to condense a large-scale remote sensing dataset into a compact and representative distilled dataset. To improve the discriminative quality of the synthesized samples, we propose a classifier-driven guidance by injecting a classification consistency loss from a pre-trained model into the diffusion training process. Besides, considering the rich semantic complexity of remote sensing imagery, we further perform latent space clustering on training samples to select representative and diverse prototypes as visual style guidance, while using a visual language model to provide aggregated text descriptions. Experiments on three high-resolution remote sensing scene classification benchmarks show that the proposed method can distill realistic and diverse samples for downstream model training. Code and pre-trained models are available online (https://github.com/YonghaoXu/DPD).
[81] An IoT-Based Smart Plant Monitoring and Irrigation System with Real-Time Environmental Sensing, Automated Alerts, and Cloud Analytics
Abdul Hasib,A. S. M. Ahsanul Sarkar Akib
Main category: cs.CV
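TL;DR: This paper presents an IoT-based smart plant monitoring system that uses an ESP32 and multiple sensors for real-time environmental data collection, automated irrigation, and remote monitoring through a cloud platform, significantly improving water-use efficiency and plant health management.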
Details
Motivation: Global demand for sustainable agriculture is growing, while traditional farming suffers from water waste, uneven plant growth, and slow response to environmental changes.
Method: Uses an ESP32 microcontroller with DHT22, HC-SR04, and soil moisture sensors, combined with an OLED display, a buzzer alarm, and the ThingSpeak cloud platform, to collect, transmit, analyze, and visualize data.
Result: The system maintains soil moisture with 92% accuracy, saves about 40% of water, supports real-time monitoring, and allows scalable deployment at a low cost of $45.20.
Conclusion: The system is an affordable, reliable, and scalable precision agriculture solution suitable for both home gardening and commercial farming.
Abstract: The increasing global demand for sustainable agriculture necessitates intelligent monitoring systems that optimize resource utilization and plant health management. Traditional farming methods rely on manual observation and periodic watering, often leading to water wastage, inconsistent plant growth, and delayed response to environmental changes. This paper presents a comprehensive IoT-based smart plant monitoring system that integrates multiple environmental sensors with automated irrigation and cloud analytics. The proposed system utilizes an ESP32 microcontroller to collect real-time data from DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, with visual feedback through an OLED display and auditory alerts via a buzzer. All sensor data is wirelessly transmitted to the ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alert generation. Experimental results demonstrate the system's effectiveness in maintaining optimal soil moisture levels (with 92% accuracy), providing real-time environmental monitoring, and reducing water consumption by approximately 40% compared to conventional irrigation methods. The integrated web dashboard offers comprehensive visualization of plant health parameters, making it suitable for both small-scale gardening and commercial agriculture applications. With a total implementation cost of $45.20, this system provides an affordable, scalable solution for precision agriculture and smart farming.
[82] TinySense: Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing
Toan Gian,Dung T. Tran,Viet Quoc Pham,Francesco Restuccia,Van-Dinh Nguyen
Main category: cs.CV
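TL;DR: TinySense is an efficient VQGAN-based compression framework for Wi-Fi CSI data that improves the scalability of device-free, privacy-preserving human pose estimation (HPE), significantly reducing latency and network overhead while maintaining high HPE accuracy.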
Details
Motivation: Existing Wi-Fi sensing methods process large volumes of CSI data directly, which consumes excessive network resources and makes it hard to meet the needs of device-free, privacy-preserving human pose estimation.
Method: Proposes the TinySense framework: a vector quantization-based generative adversarial network (VQGAN) learns a compact codebook; K-means clusters the codebook dynamically to adapt the compression bitrate; a Transformer model mitigates bitrate loss and improves robustness under unstable network conditions; and a prototype is validated on Jetson Nano and Raspberry Pi.
Result: Compared with state-of-the-art compression schemes, the PCK20 score is up to 1.5x higher at the same compression rate, latency is reduced by up to 5x, and network overhead by up to 2.5x.
Conclusion: TinySense significantly improves the compression efficiency, real-time performance, and network friendliness of Wi-Fi sensing systems while preserving HPE accuracy, offering a feasible path to edge deployment.
Abstract: With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to cluster a large-scale pre-trained codebook into smaller subsets, which allows compression bitrates to be adjusted dynamically. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead by up to 5x and 2.5x, respectively. The code repository is available online.
[83] A Lightweight Brain-Inspired Machine Learning Framework for Coronary Angiography: Hybrid Neural Representation and Robust Learning Strategies
Jingsong Xia,Siqi Wang
Main category: cs.CV
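TL;DR: This paper proposes a lightweight, brain-inspired deep learning framework for binary classification of coronary angiography (CAG) images, achieving strong robustness and generalization under limited resources through selective neural plasticity training, an attention-modulated loss function, and class-imbalance-aware sampling.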
Details
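Motivation: In real clinical practice, coronary angiography images involve complex lesion morphology, severe class imbalance, high label uncertainty, and limited computational resources, and conventional deep learning methods lack robustness and generalization. Method: Builds a lightweight hybrid neural representation on a pretrained CNN; introduces a selective neural plasticity training strategy; designs a brain-inspired attention-modulated loss function that combines Focal Loss with label smoothing; adopts class-imbalance-aware sampling and cosine annealing with warm restarts. Result: Achieves competitive accuracy, recall, F1-score, and AUC on the binary classification task while maintaining high computational efficiency. Conclusion: Validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible, deployable solution for intelligent clinical decision support in resource-constrained settings. Abstract: Background: Coronary angiography (CAG) is a cornerstone imaging modality for assessing coronary artery disease and guiding interventional treatment decisions. However, in real-world clinical settings, angiographic images are often characterized by complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, posing substantial challenges to conventional deep learning approaches in terms of robustness and generalization. Methods: The proposed framework is built upon a pretrained convolutional neural network to construct a lightweight hybrid neural representation. A selective neural plasticity training strategy is introduced to enable efficient parameter adaptation. Furthermore, a brain-inspired attention-modulated loss function, combining Focal Loss with label smoothing, is employed to enhance sensitivity to hard samples and uncertain annotations.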
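As a rough illustration of how Focal Loss and label smoothing can be combined for a binary task, the sketch below applies focal weighting to a cross-entropy against a smoothed target. The exact formulation used in the paper is not given in this digest, so the function name `smoothed_focal_loss`, the defaults `gamma=2.0` and `eps=0.1`, and the way the two terms are combined are all assumptions.

```python
import math

def smoothed_focal_loss(p_pos, label, gamma=2.0, eps=0.1):
    """Focal-weighted cross-entropy against a label-smoothed binary target.

    p_pos : predicted probability of the positive class, strictly in (0, 1)
    label : hard label, 0 or 1
    gamma : focusing parameter (hypothetical default)
    eps   : label-smoothing strength (hypothetical default)
    """
    # Label smoothing for two classes: move the hard target eps/2 toward the other class.
    t_pos = label * (1.0 - eps) + eps / 2.0
    t_neg = 1.0 - t_pos
    p_neg = 1.0 - p_pos
    # Focal factors (1 - p)^gamma shrink the terms the model already predicts confidently.
    loss_pos = -t_pos * (p_neg ** gamma) * math.log(p_pos)
    loss_neg = -t_neg * (p_pos ** gamma) * math.log(p_neg)
    return loss_pos + loss_neg

# A confidently correct sample versus a poorly predicted ("hard") positive sample.
easy = smoothed_focal_loss(0.95, 1)
hard = smoothed_focal_loss(0.30, 1)
# With gamma = 0 and eps = 0 the expression reduces to plain cross-entropy.
plain_ce = smoothed_focal_loss(0.30, 1, gamma=0.0, eps=0.0)
```

In this sketch the hard sample receives a larger loss than the easy one, which is the behaviour the abstract attributes to the loss (more sensitivity to hard samples), while the smoothed target keeps a small penalty on overconfident predictions for uncertain labels.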
Class-imbalance-aware sampling and cosine annealing with warm restarts are adopted to mimic rhythmic regulation and attention allocation mechanisms observed in biological neural systems. Results: Experimental results demonstrate that the proposed lightweight brain-inspired model achieves strong and stable performance in binary coronary angiography classification, yielding competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency. Conclusion: This study validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible and deployable solution for intelligent clinical decision support under limited computational resources.
[84] Out-of-Distribution Detection Based on Total Variation Estimation
Dabiao Ma,Zhiba Su,Jian Yang,Haojun Fei
Main category: cs.CV
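TL;DR: This paper proposes TV-OOD, a new out-of-distribution detection method that uses the Total Variation Network Estimator to compute each input's contribution to the overall total variation (the total variation score), thereby distinguishing in-distribution from out-of-distribution data; on image classification tasks it matches or outperforms leading existing methods.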
Details
Motivation: Existing OOD detection methods perform well, but there is still room for improvement in handling the distribution shifts that deployed models may face in practice, which calls for more robust and interpretable detection mechanisms. Method: Proposes TV-OOD, whose core is the Total Variation Network Estimator: for each input it computes the contribution to the overall total variation, defines this as the total variation score, and uses it to discriminate between in- and out-of-distribution data. Result: Experiments across multiple models and datasets show that TV-OOD matches or exceeds current leading OOD detection methods on all evaluation metrics for image classification tasks. Conclusion: TV-OOD is an effective and general OOD detection method with strong performance, offering a new approach to making deployed machine learning models more robust under distribution shift. Abstract: This paper introduces a novel approach to securing machine learning model deployments against potential distribution shifts in practical applications, the Total Variation Out-of-Distribution (TV-OOD) detection method. Existing methods have produced satisfactory results, but TV-OOD improves upon these by leveraging the Total Variation Network Estimator to calculate each input's contribution to the overall total variation. By defining this as the total variation score, TV-OOD discriminates between in- and out-of-distribution data. The method's efficacy was tested across a range of models and datasets, consistently yielding results in image classification tasks that were either comparable or superior to those achieved by leading-edge out-of-distribution detection techniques across all evaluation metrics.
[85] PMPBench: A Paired Multi-Modal Pan-Cancer Benchmark for Medical Image Synthesis
Yifan Chen,Fei Yin,Hao Chen,Jia Wu,Chao Li
Main category: cs.CV
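TL;DR: This paper presents the first public, fully paired pan-cancer medical imaging dataset, covering 11 human organs and supporting multi-phase image translation for dynamic contrast-enhanced MRI (DCE) and non-contrast/contrast-enhanced CT (CT/CTC), and builds a comprehensive benchmark to advance research on contrast-free image synthesis.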
Details
Motivation: AI-driven contrast-free image synthesis is limited by data scarcity: public datasets are mostly restricted to paired brain MR data; other collections suffer from incomplete pairing, missing modalities/timestamps, spatial misalignment, and a lack of explicit enhancement-phase labels; and much high-quality data remains private. Method: Builds the first public, fully paired pan-cancer medical imaging dataset spanning 11 organs (with complete DCE1-DCE3 sequences and paired CT/CTC), emphasizing anatomical consistency; designs a rigorous evaluation framework supporting 1-to-1, N-to-1, and N-to-N translation; and systematically evaluates mainstream image-to-image translation models on the dataset. Result: Establishes the most comprehensive contrast synthesis benchmark to date and reports the performance of representative models across multi-organ, multi-modal, and multi-phase settings; the dataset and code are open-sourced to advance safe, effective contrast-free imaging research. Conclusion: Through a high-quality, broad-coverage, structurally paired dataset and a standardized benchmark, this work substantially eases the data bottleneck in medical image synthesis and provides key infrastructure for optimizing clinical workflows in multi-organ tumor imaging. Abstract: Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient's health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation.
We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at https://github.com/YifanChen02/PMPBench.
[86] Understanding the Transfer Limits of Vision Foundation Models
Shiqi Huang,Yipei Wang,Natasha Thorley,Alexander Ng,Shaheer Saeed,Mark Emberton,Shonit Punwani,Veeru Kasivisvanathan,Dean Barratt,Daniel Alexander,Yipeng Hu
Main category: cs.CV
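TL;DR: This paper examines why vision foundation models (VFMs) perform unevenly on downstream tasks, proposes that the main cause is a mismatch between pretraining objectives and downstream task demands, and uses prostate multiparametric MRI tasks to verify how the degree of alignment between pretraining and downstream tasks affects transfer performance.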
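The abstract for this entry measures alignment with maximum mean discrepancy (MMD) between the same features before and after fine-tuning. A minimal sketch of that kind of measurement follows, assuming an RBF kernel and the biased (V-statistic) estimator of squared MMD; the kernel choice, the bandwidth `sigma`, and the toy feature vectors are illustrative assumptions, not details from the paper.

```python
import math

def rbf(x, y, sigma=1.0):
    """RBF kernel between two feature vectors (sigma is a hypothetical bandwidth)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(A, B):
        return sum(rbf(a, b, sigma) for a in A for b in B) / (len(A) * len(B))
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2.0 * mean_kernel(X, Y)

# Toy stand-ins for features of the same inputs before and after fine-tuning.
before = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
after_small_shift = [[0.05, 0.0], [0.15, -0.1], [-0.05, 0.1]]
after_large_shift = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]

mmd_small = mmd2(before, after_small_shift)
mmd_large = mmd2(before, after_large_shift)
```

Under the reading given in this entry, a smaller value (features that move little during fine-tuning) would indicate better alignment between the pretraining objective and the downstream task.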
Details
Motivation: The uneven downstream performance of vision foundation models (VFMs) may stem from a mismatch between pretraining objectives (such as masked image reconstruction or contrastive learning) and the specific demands of downstream tasks (such as segmentation, classification, or image synthesis). Method: Evaluates two VFMs (the MAE-based ProFound and the contrastive-learning-based ProViCNet) on five clinical prostate multiparametric MRI tasks, and measures pretraining-downstream alignment with simple divergence metrics such as maximum mean discrepancy (MMD). Result: The better the alignment between pretraining and downstream tasks (the smaller the MMD), the larger the fine-tuning performance gain and the faster the convergence. Conclusion: Pretraining objectives should be designed and analyzed with downstream applicability in mind; task alignment is a key factor in improving VFM transfer performance. Abstract: Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
[87] RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture
Anas Anwarul Haq Khan,Mariam Husain,Kshitij Jadhav
Main category: cs.CV
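TL;DR: RadJEPA is a self-supervised medical visual representation learning framework that requires no language supervision. Built on a Joint Embedding Predictive Architecture, it is pre-trained on unlabeled chest X-ray images by predicting the latent representations of masked regions, and it outperforms existing methods such as Rad-DINO.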
Details
Motivation: Existing medical vision-language models depend on paired image-text data, which is scarce; this work asks whether a robust radiology encoder can be learned from unlabeled images alone, without language supervision. Method: Proposes RadJEPA, based on the Joint Embedding Predictive Architecture (JEPA), with self-supervised pre-training on purely unlabeled chest X-ray images; the objective is to predict latent-space representations of masked image regions, in contrast to image-text alignment or DINO-style self-distillation. Result: On multiple downstream tasks, including disease classification, semantic segmentation, and report generation, RadJEPA outperforms current state-of-the-art methods (including Rad-DINO). Conclusion: Using only unlabeled X-ray images and no language supervision, RadJEPA learns high-quality radiology visual representations, demonstrating the effectiveness and potential of purely visual self-supervision in medical imaging. Abstract: Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
[88] ThermoSplat: Cross-Modal 3D Gaussian Splatting with Feature Modulation and Geometry Decoupling
Zhaoqi Su,Shihai Chen,Xinyan Lin,Liqin Huang,Zhipeng Su,Xiaoqiang Lu
Main category: cs.CV
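TL;DR: This paper proposes ThermoSplat, a framework that uses cross-modal FiLM modulation and modality-adaptive geometric decoupling to achieve deep spectral-aware 3D Gaussian Splatting reconstruction from RGB and thermal infrared data, reaching SOTA rendering quality in both the visible and thermal spectrums on the RGBT-Scenes dataset.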
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) methods struggle to fuse RGB and thermal infrared data effectively; they often neglect cross-modal correlations or cannot adaptively handle the structural correlations and physical discrepancies between spectrums. Method: Proposes ThermoSplat: (1) a Cross-Modal FiLM Modulation mechanism that uses thermal structural priors to dynamically condition shared latent features and guide visible texture synthesis; (2) a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets for the thermal branch and runs an independent rasterization pass; (3) a hybrid rendering pipeline that combines explicit Spherical Harmonics with implicit neural decoding. Result: On the RGBT-Scenes dataset, ThermoSplat achieves state-of-the-art rendering quality in both the visible and thermal spectrums. Conclusion: Through spectral-aware feature modulation and geometric decoupling, ThermoSplat effectively models multi-spectral complementarity and discrepancy, offering a new paradigm for multi-modal scene reconstruction. Abstract: Multi-modal scene reconstruction integrating RGB and thermal infrared data is essential for robust environmental perception across diverse lighting and weather conditions. However, extending 3D Gaussian Splatting (3DGS) to multi-spectral scenarios remains challenging. Current approaches often struggle to fully leverage the complementary information of multi-modal data, typically relying on mechanisms that either tend to neglect cross-modal correlations or leverage shared representations that fail to adaptively handle the complex structural correlations and physical discrepancies between spectrums. To address these limitations, we propose ThermoSplat, a novel framework that enables deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling. First, we introduce a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors, effectively guiding visible texture synthesis with reliable cross-modal geometric cues. Second, to accommodate modality-specific geometric inconsistencies, we propose a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and executes an independent rasterization pass for the thermal branch. Additionally, a hybrid rendering pipeline is employed to integrate explicit Spherical Harmonics with implicit neural decoding, ensuring both semantic consistency and high-frequency detail preservation.
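For readers unfamiliar with FiLM (feature-wise linear modulation), the generic operation is a per-channel scale and shift of one feature vector, with the scale and shift predicted from a conditioning vector. The sketch below shows only that generic operation with hand-picked weights; the helper names `linear` and `film`, the single linear layers, and the toy values are assumptions and do not reproduce ThermoSplat's actual architecture.

```python
def linear(weights, bias, v):
    """Plain affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def film(shared, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each channel of `shared` using values predicted from `condition`."""
    gamma = linear(w_gamma, b_gamma, condition)  # per-channel scale
    beta = linear(w_beta, b_beta, condition)     # per-channel shift
    return [g * s + b for g, s, b in zip(gamma, shared, beta)]

# Toy example: a 2-channel shared feature conditioned on a 1-dimensional thermal cue.
shared_feature = [1.0, 2.0]
thermal_cue = [0.5]
modulated = film(
    shared_feature,
    thermal_cue,
    w_gamma=[[2.0], [0.0]], b_gamma=[0.0, 1.0],  # gamma = [1.0, 1.0]
    w_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0],    # beta  = [0.0, 0.5]
)
```

In a learned model the `w_*` and `b_*` values would be trained parameters, so the thermal input decides how strongly each shared channel is amplified or offset.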
Extensive experiments on the RGBT-Scenes dataset demonstrate that ThermoSplat achieves state-of-the-art rendering quality across both visible and thermal spectrums.
[89] Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models
Zhen Zhang,Runhao Zeng,Sicheng Zhao,Xiping Hu
Main category: cs.CV
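TL;DR: Through a systematic mechanistic study, this paper finds that affective modeling in multimodal foundation models relies mainly on the gating projection layer (gate_proj) of the feed-forward network rather than on the attention module; tuning only this module recovers most of AffectGPT's performance with far fewer tuned parameters, substantially improving parameter efficiency.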
Details
Motivation: Although existing affective models perform well, the internal architectural mechanisms that support affective understanding and generation remain unclear, especially in multimodal affective settings. Method: Across multiple architectures, training strategies, and affective tasks, analyzes how emotion-oriented supervision reshapes internal model parameters; uses controlled module transfer, targeted single-module adaptation, and destructive ablation to verify the role of gate_proj. Result: Affective adaptation localizes mainly to the feed-forward gating projection (gate_proj); tuning only about 24.5% of the parameters tuned by AffectGPT reaches 96.6% of its average performance; gate_proj is shown to be a sufficient, efficient, and necessary component for affective modeling. Conclusion: Affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms, and gate_proj is a central architectural locus of affective modeling. Abstract: Understanding where and how emotions are represented in large-scale foundation models remains an open problem, particularly in multimodal affective settings. Despite the strong empirical performance of recent affective models, the internal architectural mechanisms that support affective understanding and generation are still poorly understood. In this work, we present a systematic mechanistic study of affective modeling in multimodal foundation models. Across multiple architectures, training strategies, and affective tasks, we analyze how emotion-oriented supervision reshapes internal model parameters. Our results consistently reveal a clear and robust pattern: affective adaptation does not primarily focus on the attention module, but instead localizes to the feed-forward gating projection (gate_proj). Through controlled module transfer, targeted single-module adaptation, and destructive ablation, we further demonstrate that gate_proj is sufficient, efficient, and necessary for affective understanding and generation. Notably, by tuning only approximately 24.5% of the parameters tuned by AffectGPT, our approach achieves 96.6% of its average performance across eight affective tasks, highlighting substantial parameter efficiency. Together, these findings provide empirical evidence that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identify gate_proj as a central architectural locus of affective modeling.
[90] The Latency Wall: Benchmarking Off-the-Shelf Emotion Recognition for Real-Time Virtual Avatars
Yarin Benyamin
Main category: cs.CV
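TL;DR: This paper studies the feasibility of real-time emotion recognition support for individuals with Autism Spectrum Disorder (ASD) in VR, and finds that off-the-shelf general-purpose deep learning models struggle to meet the dual requirements of low latency (<140 ms) and high accuracy, with a "Latency Wall" in the classification stage in particular; YOLOv11n performs best for face detection, while general-purpose Vision Transformers such as CLIP and SigLIP fall short on both accuracy and speed, calling for lightweight, domain-specific architectures.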
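The latency figures quoted for this entry (a 140 ms motion-to-photon budget, about 54 ms for YOLOv11n detection, and more than 150 ms for CLIP/SigLIP classification) can be checked with simple arithmetic. The sketch assumes the two stages run sequentially and ignores every other contributor to latency such as capture and rendering; both are simplifying assumptions, and 150 ms is only the reported lower bound for classification. The 80 ms figure at the end is a hypothetical lightweight classifier, not a measured result.

```python
MTP_BUDGET_MS = 140.0                  # motion-to-photon budget quoted in the entry
DETECTION_MS = 54.0                    # approximate YOLOv11n face-detection latency
CLASSIFICATION_MS_LOWER_BOUND = 150.0  # CLIP / SigLIP are reported as slower than this

def fits_budget(stage_latencies_ms, budget_ms=MTP_BUDGET_MS):
    """True if the summed latency of sequential stages stays under the budget."""
    return sum(stage_latencies_ms) < budget_ms

# Time left for classification once detection has run.
remaining_for_classifier_ms = MTP_BUDGET_MS - DETECTION_MS

# Even at its lower bound, the general-purpose classifier alone exceeds the budget.
pipeline_ok = fits_budget([DETECTION_MS, CLASSIFICATION_MS_LOWER_BOUND])

# A hypothetical 80 ms classifier would fit alongside detection.
hypothetical_ok = fits_budget([DETECTION_MS, 80.0])
```

Under these assumptions a classifier has roughly 86 ms to work with, which is the gap the entry's call for lightweight, domain-specific architectures is aimed at.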
Details
Motivation: Developing accessible VR-assisted social skills training for individuals with ASD requires meeting a strict motion-to-photon (MTP) latency constraint (<140 ms), but existing SOTA deep learning models prioritize accuracy over real-time performance. Method: On the UIBVFED dataset, for zero-shot facial expression recognition (FER) on virtual characters, systematically benchmarks the face detection performance of the YOLO family (Medium and Nano variants of v8, v11, and v12) and the classification performance of general-purpose Vision Transformers such as CLIP, SigLIP, and ViT-FER, all with CPU-only inference. Result: Face detection on stylized avatars is robust (100% accuracy), with YOLOv11n detection latency of about 54 ms; but a "Latency Wall" exists in the classification stage, where CLIP and SigLIP reach <23% accuracy with >150 ms latency and cannot meet real-time loop requirements. Conclusion: General-purpose Transformer models are not suited to real-time emotion recognition in VR therapy; lightweight architectures designed specifically for FER on virtual characters are needed to combine low latency with usable accuracy and bring accessible AI-assisted VR therapy into practice. Abstract: In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and ViT-FER. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
[91] A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery
Valery Fischer,Alan Magdaleno,Anna-Katharina Calek,Nicola Cavalcanti,Nathan Hoffman,Christoph Germann,Joschua Wüthrich,Max Krähenmann,Mazda Farshad,Philipp Fürnstahl,Lilian Calvet
Main category: cs.CV
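TL;DR: This paper proposes a multi-view 3D hand pose estimation method that needs no domain-specific fine-tuning and relies on off-the-shelf pretrained models, and builds a new surgical benchmark dataset with over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, substantially improving accuracy in complex surgical environments.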
Details
Motivation: Surgical environments have intense, uneven lighting, frequent hand occlusions, and uniform hand appearance due to gloves; combined with a lack of annotated data, this makes existing 3D hand pose estimation methods hard to apply. Method: Builds an end-to-end multi-view pipeline: person detection and whole-body pose estimation first, then a SOTA 2D hand keypoint detector run on tracked hand crops, and finally a constrained 3D optimization to obtain the 3D pose; also releases the first large-scale benchmark dataset for hand pose in surgical scenes (with 2D annotations and triangulated 3D ground truth). Result: Compared with baselines, 2D mean joint error is reduced by 31% and 3D mean per-joint position error by 76%. Conclusion: This work provides a practical training-free solution and a high-quality open dataset for 3D hand pose estimation in surgery, laying an important foundation for follow-up research. Abstract: Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
[92] Class Confidence Aware Reweighting for Long Tailed Learning
Brainard Philemon Jagati,Jitendra Tembhurne,Harsh Goud,Rudra Pratap Singh,Chandrashekhar Meshram
Main category: cs.CV
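TL;DR: This paper proposes a loss-level, class- and confidence-aware reweighting scheme to address the performance degradation of deep neural networks under long-tailed data distributions; the scheme is complementary to existing logit adjustment methods, and its effectiveness is validated on multiple long-tailed datasets.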
Details
Motivation: Deep neural networks degrade significantly under long-tailed data distributions; existing methods mainly adjust the decision space (for example at the logit level) to compensate for class-prior bias, and pay less attention to optimization issues caused by differences in sample confidence. Method: Proposes a class- and confidence-aware reweighting scheme that operates purely at the loss level, using an Ω(p_t, f_c) function to dynamically modulate each sample's contribution to training according to the prediction confidence and the relative class frequency. Result: On multiple long-tailed datasets, including CIFAR-100-LT, ImageNet-LT, and iNaturalist2018, the scheme yields significant improvements under different imbalance factors, and the experimental results support the theoretical analysis. Conclusion: The proposed reweighting scheme effectively mitigates class imbalance in long-tailed learning and is complementary to logit adjustment methods, offering a new approach to long-tailed learning. Abstract: Deep neural network models degrade significantly under long-tailed data distributions, where the training data is dominated by a small set of head classes and the tail classes receive fewer training examples. To address this class imbalance, the related literature has focused mainly on adjustments in the decision space, such as corrections at the logit level to compensate for class-prior bias, with far less attention paid to the optimization effects that arise from differences in confidence among samples. In the current study, we present the design of a class and confidence-aware re-weighting scheme for long-tailed learning. This scheme is purely based upon the loss level and is complementary to existing methods that adjust the logits. In the practical implementation of the proposed scheme, we use an Ω(p_t, f_c) function. This function modulates each sample's contribution to training based on the confidence of the prediction and the relative frequency of the corresponding class. Significant experimental results on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors support the theoretical discussion.
[93] NeuroMamba: Multi-Perspective Feature Interaction with Visual Mamba for Neuron Segmentation
Liuyun Jiang,Yizhuo Lu,Yanchao Zhang,Jiazheng Liu,Hua Han
Main category: cs.CV
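TL;DR: This paper proposes NeuroMamba, a Mamba-based multi-perspective neuron segmentation framework that combines patch-free global modeling with local detail preservation to achieve high-precision neuron segmentation in electron microscopy images.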
Details
Motivation: Existing CNN methods lack long-range context, while Transformer methods lose voxel-level detail through patch partitioning, so both struggle with the ambiguous boundaries caused by the irregular morphology and densely intertwined structures of neurons. Method: Proposes the NeuroMamba framework: (1) a channel-gated Boundary Discriminative Feature Extractor (BDFE) that enhances local morphological cues; (2) a Spatial Continuous Feature Extractor (SCFE) with a resolution-aware scanning mechanism that adapts global dependency modeling to different data resolutions; (3) a cross-modulation mechanism that fuses the multi-perspective features; overall, Mamba's linear complexity enables patch-free global modeling. Result: Achieves SOTA performance on four public EM datasets, validating strong adaptability to both anisotropic and isotropic resolutions. Conclusion: NeuroMamba effectively balances long-range dependency modeling with fine-grained detail preservation, offering a new paradigm for accurate, robust neuron segmentation. Abstract: Neuron segmentation is the cornerstone of reconstructing comprehensive neuronal connectomes, which is essential for deciphering the functional organization of the brain. The irregular morphology and densely intertwined structures of neurons make this task particularly challenging. Prevailing CNN-based methods often fail to resolve ambiguous boundaries due to the lack of long-range context, whereas Transformer-based methods suffer from boundary imprecision caused by the loss of voxel-level details during patch partitioning. To address these limitations, we propose NeuroMamba, a multi-perspective framework that exploits the linear complexity of Mamba to enable patch-free global modeling and synergizes this with complementary local feature modeling, thereby efficiently capturing long-range dependencies while meticulously preserving fine-grained voxel details. Specifically, we design a channel-gated Boundary Discriminative Feature Extractor (BDFE) to enhance local morphological cues. Complementing this, we introduce the Spatial Continuous Feature Extractor (SCFE), which integrates a resolution-aware scanning mechanism into the Visual Mamba architecture to adaptively model global dependencies across varying data resolutions. Finally, a cross-modulation mechanism synergistically fuses these multi-perspective features. Our method demonstrates state-of-the-art performance across four public EM datasets, validating its exceptional adaptability to both anisotropic and isotropic resolutions.
The source code will be made publicly available.
[94] EVolSplat4D: Efficient Volume-based Gaussian Splatting for 4D Urban Scene Synthesis
Sheng Miao,Sijin Li,Pan Wang,Dongfeng Bai,Bingbing Liu,Yue Wang,Andreas Geiger,Yiyi Liao
Main category: cs.CV
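TL;DR: EvolSplat4D is a feed-forward novel view synthesis framework that uses a three-branch design to handle close-range static regions, dynamic actors, and far-field scenery in a unified way, balancing reconstruction quality and efficiency and outperforming existing methods on multiple autonomous driving datasets.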
Details
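Motivation: Existing neural radiance field and 3D Gaussian Splatting methods rely on time-consuming per-scene optimization, while emerging feed-forward methods adopt per-pixel Gaussian representations that cause 3D inconsistencies in complex, dynamic urban environments. Method: Proposes the EvolSplat4D feed-forward framework: (1) for close-range static regions, it predicts multi-frame-consistent 3D Gaussian geometry from a 3D feature volume and predicts appearance with a semantically-enhanced image-based rendering module; (2) for dynamic actors, it uses object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features; (3) for far-field scenery, it uses an efficient per-pixel Gaussian branch to ensure full-scene coverage. Result: On the KITTI-360, KITTI, Waymo, and PandaSet datasets, EvolSplat4D reconstructs static and dynamic environments with better accuracy and consistency than both per-scene optimization methods and state-of-the-art feed-forward baselines. Conclusion: With a three-branch design that unifies volume-based and pixel-based Gaussian prediction, EvolSplat4D addresses the difficulty of achieving both high quality and high efficiency in dynamic urban scenes, providing a more practical novel view synthesis solution for autonomous driving simulation. Abstract: Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage.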
Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
[95] HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models
Xin Xie,Jiaxian Guo,Dong Gong
Main category: cs.CV
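TL;DR: This paper proposes HyperAlign, a framework that trains a hypernetwork to dynamically generate low-rank adaptation weights at test time to modulate the denoising process of diffusion models, achieving efficient human-preference alignment without sacrificing diversity.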
Details
Motivation: Although diffusion models perform well, their outputs often fail to match human preferences and intentions, showing poor aesthetic quality and semantic inconsistencies; existing alignment methods struggle to balance diversity loss against computational overhead. Method: Proposes HyperAlign, which uses a hypernetwork to dynamically generate low-rank adaptation weights that modulate the diffusion model's generation operators, adaptively adjusting the denoising trajectory based on input latents, timesteps, and prompts; designs several variants that differ in how frequently the hypernetwork is applied, and optimizes the hypernetwork with a reward score objective regularized with preference data. Result: On multiple generative paradigms, including Stable Diffusion and FLUX, it significantly outperforms existing fine-tuning and test-time scaling baselines, improving both semantic consistency and visual appeal. Conclusion: HyperAlign achieves efficient, flexible, and robust test-time alignment that improves agreement with human preferences while preserving generation diversity, offering a new paradigm for diffusion model alignment. Abstract: Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model's generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
[96] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
Tingyu Song,Yanzhao Zhang,Mingxin Li,Zhuoning Guo,Dingkun Long,Pengjun Xie,Siyue Zhang,Yilun Zhao,Shu Wu
Main category: cs.CV
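TL;DR: This paper presents EDIR, a fine-grained composed image retrieval benchmark generated through image editing, comprising 5,000 high-quality queries for evaluating how well multimodal embedding models generalize to real-world scenarios.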
Details
Motivation: Existing CIR benchmarks have limited categories, fail to reflect real-world needs, and lack a comprehensive evaluation of models' fine-grained capabilities. Method: Uses image editing to precisely control modification types and content, builds a query synthesis pipeline covering a broad range of categories, and uses it to create the EDIR benchmark; systematically evaluates 13 multimodal embedding models and runs an in-domain training experiment to analyze task difficulty. Result: Current state-of-the-art models (e.g., RzenEmbed and GME) perform inconsistently across EDIR subcategories, exposing a significant capability gap; existing benchmarks are found to suffer from modality biases and insufficient categorical coverage; in-domain training improves some subcategories, while others still expose intrinsic limitations of current model architectures. Conclusion: EDIR is a more challenging and more realistic CIR benchmark that effectively reveals model weaknesses and supports research on more robust, fine-grained multimodal understanding. Abstract: Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.
[97] PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
Chak-Wing Mak,Guanyu Zhu,Boyi Zhang,Hongji Li,Xiaowei Chi,Kevin Zhang,Yichen Wu,Yangfan He,Chun-Kai Fan,Wentao Lu,Kuangzhi Ge,Xinyu Fang,Hongyang He,Kuan Lu,Tianxiang Xu,Li Zhang,Yongxin Ni,Youhua Li,Shanghang Zhang
Main category: cs.CV
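TL;DR: This paper introduces the PhysicsMind benchmark for evaluating how well multimodal large models and video world models understand physical laws (center of mass, lever equilibrium, Newton's first law), covering both visual question answering and video generation tasks, and finds that existing models still rely heavily on appearance heuristics and violate basic mechanics.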
Details
Motivation: Existing benchmarks cannot effectively measure models' understanding of physical laws; they mostly rely on synthetic templates or perceptual quality and do not evaluate consistency with physical laws. Method: Builds PhysicsMind, a unified benchmark with both real and simulation environments, and designs two task types, VQA and video generation, which test physical quantity reasoning and whether motion trajectories obey center-of-mass, torque, and inertial constraints, respectively. Result: A range of advanced models evaluated on PhysicsMind are found to rely on appearance heuristics and frequently violate basic mechanics. Conclusion: Current scaling and training strategies are still insufficient for robust physical understanding, and PhysicsMind provides a focused testbed for physics-aware multimodal modeling. Abstract: Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure this rely on synthetic Visual Question Answering templates or focus on perceptual video quality, which is tangential to measuring how well a video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
[98] Keyframe-Based Feed-Forward Visual Odometry
Weichen Dai,Wenhan Su,Da Kong,Yuhang Ming,Wanzeng Kong
Main category: cs.CV
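TL;DR: This paper proposes a reinforcement-learning-based adaptive keyframe selection policy for visual odometry (VO) that improves efficiency and accuracy while keeping the feed-forward network structure.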
Details
Motivation: Existing VO methods based on visual foundation models (such as VGGT-Long) process raw image sequences directly and ignore inter-frame parallax, causing computational redundancy and degraded performance; traditional geometric keyframe heuristics are hard to adapt to foundation models that depend on high-dimensional latent representations. Method: Proposes a keyframe-based feed-forward VO framework that uses reinforcement learning to learn an adaptive keyframe selection policy in a data-driven manner, aligning selection with the intrinsic characteristics of the foundation model, and trains it on the TartanAir dataset. Result: Experiments on several real-world datasets show consistent and substantial improvements over state-of-the-art feed-forward VO methods. Conclusion: Introducing reinforcement learning into keyframe selection bridges the gap between foundation models and traditional geometric heuristics, improving VO efficiency and accuracy without giving up the feed-forward design. Abstract: The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation model based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
[99] PAINT: Pathology-Aware Integrated Next-Scale Transformation for Virtual Immunohistochemistry
Rongze Ma,Mengkang Lu,Zhenyu Xiang,Yongsheng Pan,Yicheng Wu,Qingjie Zeng,Yong Xia
Main category: cs.CV
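TL;DR: This paper proposes PAINT, a structure-first autoregressive generation framework that uses a Spatial Structural Start Map (3S-Map) to synthesize virtual immunohistochemistry (IHC) images from H&E images, outperforming existing methods in structural fidelity and clinical downstream tasks.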
Details
Motivation: Traditional IHC staining is costly and consumes tissue, while inferring protein expression from H&E morphology alone is ambiguous, since similar morphology can correspond to different molecular states; existing direct appearance synthesis methods lack sufficient structural priors, leading to semantic inconsistencies. Method: Proposes Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that models synthesis as a structure-first conditional generation task and introduces a Spatial Structural Start Map (3S-Map) as a morphology-guided deterministic initialization, generating molecular details scale by scale in causal order. Result: On the IHC4BC and MIST datasets, PAINT significantly outperforms state-of-the-art methods in structural fidelity and clinical downstream tasks (such as molecular subtype classification). Conclusion: Structure-guided autoregressive modeling effectively improves the semantic consistency and clinical usability of virtual IHC synthesis, offering a new paradigm for cross-modal generation in digital pathology. Abstract: Virtual immunohistochemistry (IHC) aims to computationally synthesize molecular staining patterns from routine Hematoxylin and Eosin (H&E) images, offering a cost-effective and tissue-efficient alternative to traditional physical staining. However, this task is particularly challenging: H&E morphology provides ambiguous cues about protein expression, and similar tissue structures may correspond to distinct molecular states. Most existing methods focus on direct appearance synthesis to implicitly achieve cross-modal generation, often resulting in semantic inconsistencies due to insufficient structural priors. In this paper, we propose Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that reformulates the synthesis process as a structure-first conditional generation task. Unlike direct image translation, PAINT enforces a causal order by resolving molecular details conditioned on a global structural layout. Central to this approach is the introduction of a Spatial Structural Start Map (3S-Map), which grounds the autoregressive initialization in observed morphology, ensuring deterministic, spatially aligned synthesis. Experiments on the IHC4BC and MIST datasets demonstrate that PAINT outperforms state-of-the-art methods in structural fidelity and clinical downstream tasks, validating the potential of structure-guided autoregressive modeling.
[100] ProGiDiff: Prompt-Guided Diffusion-Based Medical Image Segmentation
Yuan Lin,Murong Xu,Marc Hölle,Chinmay Prabhakar,Andreas Maier,Vasileios Belagiannis,Bjoern Menze,Suprosanna Shit
Main category: cs.CV
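TL;DR: This paper proposes ProGiDiff, a framework that uses a pre-trained diffusion model and a ControlNet-style conditioning mechanism for multi-class medical image segmentation driven by natural language prompts, with support for expert interaction and cross-modality transfer.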
Details
Motivation: Existing medical image segmentation methods lack support for natural language prompts, multi-proposal generation, human interaction, and cross-modality adaptation; training text-to-image diffusion models from scratch is limited by data scarcity in the medical domain, and such models struggle to support multi-class segmentation and language prompts. Method: Proposes ProGiDiff, which uses a ControlNet-style conditioning mechanism with a custom encoder to steer a pre-trained diffusion model to output segmentation masks; the target organ is specified through a natural language prompt, which naturally supports multi-class segmentation; low-rank, few-shot adaptation enables cross-modality (CT to MR) transfer. Result: Outperforms previous methods on CT organ segmentation; supports expert-in-the-loop multi-proposal generation; transfers effectively to MR image segmentation after few-shot fine-tuning. Conclusion: ProGiDiff offers a flexible, promptable, interactive, and transferable paradigm for medical image segmentation, bridging the gap between traditional segmentation models and generative AI. Abstract: Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they lack the capability to estimate multiple proposals, human interaction, and cross-modality adaptation. Recently, text-to-image diffusion models have shown potential to bridge the gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation purposes. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting the target organ. Our experiment on organ segmentation from CT images demonstrates strong performance compared to previous methods and could greatly benefit from an expert-in-the-loop setting to leverage multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
[101] DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Chenyang Li,Jieyuan Liu,Bin Li,Bo Gao,Yilin Yuan,Yangfan He,Yuchen Li,Jingqun Tang
Main category: cs.CV
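TL;DR: This paper proposes a plug-and-play Distracting Token Pruning (DTP) framework that dynamically detects and prunes distracting image tokens in task-irrelevant regions of Vision-Language Action (VLA) models, improving task success rates without changing the model architecture or adding extra inputs.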
Details
Motivation: By default, VLA models may attend excessively to image tokens in task-irrelevant regions ('distracting tokens'), which interferes with action generation and lowers task success rates. Method: Proposes the Distracting Token Pruning (DTP) framework, which dynamically detects and prunes distracting image tokens to correct the model's visual attention patterns. Result: On the SIMPLER benchmark, DTP yields consistent relative success-rate improvements across several novel VLA models; analysis shows that task success rate is negatively correlated with the amount of attention in task-irrelevant regions. Conclusion: DTP is a simple, effective, and general plug-and-play method that reveals a common attention bias in VLA models and suggests directions for future research. Abstract: Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can distract the model from generating the desired action tokens in each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the performance upper bound of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
[102] DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models
Hanwen Zhang,Qiaojin Shen,Yuxi Liu,Yuesheng Zhu,Guibo Luo
Main category: cs.CV
TL;DR: DSFedMed is a dual-scale federated framework for medical image segmentation that enables mutual knowledge distillation between a centralized foundation model and lightweight client models, improving performance while drastically reducing communication and inference costs.
Details
Motivation: Foundation Models (FMs) face challenges in federated settings due to high computational demands, communication overhead, and inference costs—especially critical in resource-limited medical applications. Method: DSFedMed introduces mutual knowledge distillation between a centralized FM and lightweight client models; it uses synthetically generated high-quality medical images and a learnability-guided sample selection strategy to enhance distillation efficiency and effectiveness. Result: On five medical imaging segmentation datasets, DSFedMed achieves ~2% average Dice score improvement and reduces communication costs and inference time by ~90% compared to existing federated FM baselines. Conclusion: DSFedMed significantly improves efficiency and scalability of foundation models in federated medical image segmentation, enabling practical deployment under resource constraints. Abstract: Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. 
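The mutual distillation described above can be sketched as a pair of temperature-softened KL terms, one per direction. This is a minimal sketch only: the abstract does not specify the loss, so the KL form, the `temperature` value, and every function name here are assumptions for illustration, not DSFedMed's implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of per-class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_distillation_losses(fm_logits, client_logits, temperature=2.0):
    """Return (loss_fm, loss_client) for one pixel or sample.

    Assumed form: each model is pulled toward the other's softened
    prediction. In a real training loop the "teacher" side of each term
    would be treated as a constant (no gradient).
    """
    p_fm = softmax(fm_logits, temperature)
    p_client = softmax(client_logits, temperature)
    loss_client = kl_divergence(p_fm, p_client)  # client learns from the FM
    loss_fm = kl_divergence(p_client, p_fm)      # FM learns from the client
    return loss_fm, loss_client
```

Both terms vanish when the two models agree and grow as their predictions diverge, which is the behavior a two-way knowledge transfer needs.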
Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
[103] Masked Modeling for Human Motion Recovery Under Occlusions
Zhiyin Qian,Siwei Zhang,Bharat Lal Bhatnagar,Federica Bogo,Siyu Tang
Main category: cs.CV
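TL;DR: MoRo is an end-to-end generative method based on masked modeling for occlusion-robust human motion reconstruction from monocular video, combining high accuracy, high motion realism, and real-time speed (70 FPS).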
Details
Motivation: Existing methods are fragile under the frequent occlusions of real-world scenes: regression-based methods are efficient but not robust, while optimization- and diffusion-based methods are robust but slow at inference and heavy on preprocessing. Paired video-motion data is also scarce. Method: Proposes the MoRo framework: 1) a video-conditioned masked modeling task; 2) a cross-modality learning scheme that combines three priors: a MoCap trajectory prior, an image-conditioned pose prior, and a video-conditioned masked transformer that fuses the first two and is fine-tuned; 3) end-to-end training that outputs human motion in a global coordinate system directly from RGB video. Result: Substantially outperforms SOTA methods on the EgoBody and RICH datasets, with particular gains in accuracy and motion realism under occlusion; performance is on par in non-occluded scenes; real-time inference at 70 FPS on a single H200 GPU. Conclusion: Through masked modeling and the fusion of cross-modality priors, MoRo achieves efficient, robust, real-time human motion reconstruction and effectively addresses motion estimation under occlusion in monocular video. Abstract: Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference.
Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
[104] SAMTok: Representing Any Mask with Two Words
Yikang Zhou,Tao Zhang,Dengxian Gong,Yuanzheng Wu,Ye Tian,Haochen Wang,Haobo Yuan,Jiacong Wang,Lu Qi,Hao Fei,Anran Wang,Zhuochen Wang,Yujing Wang,Cheng Chen,Shunping Ji,Xiangtai Li
Main category: cs.CV
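TL;DR: This paper proposes SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens, and uses standard language modeling plus simple reinforcement learning to give base multimodal LLMs (such as QwenVL) pixel-level understanding and generation capabilities, without architectural changes or specialized loss functions.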
Details
Motivation: Existing pixel-wise multimodal large language models (MLLMs) are hard to scale because they rely on complex region encoders, specialized segmentation decoders, and incompatible training objectives. Method: Proposes SAMTok, a discrete mask tokenizer built on SAM2 that uses a mask encoder and a residual vector quantizer, trained on 209M masks, to produce compact, information-rich discrete tokens; masks are treated as new language tokens, so base MLLMs can be trained with next-token prediction and reinforcement learning driven by a textual answer-matching reward. Result: Reaches SOTA or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation; reinforcement learning brings substantial improvements on the GRES and GCG benchmarks. Conclusion: SAMTok offers a scalable, simple, and general paradigm for equipping MLLMs with strong pixel-wise capabilities; code and models are available. Abstract: Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
[105] Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification
Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Quinn Ledingham,Lincoln Linlin Xu
Main category: cs.CV
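TL;DR: This paper proposes CSSMamba, a framework that improves hyperspectral image classification through a clustering-guided spatial-spectral Mamba structure and an attention-driven token selection mechanism.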
Details
Motivation: Although Mamba models improve hyperspectral image (HSI) classification, they face a key challenge in defining efficient and adaptive token sequences. Method: Proposes the CSSMamba framework, comprising: 1) a clustering-guided spatial Mamba module (CSpaMamba) that shortens the sequence length and strengthens feature learning; 2) integration with a spectral Mamba module (SpeMamba) to form a complete spatial-spectral architecture; 3) an attention-driven token selection mechanism to optimize token sequencing; 4) a learnable clustering module that adaptively learns cluster memberships. Result: On the Pavia University, Indian Pines, and Liao-Ning 01 datasets, CSSMamba achieves higher classification accuracy and better boundary preservation than state-of-the-art CNN, Transformer, and Mamba-based methods. Conclusion: By combining clustering guidance, spatial-spectral modeling, and attention-driven token selection, CSSMamba effectively improves HSI classification, validating its effectiveness and robustness in sequence modeling and feature learning. Abstract: Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to better address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate the clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves Mamba feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to integrate clustering into the Mamba model in a coherent manner, we design a Learnable Clustering Module that learns the cluster memberships adaptively. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.
[106] Learning to Watermark in the Latent Space of Generative Models
Sylvestre-Alvise Rebuffi,Tuan Tran,Valeriu Lacatusu,Pierre Fernandez,Tomáš Souček,Nikola Jovanović,Tom Sander,Hady Elsahar,Alexandre Mourachko
Main category: cs.CV
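TL;DR: This paper proposes DistSeal, a unified method for watermark embedding and detection in the latent space of generative models, supporting both diffusion and autoregressive models; post-hoc watermarking models are trained in latent space and then distilled into the generative model or the decoder, yielding efficient, robust, and imperceptible watermarks.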
Details
Motivation: Existing image watermarking methods are mostly applied post hoc in pixel space, which incurs computational overhead and can introduce visual artifacts; more efficient and robust watermarking schemes are needed. Method: Proposes DistSeal, a latent-space watermarking framework: a post-hoc watermarking model is first trained in the latent space of the generative model and then distilled into the generative model itself or into the latent decoder, enabling in-model latent watermark embedding and detection. Result: The method matches pixel-space baselines in robustness while maintaining similar imperceptibility, with up to a 20x speedup; distilling latent watermarkers works better than distilling pixel-space ones. Conclusion: Latent-space watermarking is a better paradigm, and DistSeal provides an efficient, robust, and lightweight watermarking solution for generative models across architectures (diffusion and autoregressive). Abstract: Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.
[107] ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
Remy Sabathier,David Novotny,Niloy J. Mitra,Tom Monnier
Main category: cs.CV
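TL;DR: ActionMesh is a 3D diffusion model extended along the temporal dimension that quickly generates high-quality, topology-consistent, rig-free animated 3D meshes from inputs such as video, text, or a 3D mesh plus text.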
Details
Motivation: Existing methods for generating animated 3D objects suffer from limited setups, long runtimes, or limited quality, making them hard to apply in practice. Method: Proposes a 'temporal 3D diffusion' framework: 1) the 3D diffusion model is adapted to generate a sequence of temporally synchronized 3D latents; 2) a temporal 3D autoencoder maps a sequence of independent shapes to the deformations of a reference shape, from which the animation is built. Result: Reaches SOTA geometric accuracy and temporal consistency on standard benchmarks such as Consistent4D and Objaverse; generation is fast, and the results are rig-free and topology-consistent, which facilitates texturing and motion retargeting. Conclusion: ActionMesh delivers high-quality, efficient, and easy-to-integrate animated 3D mesh generation, improving the practical usability of generative 3D content in production. Abstract: Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dub "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
[108] HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval
Zequn Xie,Xin Liu,Boyun Zhang,Yuxiao Lin,Sihang Cai,Tao Jin
Main category: cs.CV
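TL;DR: This paper proposes HVD, a text-video retrieval model inspired by human vision that improves retrieval through a coarse-to-fine alignment mechanism (key frame selection and patch feature compression), reaching SOTA on five benchmarks.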
Details
Motivation: Existing methods suffer from 'blind' feature interaction: because textual queries are sparse, the model struggles to distinguish key visual information from background noise, and matching is imprecise. Method: Proposes the Human Vision-Driven (HVD) model, which comprises a Frame Features Selection Module (FFSM) that selects key frames to eliminate temporal redundancy and a Patch Features Compression Module (PFCM) that aggregates patch features into salient visual entities through an advanced attention mechanism, achieving coarse-to-fine alignment. Result: Achieves SOTA performance on five text-video retrieval benchmarks and shows that the model captures human-like visual focus. Conclusion: By imitating human macro- and micro-perception, HVD mitigates the visual interference caused by sparse text and improves text-video cross-modal alignment and retrieval. Abstract: The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
[109] 360Anything: Geometry-Free Lifting of Images and Videos to 360°
Ziyi Wu,Daniel Watson,Andrea Tagliasacchi,David J. Fleet,Marcus A. Brubaker,Saurabh Saxena
Main category: cs.CV
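TL;DR: This paper proposes 360Anything, a diffusion transformer framework requiring no geometric priors or camera parameters, which generates 360° panoramas end-to-end from a single image or video, resolves seam artifacts at ERP boundaries with Circular Latent Encoding, and exhibits implicit geometric understanding.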
Details
Motivation: Existing methods rely on known camera parameters for geometric alignment, which makes them hard to apply to uncalibrated in-the-wild data; the dependence on explicit geometric modeling and camera metadata needs to be removed. Method: Builds on a pre-trained diffusion transformer, treating both the perspective input and the ERP panorama as token sequences and learning the mapping in a purely data-driven way; introduces Circular Latent Encoding to eliminate the ERP boundary seams caused by zero-padding in the VAE encoder. Result: Reaches SOTA on perspective-to-360° generation for both images and videos, outperforming prior methods that use ground-truth camera parameters; competitive on zero-shot FoV and orientation estimation benchmarks. Conclusion: 360Anything demonstrates the feasibility of a purely data-driven, geometry-free generation paradigm that combines high-quality generation with implicit geometric understanding, extending the reach of diffusion models in immersive 3D content generation and basic vision tasks. Abstract: Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, hindering application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
[110] Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Shengbang Tong,Boyang Zheng,Ziteng Wang,Bingda Tang,Nanye Ma,Ellis Brown,Jihan Yang,Rob Fergus,Yann LeCun,Saining Xie
Main category: cs.CV
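TL;DR: This paper studies the scalability of Representation Autoencoders (RAEs) for large-scale text-to-image generation, finding that they are more stable than VAEs, converge faster, produce higher-quality generations, and support unified multimodal representation and reasoning.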
Details
Motivation: To explore whether the RAE framework can scale from ImageNet to large-scale, freeform text-to-image (T2I) generation, and to test its effectiveness and potential for simplification at large model scales. Method: Scales RAE decoders on a frozen SigLIP-2 encoder using web, synthetic, and text-rendering data; systematically evaluates whether the original RAE design choices (such as noise scheduling, diffusion head width, and noise-augmented decoding) remain necessary at scale; runs controlled comparisons against the FLUX VAE across 0.5B to 9.8B parameters, covering both pretraining and finetuning. Result: RAEs outperform VAEs in pretraining at every model scale; in finetuning, VAE models catastrophically overfit after 64 epochs while RAE models stay stable through 256 epochs with better performance; RAEs converge faster and generate higher-quality images; the shared representation space supports joint reasoning over visual understanding and generation. Conclusion: RAEs are a simpler and stronger foundation than VAEs for large-scale T2I generation, combining stability, scalability, and potential for multimodal unification. Abstract: Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation.
Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
[111] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Onkar Susladkar,Tushar Prakash,Adheesh Juvekar,Kiet A. Nguyen,Dong-Hwan Jang,Inderjit S Dhillon,Ismini Lourentzou
Main category: cs.CV
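TL;DR: This paper proposes PyraTok, a language-aligned pyramidal video tokenizer that improves cross-modal alignment and zero-shot transfer through multi-scale, semantically structured discrete latents, reaching SOTA performance on multiple video tasks.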
Details
Motivation: Existing discrete video VAEs typically learn visual codebooks at a single scale, with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and weak zero-shot transfer. Method: PyraTok builds on a pretrained video VAE and introduces a Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at multiple spatiotemporal resolutions and encoder depths using a shared large binary codebook, jointly optimizing multi-scale text-guided quantization and a hierarchical autoregressive objective. Result: Across ten benchmarks, delivers SOTA video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K/8K resolutions. Conclusion: Through language alignment and pyramidal multi-scale discretization, PyraTok strengthens the semantics, compactness, and cross-modal consistency of video representations, providing a better discrete foundation for video generation and understanding. Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K/8K resolutions.
[112] Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition
Geo Ahn,Inwoong Lee,Taeoh Kim,Minho Shim,Dongyoon Wee,Jinwoo Choi
Main category: cs.CV
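TL;DR: This paper studies Compositional Video Understanding (CVU) and finds that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail because of object-driven verb shortcuts; it proposes RCORE, which mitigates the problem with composition-aware augmentation and a temporal order regularization loss, and significantly improves accuracy on unseen compositions across multiple benchmarks.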
Details
Motivation: Existing ZS-CAR models generalize poorly to unseen verb-object compositions, mainly because they rely on object co-occurrence statistics instead of learning the visual semantics of verbs, an overlooked failure mode called 'object-driven verb shortcuts'. Method: Proposes the RCORE framework: (i) a composition-aware augmentation that diversifies verb-object combinations while preserving motion cues; (ii) a temporal order regularization loss that explicitly models temporal structure to suppress shortcut behaviors. Result: On two benchmarks, Sth-com and the newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and yields consistently positive compositional gaps. Conclusion: Object-driven verb shortcuts are a key bottleneck in ZS-CAR; robust compositional video understanding requires explicitly suppressing these shortcuts and strengthening temporally grounded verb learning. Abstract: We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps.
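The object-driven shortcut can be illustrated with a toy predictor that ignores the video entirely and guesses the verb from object co-occurrence counts. This is a sketch of the failure mode only, not of RCORE; the training pairs and function names are hypothetical and are not drawn from Sth-com or EK100-com.

```python
from collections import Counter, defaultdict


def fit_object_prior(train_pairs):
    """Count how often each verb co-occurs with each object in training."""
    counts = defaultdict(Counter)
    for verb, obj in train_pairs:
        counts[obj][verb] += 1
    return counts


def predict_verb_from_object(counts, obj):
    """Shortcut predictor: return the verb seen most often with `obj`.

    No visual or temporal evidence is used, which is exactly the
    behavior the abstract describes as overfitting to co-occurrence.
    """
    if obj not in counts:
        return None
    return counts[obj].most_common(1)[0][0]


# Hypothetical skewed supervision: "drawer" is only ever seen being closed.
train_pairs = (
    [("close", "drawer")] * 8
    + [("open", "door")] * 8
    + [("close", "door")] * 2
)
counts = fit_object_prior(train_pairs)

# Unseen composition at test time: ("open", "drawer").
shortcut_guess = predict_verb_from_object(counts, "drawer")
```

Because "open" never co-occurs with "drawer" in this training set, the shortcut predictor cannot answer the unseen composition correctly however clear the motion in the clip is.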
Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
[113] CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback
Wenhang Ge,Guibao Shen,Jiawei Feng,Luozhou Wang,Hao Lu,Xingye Tian,Xin Tao,Ying-Cong Chen
Main category: cs.CV
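TL;DR: This paper proposes CamPilot, which introduces a camera-aware 3D decoder that decodes video latents into 3D Gaussians and uses the pixel-level consistency between rendered novel views and ground-truth images as a reward signal, combined with a visibility constraint, to significantly improve camera controllability in video diffusion models.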