cs.CL
[1] Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration
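Longxuan Wei, Yubo Zhang, Zijiao Zhang, Zhihu Wang, Shiwan Zhao, Tianyu Huang, Huiting Zhao, Chenfei Liu, Shenao Zhang, Junchi Yan
Main category: cs.CL
TL;DR: This paper proposes Entropy-Tree, an entropy-based tree decoding method that improves the accuracy and calibration of large language models on reasoning tasks. It expands the search tree only at positions where the model is genuinely uncertain, and it outperforms existing strategies such as multi-chain sampling.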
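The branching idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the entropy threshold, the `max_branches` cap, and the function names are hypothetical choices made for illustration, and a real decoder would obtain the next-token distribution from model logits rather than from hand-written lists.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_width(probs, threshold=1.0, max_branches=3):
    """Decide how many children to expand at one decoding position.

    Hypothetical rule: follow a single continuation when entropy is below
    `threshold`; otherwise branch into up to `max_branches` candidates.
    """
    if token_entropy(probs) < threshold:
        return 1
    support = sum(1 for p in probs if p > 0.0)
    return min(max_branches, support)


# Toy distributions over a four-token vocabulary (illustrative values only).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(branch_width(confident), branch_width(uncertain))
```

With these toy inputs the confident position keeps a single path while the uniform one fans out, which is the qualitative behaviour the entry describes: compute is spent only where the distribution is flat.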
Details
Motivation: Existing decoding strategies (random sampling or independent multi-sampling) explore either blindly or redundantly, and struggle to deliver both reasoning accuracy and reliable uncertainty estimates.
Method: Entropy-Tree is a tree-based decoding method that uses the entropy of the model's output distribution as the branching signal, dynamically expanding the search tree only at high-entropy (high-uncertainty) positions.
Result: Across multiple models and datasets, Entropy-Tree achieves better pass@k than Multi-chain on reasoning tasks, and its predictive entropy yields better AUROC than several traditional uncertainty metrics.
Conclusion: Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation in a single decoding procedure, offering a better decoding paradigm for LLM reasoning.
Abstract: Large language models achieve strong reasoning performance, yet existing decoding strategies either explore blindly (random sampling) or redundantly (independent multi-sampling). We propose Entropy-Tree, a tree-based decoding method that exploits entropy as a signal for branching decisions, expanding the search tree only at positions where the model exhibits genuine uncertainty. Entropy-Tree shows superior accuracy and calibration in reasoning tasks: it achieves better pass@k than Multi-chain across multiple models and datasets, and its predictive entropy demonstrates better AUROC compared to several traditional metrics. Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure.
[2] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports
Edward Ajayi
Main category: cs.CL
TL;DR: This paper introduces AfriEconQA, the first benchmark dataset focused on African economic analysis. It contains 8,937 high-quality QA pairs for evaluating IR and RAG systems on specialized, time-sensitive economic documents that demand numerical reasoning. Experiments show that current LLMs, including RAG pipelines, perform very poorly on the task, exposing a domain knowledge gap.
Details
Motivation: The pretraining corpora of current LLMs contain very few specialized institutional documents on African economics, which leads to weak performance in this domain. A dedicated benchmark requiring high-precision numerical reasoning and temporal disambiguation is needed to drive domain-specific IR and RAG research.
Method: The authors build AfriEconQA from 236 World Bank reports, synthesizing 10,018 questions and rigorously filtering them down to 8,937 high-quality QA instances. An 11-experiment matrix compares a GPT-5 Mini zero-shot baseline with RAG configurations based on GPT-4o and Qwen 32B across five embedding and ranking strategies.
Result: Zero-shot models fail on over 90% of questions, and even the best RAG configuration struggles to reach high precision. This confirms that AfriEconQA is a challenging, robust benchmark that exposes the knowledge and reasoning gaps of current models in African economics.
Conclusion: AfriEconQA is the first benchmark for African economic analysis and fills a gap in domain-specific evaluation. Its results reveal the limits of both parametric knowledge and retrieval augmentation in specialized niche domains, and the dataset and code are to be released publicly.
Abstract: We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
[3] Embedding Retrofitting: Data Engineering for better RAG
Anantha Sharma
Main category: cs.CL
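TL;DR: This paper proposes a data engineering framework that addresses knowledge graph quality degradation caused by annotation artifacts (such as hashtags) in real-world corpora, and finds that preprocessing affects embedding retrofitting far more than the choice of algorithm.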
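As background for the retrofitting objective this entry refers to, here is a minimal sketch of the classic iterative retrofitting update, in which each vector is repeatedly averaged with its graph neighbours while staying anchored to its original value. It is not the EWMA variant evaluated in the paper; the weights (`alpha`, uniform `beta`), the function name, and the toy vectors are all assumptions made for illustration. The second call shows how one spurious edge drags a vector toward an unrelated word.

```python
import math


def retrofit(vectors, edges, iterations=10, alpha=1.0):
    """Classic retrofitting sketch: pull each vector toward its neighbours.

    `vectors` maps word -> list of floats; `edges` is a list of word pairs.
    `alpha` weights the original vector; neighbours share a total weight of 1.
    """
    original = {w: list(v) for w, v in vectors.items()}
    current = {w: list(v) for w, v in vectors.items()}
    neighbours = {w: [] for w in vectors}
    for a, b in edges:
        if a in neighbours and b in neighbours:
            neighbours[a].append(b)
            neighbours[b].append(a)

    for _ in range(iterations):
        for word, nbrs in neighbours.items():
            if not nbrs:
                continue  # isolated words keep their original vector
            beta = 1.0 / len(nbrs)
            updated = [alpha * x for x in original[word]]
            for n in nbrs:
                for d, x in enumerate(current[n]):
                    updated[d] += beta * x
            total = alpha + beta * len(nbrs)
            current[word] = [x / total for x in updated]
    return current


# Toy 2-d embeddings (illustrative values only).
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [5.0, 5.0]}
clean = retrofit(vectors, [("cat", "dog")])
noisy = retrofit(vectors, [("cat", "dog"), ("cat", "car")])  # one spurious edge

print(math.dist(clean["cat"], clean["dog"]))
print(clean["cat"], noisy["cat"])
```

On the clean graph the two related words move closer together and the unrelated one is untouched; with the spurious edge, `cat` drifts toward `car`, a small-scale picture of why graph noise can corrupt the objective.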
Details
Motivation: The effectiveness of existing embedding retrofitting methods depends heavily on knowledge graph quality, which in turn depends on text preprocessing. Annotation artifacts in real-world corpora (such as hashtags) distort graph structure and thereby hurt downstream retrieval.
Method: The paper proposes a data engineering framework for knowledge graph construction that identifies and mitigates the graph density inflation and spurious edges caused by annotation artifacts, hashtags in particular. It compares the retrieval performance of several retrofitting methods (including EWMA) before and after cleaning.
Result: On noisy graphs, all retrofitting methods degrade significantly (-3.5% to -5.2%, p < 0.05). After preprocessing, EWMA retrofitting improves by 6.2% (p = 0.0348), with gains of up to 33.8% on quantitative synthesis questions. The performance swing due to preprocessing quality (over 10%) far exceeds the difference between algorithms (about 3%).
Conclusion: Text preprocessing quality is the primary determinant of retrofitting success and matters far more than the choice of retrofitting algorithm; data engineering should sit at the center of the embedding enhancement pipeline.
Abstract: Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation (-3.5% to -5.2%, p < 0.05). After preprocessing, EWMA retrofitting achieves +6.2% improvement (p = 0.0348) with benefits concentrated in quantitative synthesis questions (+33.8% average). The gap between clean and noisy preprocessing (10%+ swing) exceeds the gap between algorithms (3%), establishing preprocessing quality as the primary determinant of retrofitting success.
[4] MALTopic: Multi-Agent LLM Topic Modeling Framework
Yash Sharma
Main category: cs.CL
TL;DR: This paper proposes MALTopic, a multi-agent LLM topic modeling framework that decomposes topic modeling into tasks handled by specialized LLM agents. It integrates structured and unstructured survey data and improves topic coherence, diversity, and interpretability.
Details
Motivation: Traditional topic modeling uses only free text, ignores structured or categorical survey data, and produces abstract topics that require extensive human interpretation.
Method: The MALTopic framework uses three LLM agents: an enrichment agent (uses structured data to enhance text), a topic modeling agent (extracts latent themes), and a deduplication agent (refines the results).
Result: Comparative experiments on a survey dataset show that MALTopic significantly improves topic coherence, diversity, and interpretability over LDA and BERTopic.
Conclusion: By integrating structured data with multi-agent collaboration, MALTopic generates more readable, more contextually relevant topics and offers a better approach to analyzing complex survey data.
Abstract: Topic modeling is a crucial technique for extracting latent themes from unstructured text data, particularly valuable in analyzing survey responses. However, traditional methods often only consider free-text responses and do not natively incorporate structured or categorical survey responses for topic modeling. They also produce abstract topics that require extensive human interpretation. To address these limitations, we propose the Multi-Agent LLM Topic Modeling Framework (MALTopic). This framework decomposes topic modeling into specialized tasks executed by individual LLM agents: an enrichment agent leverages structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the results. Comparative analysis on a survey dataset demonstrates that MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic. By integrating structured data and employing a multi-agent approach, MALTopic generates human-readable topics with enhanced contextual relevance, offering a more effective solution for analyzing complex survey data.
[5] Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis
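Weiwei Wang, Jiyong Min, Weijie Zou
Main category: cs.CL
TL;DR: This paper systematically studies "intelligence degradation" in long-context LLMs, a performance drop of more than 30% as the context approaches a critical length, and gives the first empirical characterization of the phenomenon in open-source Qwen models. It proposes an analysis based on natural length distributions, places the critical threshold for Qwen2.5-7B at 40-50% of the maximum context length, and builds a unified framework explaining "shallow long-context adaptation".
Details
Motivation: LLMs suffer catastrophic performance drops (over 30%) when processing long contexts near a critical length, which severely limits practical use, yet the causes and regularities of this behavior have not been systematically characterized.
Method: The authors analyze samples at their natural length (no truncation or padding), which gives stronger causal evidence about the effect of length. They run experiments on a mixed dataset of 1,000 samples covering 5%-95% of the context length, determine the critical threshold by cross-validating five methods, and build a unified framework explaining shallow long-context adaptation.
Result: The critical threshold for Qwen2.5-7B lies at 40-50% of the maximum context length, where F1 drops from 0.55-0.56 to 0.3 (a 45.5% decrease). The collapse is attributed to context length itself rather than to data distribution bias.
Conclusion: LLMs exhibit "shallow long-context adaptation": they adapt robustly only to short and medium contexts and degrade markedly beyond the critical threshold. The work provides the first systematic empirical benchmark and framework of this kind for the Qwen family in long-context deployment.
Abstract: Large Language Models (LLMs) exhibit catastrophic performance degradation when processing contexts approaching certain critical thresholds, even when information remains relevant. This intelligence degradation, defined as a drop of over 30% in task performance, severely limits long-context applications. This degradation shows a common pattern: models maintain strong performance up to a critical threshold, then collapse catastrophically. We term this shallow long-context adaptation: models adapt for short to medium contexts but fail beyond critical thresholds. This paper presents three contributions: (1) Natural Length Distribution Analysis: We use each sample's natural token length without truncation or padding, providing stronger causal evidence that degradation results from context length itself. (2) Critical Threshold Determination: Through experiments on a mixed dataset (1,000 samples covering 5%-95% of context length), we identify the critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), using five-method cross-validation. (3) Unified Framework: We consolidate shallow adaptation, explaining degradation patterns and providing a foundation for mitigation strategies. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models, offering practical guidance for deploying LLMs in long-context scenarios.
[6] Can We Trust LLM Detectors?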
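Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
Main category: cs.CL
TL;DR: This paper systematically evaluates mainstream AI text detection methods, finds them brittle under distribution shift, unseen generators, and simple stylistic perturbations, and proposes a supervised contrastive learning (SCL) framework that learns discriminative style embeddings to improve robustness.
Details
Motivation: Existing AI text detectors generalize poorly in real-world settings, failing under distribution shift, new generators, and stylistic perturbations, so more robust detection methods are needed.
Method: The paper proposes a supervised contrastive learning (SCL) framework that learns discriminative style embeddings to make detectors more robust to distribution shift and stylistic perturbation, and it systematically evaluates the two dominant paradigms, training-free and supervised.
Result: Supervised detectors perform well in-domain but degrade sharply out-of-domain; training-free methods are highly sensitive to the choice of proxy model; the SCL framework improves style discrimination, but the overall results still expose fundamental challenges in building domain-agnostic detectors.
Conclusion: Current AI text detection methods share a generalization bottleneck. Detection needs to be rethought in terms of style modeling and distributional robustness, since no single technique yet delivers reliable domain-agnostic detection.
Abstract: The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
[7] ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation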
Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han
Main category: cs.CL
TL;DR: This paper proposes the Illocution-Calibrated Policy Optimization (ICPO) framework. By adding ambiguous instructions to training and conditioning the reward on the user's illocutionary intent, it makes LLMs better at recognizing ambiguity, expressing uncertainty, or asking for clarification in multi-turn conversation. This mitigates the "lost-in-conversation" problem, improving multi-turn performance by 75% on average while preserving single-turn performance.
Details
Motivation: In multi-turn conversation, LLMs tend to make wrong assumptions when the user's initial instruction is ambiguous and then struggle to recover. Standard post-training methods such as RLVR worsen this by inducing overconfidence and suppressing the tendency to ask for clarification.
Method: ICPO is a training framework that (1) adds ambiguous or underspecified prompts to the training corpus and (2) aligns the reward signal with the user's illocutionary intent, encouraging the model to express uncertainty or ask questions when it faces ambiguity.
Result: ICPO substantially improves multi-turn conversation performance, by 75% on average, while maintaining robust performance on single-turn benchmarks.
Conclusion: ICPO offers a practical path toward more robust and collaborative conversational AI that copes better with the ambiguity and nuance of human interaction.
Abstract: Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.
[8] RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models
Rishit Chugh
Main category: cs.CL
TL;DR: This paper proposes a resource-efficient adversarial prompting method that replaces expensive online optimization (such as GCG) with retrieval from a database of pre-trained adversarial prompts. It keeps attack success rates high while sharply reducing compute, and targets red-teaming and safety evaluation of aligned LLMs.
Details
Motivation: Existing gradient-based automated jailbreaking methods (GCG, PEZ, GBDA) are effective but computationally expensive, which makes them impractical in resource-constrained settings. An efficient approach to adversarial prompt generation that needs no retraining is required.
Method: The authors build a dataset of 1,000 prompts classified into seven harm categories, evaluate GCG, PEZ, and GBDA on Llama 3 8B to select the most effective attack per category, and build a database of successful adversarial prompts that is matched to new prompts by semantic similarity.
Result: Prompt type correlates with algorithm effectiveness. The retrieval-based method reaches attack success rates comparable to GCG and the other methods across harm categories at significantly lower computational cost.
Conclusion: Prompt retrieval, which needs no fine-tuning or gradient computation, is a feasible and efficient approach to red-team evaluation, especially for LLM safety assessment in black-box or resource-constrained settings.
Abstract: The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient-based search. Although effective, these methods, particularly GCG, are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource-efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre-trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm-related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red-teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.
[9] No Reliable Evidence of Self-Reported Sentience in Small Large Language Models
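Caspar Kaiser, Sean Enderby
Main category: cs.CL
TL;DR: This paper queries several open-weights language models about their own consciousness and checks their answers with classifiers trained on internal activations. The models generally deny being conscious, and the classifiers find no evidence that these denials are untruthful. Within the Qwen family, larger models deny sentience more confidently than smaller ones.
Details
Motivation: The paper asks whether language models believe themselves to be sentient, rather than whether they actually are, because the latter question currently has no empirical answer.
Method: Open-weights models from three families (Qwen, Llama, GPT-OSS; 0.6B to 70B parameters) are asked approximately 50 questions about consciousness, and classifiers built with three interpretability methods analyze their internal activations to infer latent beliefs.
Result: (1) Models consistently deny being sentient and attribute consciousness to humans; (2) the classifiers find no clear evidence that these denials are untruthful; (3) within the Qwen family, larger models deny sentience more confidently.
Conclusion: Current mainstream open-weights language models show no sign of believing themselves sentient, either in behavior or in latent beliefs, which contrasts with recent claims that models harbour latent beliefs in their own consciousness.
Abstract: Whether language models possess sentience has no empirical answer. But whether they believe themselves to be sentient can, in principle, be tested. We do so by querying several open-weights models about their own consciousness, and then verifying their responses using classifiers trained on internal activations. We draw upon three model families (Qwen, Llama, GPT-OSS) ranging from 0.6 billion to 70 billion parameters, approximately 50 questions about consciousness and subjective experience, and three classification methods from the interpretability literature. First, we find that models consistently deny being sentient: they attribute consciousness to humans but not to themselves. Second, classifiers trained to detect underlying beliefs (rather than mere outputs) provide no clear evidence that these denials are untruthful. Third, within the Qwen family, larger models deny sentience more confidently than smaller ones. These findings contrast with recent work suggesting that models harbour latent beliefs in their own consciousness.
[10] From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs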
Angelina Parfenova, David Graus, Juergen Pfeffer
Main category: cs.CL
TL;DR: This paper proposes an LLM-based axial coding method that groups sentence-level open codes into higher-order categories, either by clustering or by direct LLM grouping. It applies the method to Dutch parliamentary debates and compares the two strategies on coverage, alignment, brevity, and related metrics.
Details
Motivation: Traditional axial coding is manual, slow, and hard to scale. The paper aims to automate it with LLMs to make qualitative analysis more scalable and consistent.
Method: Two LLM-driven axial coding strategies are proposed: (i) clustering embeddings of code-utterance pairs with density-based or partitioning algorithms, followed by LLM labeling of the categories; (ii) direct LLM grouping of codes and utterances. Both are implemented and evaluated on Dutch parliamentary debate data.
Result: Density-based clustering gives high coverage and good structural separation. Direct LLM grouping is more concise and achieves finer-grained semantic alignment, but with coverage that is 20% lower. Each approach has advantages on metrics such as ROUGE-L, BERTScore, coherence, and novelty.
Conclusion: LLMs can effectively support axial coding, and the two strategies reflect a trade-off between coverage and interpretability. The full dataset is released to support follow-up research.
Abstract: Axial coding is a commonly used qualitative analysis method that enhances document understanding by organizing sentence-level open codes into broader categories. In this paper, we operationalize axial coding with large language models (LLMs). Extending an ensemble-based open coding approach with an LLM moderator, we add an axial coding step that groups open codes into higher-order categories, transforming raw debate transcripts into concise, hierarchical representations. We compare two strategies: (i) clustering embeddings of code-utterance pairs using density-based and partitioning algorithms followed by LLM labeling, and (ii) direct LLM-based grouping of codes and utterances into categories. We apply our method to Dutch parliamentary debates, converting lengthy transcripts into compact, hierarchically structured codes and categories. We evaluate our method using extrinsic metrics aligned with human-assigned topic labels (ROUGE-L, cosine, BERTScore), and intrinsic metrics describing code groups (coverage, brevity, coherence, novelty, JSD divergence). Our results reveal a trade-off: density-based clustering achieves high coverage and strong cluster alignment, while direct LLM grouping results in higher fine-grained alignment, but lower coverage 20%. Overall, clustering maximizes coverage and structural separation, whereas LLM grouping produces more concise, interpretable, and semantically aligned categories. To support future research, we publicly release the full dataset of utterances and codes, enabling reproducibility and comparative studies.
[11] Memorization Dynamics in Knowledge Distillation for Language Models
Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, Archi Mitra, David A. Smith, Zheng Xu, Diego Garcia-Olano
Main category: cs.CL
TL;DR: This paper investigates how knowledge distillation (KD) affects training data memorization in LLMs, finding that KD reduces memorization by over 50% compared to fine-tuning, identifies highly memorizable examples, enables pre-distillation prediction of memorization using entropy/KL/perplexity features, and reveals hard distillation inherits more teacher-specific data than soft distillation.
Details
Motivation: While knowledge distillation is used for model compression and privacy preservation, the dynamics of training data memorization in KD, especially compared to standard fine-tuning, remain poorly understood.
Method: The authors empirically analyze memorization across the KD pipeline using three LLM families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2), measuring memorization under soft/hard distillation and standard fine-tuning, and evaluating predictability using zlib entropy, KL divergence, and perplexity.
Result: Distilled models memorize >50% less training data than fine-tuned ones; ~95% of memorization stems from inherently easy-to-memorize examples; memorization is predictable pre-distillation using entropy/KL/perplexity; hard distillation inherits 2.7× more teacher-specific examples than soft distillation.
Conclusion: Knowledge distillation not only improves efficiency and utility but also offers stronger generalization and lower memorization risk than standard fine-tuning, making it a promising privacy-aware adaptation strategy, though hard distillation requires caution due to higher teacher-data leakage.
Abstract: Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over ~95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits $2.7\times$ more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
[12] Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind
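Tamunotonye Harry, Ivoline Ngong, Chima Nweke, Yuanyuan Feng, Joseph Near
Main category: cs.CL
TL;DR: This paper introduces the Chameleon dataset for studying how user state and trait shape interactions with language models. It finds that existing models ignore changes in state, while reward models respond to state inconsistently.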
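To make the state-versus-trait split concrete, the sketch below partitions total variation in a score into a within-person share (state) and a between-person share (trait) using a plain sum-of-squares decomposition. This is a simplification chosen for illustration, not the Latent State-Trait model the paper draws on; the function name and the toy scores are hypothetical, and the input is assumed to contain at least some variation.

```python
def variance_shares(scores_by_user):
    """Return (within_share, between_share) of the total sum of squares.

    `scores_by_user` maps a user id to that user's scores across contexts.
    Assumes the scores are not all identical, so the total is non-zero.
    """
    all_scores = [s for scores in scores_by_user.values() for s in scores]
    grand_mean = sum(all_scores) / len(all_scores)

    ss_between = 0.0  # spread of user means around the grand mean (trait)
    ss_within = 0.0   # spread of each user's scores around their own mean (state)
    for scores in scores_by_user.values():
        user_mean = sum(scores) / len(scores)
        ss_between += len(scores) * (user_mean - grand_mean) ** 2
        ss_within += sum((s - user_mean) ** 2 for s in scores)

    total = ss_between + ss_within
    return ss_within / total, ss_between / total


# Toy data: each user is scored in three contexts (illustrative values only).
toy = {"u1": [1.0, 5.0, 3.0], "u2": [2.0, 6.0, 4.0]}
within, between = variance_shares(toy)
print(round(within, 3), round(between, 3))
```

In this toy example users differ little on average but swing widely across contexts, so most of the variation lands in the within-person share, the same qualitative pattern the entry reports.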
Details
Motivation: Existing persona datasets capture only static user traits and ignore the interaction context (state), so they cannot fully model how user behavior varies.
Method: The authors build Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, decompose variance following Latent State-Trait theory, and evaluate how LLMs and reward models respond to user state.
Result: 74% of behavioral variance is within-person (state) and only 26% is between-person (trait). LLMs are insensitive to state, and different reward models react to the same user state in opposite directions.
Conclusion: User state dominates in human-model interaction, and current LLMs and reward models model it poorly. Chameleon is released to support research on affective computing, personalized dialogue, and RLHF alignment.
Abstract: User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (such as PersonaChat and PANDORA) capture only trait and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person (state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
[13] Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs
Sydney Anuyah, Mehedi Mahmud Kaushik, Hao Dai, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty
Main category: cs.CL
TL;DR: This paper studies how knowledge graphs (KGs) can strengthen retrieval-augmented generation (RAG) to make LLM reasoning more trustworthy in healthcare. It finds that a KG precisely matched to the task scope, rather than a simple union of graphs, yields clear gains. Small and mid-sized models benefit most, larger models already carry strong priors, and decoding temperature has little effect.
Details
Motivation: LLMs generate fluent answers but fall short on trustworthy, domain-specific reasoning, especially in high-stakes fields such as healthcare. The paper asks whether augmenting RAG with domain knowledge graphs improves accuracy and trustworthiness.
Method: The authors build three disease-specific knowledge graphs from PubMed (T2DM, Alzheimer's disease, and a combined AD+T2DM graph), design two targeted probes, and systematically evaluate seven instruction-tuned LLMs across KG combinations (including a No-RAG baseline) and three decoding temperatures.
Result: Scope-matched knowledge graphs (notably $\mathbb{G}_2$) give the most consistent gains, whereas indiscriminate graph unions introduce distractors and reduce accuracy. On Probe 1, larger models often match or exceed KG-RAG without retrieval, while small and mid-sized models clearly benefit from well-scoped RAG. Temperature has only a minor effect.
Conclusion: A precision-first, scope-matched KG augmentation strategy should be preferred over breadth-first stacking of graphs. The paper also offers practical guidelines for graph selection, model sizing, and retrieval and reranking.
Abstract: Large Language Models (LLMs) generate fluent answers but can struggle with trustworthy, domain-specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval-Augmented Generation (RAG) for healthcare by constructing three PubMed-derived graphs: $\mathbb{G}_1$ (T2DM), $\mathbb{G}_2$ (Alzheimer's disease), and $\mathbb{G}_3$ (AD+T2DM). We design two probes: Probe 1 targets merged AD+T2DM knowledge, while Probe 2 targets the intersection of $\mathbb{G}_1$ and $\mathbb{G}_2$. Seven instruction-tuned LLMs are tested across retrieval sources {No-RAG, $\mathbb{G}_1$, $\mathbb{G}_2$, $\mathbb{G}_1$ + $\mathbb{G}_2$, $\mathbb{G}_3$, $\mathbb{G}_1$ + $\mathbb{G}_2$ + $\mathbb{G}_3$} and three decoding temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope-matched retrieval (notably $\mathbb{G}_2$) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG-RAG with a No-RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid-sized models benefit more from well-scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision-first, scope-matched KG-RAG is preferable to breadth-first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking. Code and data are available at https://github.com/sydneyanuyah/RAGComparison
[14] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering
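Anuj Maharjan, Umesh Yadav
Main category: cs.CL
TL;DR: This paper evaluates how well retrieval-augmented generation (RAG) architectures mitigate LLM hallucination in public health policy. Advanced RAG with cross-encoder re-ranking reaches a faithfulness of 0.797, ahead of Basic RAG (0.621) and the plain LLM baseline (0.347). The chunking strategy affects precision, and semantic chunking combined with two-stage retrieval is important for reliable policy question answering.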
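The two-stage pattern behind the Advanced RAG configuration, a cheap first-stage retriever followed by a more precise re-ranker, can be sketched in a few lines. This is an illustration under loud assumptions: both scoring functions are toy stand-ins (token overlap in place of embedding similarity, Jaccard overlap in place of a cross-encoder), and the function names, parameters, and example chunks are hypothetical rather than taken from the paper.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer, sufficient for this toy example."""
    return text.lower().split()


def first_stage_score(query, chunk):
    """Cheap, recall-oriented score: shared-token count.

    Stand-in for embedding similarity from a bi-encoder.
    """
    return len(set(tokenize(query)) & set(tokenize(chunk)))


def rerank_score(query, chunk):
    """Precision-oriented score: Jaccard overlap.

    Stand-in for a cross-encoder; it penalises long, unfocused chunks.
    """
    q, c = set(tokenize(query)), set(tokenize(chunk))
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def retrieve_then_rerank(query, chunks, k_first=2, k_final=1):
    """Stage 1 keeps the top `k_first` candidates; stage 2 reorders them."""
    candidates = sorted(
        chunks, key=lambda ch: first_stage_score(query, ch), reverse=True
    )[:k_first]
    return sorted(
        candidates, key=lambda ch: rerank_score(query, ch), reverse=True
    )[:k_final]


# Hypothetical document chunks (not from the CDC corpus).
chunks = [
    "adults vaccine schedule and many other unrelated topics about travel food water safety",
    "vaccine schedule guidance for adults",
    "school closure policy",
]
query = "vaccine schedule adults"
print(retrieve_then_rerank(query, chunks))
```

The first stage cannot separate the two chunks that share all three query tokens; the second stage prefers the focused one. Swapping the stand-ins for a real embedding model and cross-encoder keeps the same control flow.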
Details
Motivation: The use of LLMs in public health policy is limited by the risk of hallucination, and the field demands very high informational accuracy, so reliable techniques for ensuring factual consistency are needed.
Method: Using Mistral-7B-Instruct-v0.2 and the all-MiniLM-L6-v2 embedding model, the study compares three architectures: Vanilla LLM, Basic RAG, and Advanced RAG (with cross-encoder re-ranking). It tests two chunking strategies on a corpus of CDC policy documents (recursive character-based splitting versus token-based semantic splitting) and evaluates performance with faithfulness and relevance scores.
Result: Advanced RAG achieves the highest faithfulness (0.797), well above Basic RAG (0.621) and the Vanilla LLM (0.347). Semantic chunking and two-stage retrieval improve accuracy, but structural constraints in document segmentation still hinder multi-step reasoning.
Conclusion: Advanced RAG is an important route to more reliable LLMs for high-stakes policy question answering, but document segmentation and reasoning structure need further work.
Abstract: The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
[15] Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts
Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty
Main category: cs.CL
TL;DR: This paper evaluates 13 open-source LLMs on pairwise causal discovery (PCD) from text, covering two tasks, causal detection and causal extraction. Current models perform poorly overall, especially on implicit, cross-sentence, or multi-pair causal relations. The authors build a unified evaluation framework on data with high inter-annotator agreement and release all data, code, and prompt templates.
Details
Motivation: Safe deployment of LLMs in high-stakes fields such as biomedicine requires causal reasoning ability, but the basic causal discovery ability of LLMs has not been systematically evaluated.
Method: The authors build a PCD benchmark from 12 diverse datasets and define and evaluate two skills, Causal Detection and Causal Extraction. They test several prompting methods, including zero-shot, Chain-of-Thought (CoT), and Few-shot In-Context Learning (FICL). The data is validated with high inter-annotator agreement (κ ≥ 0.758).
Result: The best model reaches only 49.57% on causal detection and 47.12% on causal extraction. Models do reasonably well on explicit single-sentence relations, but performance drops sharply in realistic, harder cases such as implicit relations, links spanning multiple sentences, and texts with multiple causal pairs.
Conclusion: Existing open-source LLMs are seriously deficient at basic causal discovery from text, and more robust causal reasoning models and evaluation methods are needed. The work provides a reproducible, open-source evaluation framework as a foundation for further research.
Abstract: The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying if a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. Code available here: https://github.com/sydneyanuyah/CausalDiscovery
[16] Multi-Persona Thinking for Bias Mitigation in Large Language Models
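Yuxing Chen, Guoqing Luo, Zijun Wu, Lili Mou
Main category: cs.CL
TL;DR: This paper proposes the Multi-Persona Thinking (MPT) framework, which mitigates social bias in LLMs by introducing dialectical reasoning from multiple perspectives (such as male, female, and neutral viewpoints) at inference time. It outperforms existing prompting methods without harming the model's core reasoning ability.
Details
Motivation: LLMs exhibit significant social biases that can reinforce stereotypes and unfair outcomes, so an effective, lightweight debiasing method usable at inference time is needed.
Method: MPT guides the model at inference time to adopt several contrasting social identities (e.g., male and female) together with a neutral viewpoint, then exposes and corrects biases through iterative dialectical reasoning, turning the potential weakness of persona assignment into a strength for debiasing.
Result: On two widely used bias benchmarks, across open-source and closed-source models of varying scales, MPT substantially reduces bias while preserving reasoning ability, and outperforms existing prompting-based debiasing strategies.
Conclusion: MPT is an efficient, general, inference-time debiasing framework that needs no fine-tuning, showing that multi-perspective dialectical reasoning is a promising way to mitigate social bias in LLMs.
Abstract: Large Language Models (LLMs) exhibit significant social biases that can perpetuate harmful stereotypes and unfair outcomes. In this paper, we propose Multi-Persona Thinking (MPT), a novel inference-time framework that leverages dialectical reasoning from multiple perspectives to reduce bias. MPT guides models to adopt contrasting social identities (e.g., male and female) along with a neutral viewpoint, and then engages these personas iteratively to expose and correct biases. Through a dialectical reasoning process, the framework transforms the potential weakness of persona assignment into a strength for bias mitigation. We evaluate MPT on two widely used bias benchmarks across both open-source and closed-source models of varying scales. Our results demonstrate substantial improvements over existing prompting-based strategies: MPT achieves the lowest bias while maintaining core reasoning ability.
[17] ViT Registers and Fractal ViT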
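Jason Chuan-Chih Chou, Abhinav Kumar, Shivank Garg
Main category: cs.CL
TL;DR: This paper proposes fractal ViT, a ViT variant that breaks permutation invariance among tokens by applying an attention mask between regular tokens and "summary tokens", and tests it in combination with various positional encodings. The approach does not outperform ViT with registers, suggesting that the underlying findings may be specific to scale, domain, or task.
Details
Motivation: The work is inspired by two recent findings: transformers without positional encoding (NoPE) perform reasonably well in language modeling, and registers (additional throwaway tokens not tied to the input) can improve large vision transformers (ViTs). It explores how to break token permutation invariance effectively in ViTs.
Method: The authors design fractal ViT, which introduces register-like summary tokens and applies an attention mask between them and the regular tokens, and run experiments with this design alone or combined with various positional encodings.
Result: Fractal ViT does not improve on the existing ViT with registers, indicating that the new strategy for breaking permutation invariance has limited benefit.
Conclusion: The effectiveness of NoPE and registers may depend on model scale, domain, or application, and does not transfer straightforwardly to every architecture or task.
Abstract: Drawing inspiration from recent findings including surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and how registers (additional throwaway tokens not tied to input) may improve the performance of large vision transformers (ViTs), we invent and test a variant of ViT called fractal ViT that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and "summary tokens" similar to registers, in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, highlighting the fact that these findings may be scale, domain, or application-specific.
[18] Computational Representations of Character Significance in Novels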
Haaris Mian,Melanie Subbiah,Sharon Marcus,Nora Shaalan,Kathleen McKeown
Main category: cs.CL
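TL;DR: This paper proposes a six-component structural model of character grounded in a new literary theory, emphasizing the narrator-character distinction and a component neglected by prior methods: discussion by other characters. Using general-purpose LLMs and task-specific transformer models on 19th-century British realist novels, it produces component-level and graph representations of character discussion, enabling large-scale computational study of literary questions such as Woloch's "the one vs the many" theory of character centrality and the gendered dynamics of character discussion.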
Details
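Motivation: Prior character modeling for novels focuses heavily on the main character and on characters' presence in scenes, neglecting important dimensions such as the narrator-character distinction and discussion by other characters; this work introduces a more comprehensive, literary-theory-driven framework for character modeling. Method: Based on a six-component structural model of character, general-purpose LLMs and task-specific transformer models are used to extract component-level features and graph representations of character discussion from 19th-century British realist novels. Result: The approach produces interpretable component-level and graph representations of character discussion, and these representations support large-scale computational analysis of Woloch's "the one vs the many" theory and of gendered patterns of discussion. Conclusion: The six-component model broadens the theoretical basis and technical toolkit for character modeling in computational literary studies, showing that combining literary theory with modern NLP models enables scalable, interpretable computational research on humanities questions. Abstract: Characters in novels have typically been modeled based on their presence in scenes in narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods, discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens. Specifically, we explore Woloch's classic "the one vs the many" theory of character centrality and the gendered dynamics of character discussion.
[19] AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains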
Adam Szelestey,Sofie van Engelen,Tianhao Huang,Justin Snelders,Qintao Zeng,Songgaojun Deng
Main category: cs.CL
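TL;DR: This paper introduces the AdversaRiskQA benchmark, the first to systematically evaluate the factual robustness of large language models against deliberately injected, confidently framed misinformation in health, finance, and law, and designs automated methods for measuring attack success and long-form factuality.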
Details
Motivation: Existing work lacks high-quality, domain-specific resources for assessing LLM robustness under adversarial misinformation, and has not examined how injected misinformation affects long-form factuality. Method: Builds the AdversaRiskQA benchmark (covering health, finance, and law, with two difficulty levels) and proposes two automated evaluation methods, one for adversarial attack success and one for long-form factuality; measures misinformation detection rates on six open- and closed-source models from the Qwen, GPT-OSS, and GPT families, and runs a comparative long-form factuality evaluation on Qwen3 (30B). Result: After excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 is the most consistent; performance scales non-linearly with parameter count, differs clearly across domains, and the gap between difficulty levels narrows as models grow; the long-form evaluation shows no significant correlation between injected misinformation and the factuality of the output. Conclusion: AdversaRiskQA provides a reliable benchmark for identifying factuality weaknesses of LLMs in high-risk settings, supporting the development and deployment of more trustworthy models. Abstract: Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating the adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domain, and gaps between difficulty levels narrow as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model's factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.
[20] Common to Whom? Regional Cultural Commonsense and LLM Bias in India
Sangmitra Madhusudan,Trush Shashank More,Steph Buongiorno,Renata Dividino,Jad Kabbara,Ali Emami
Main category: cs.CL
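TL;DR: This paper introduces Indica, the first benchmark evaluating LLMs' understanding of sub-national cultural commonsense in India. Existing models achieve very low accuracy on region-specific questions (13.4%-20.9%) and show marked geographic bias (favoring Central and North India), revealing that cultural commonsense is strongly regional rather than nationally uniform.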
Details
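Motivation: Existing cultural commonsense benchmarks treat nations as homogeneous wholes and overlook sub-national cultural differences (e.g., region, language, customs); this work tests whether cultural commonsense varies significantly by region within a nation and evaluates how well LLMs capture that variation. Method: Builds Indica, the first cultural commonsense benchmark focused on the sub-national level: it covers five geographic regions of India, 8 domains of everyday life, and 515 questions, with 1,630 human-annotated region-specific question-answer pairs; designs a question scheme grounded in anthropological taxonomy; systematically evaluates eight state-of-the-art LLMs on region-specific answer accuracy and geographic selection bias. Result: Only 39.4% of questions receive consistent answers across all five regions, confirming that cultural commonsense in India is highly regional; LLMs reach only 13.4%-20.9% accuracy on region-specific questions; geographic bias is widespread, with Central and North India over-selected (30-40% more often than expected) and East and West under-represented. Conclusion: Cultural commonsense is fundamentally a sub-national phenomenon and cannot simply be delimited by national borders; current LLMs fall well short in modeling regional cultural diversity, which calls for regional awareness in data construction, training objectives, and evaluation; the work offers a transferable commonsense evaluation framework for culturally heterogeneous nations worldwide. Abstract: Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test LLMs' ability to address this question, focusing on India - a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps: models achieve only 13.4%-20.9% accuracy on region-specific questions, and they exhibit geographic bias, over-selecting Central and North India as the "default" (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, from question design grounded in anthropological taxonomy, to regional data collection, to bias measurement.
[21] From Generation to Collaboration: Using LLMs to Edit for Empathy in Healthcare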
Man Luo,Bahareh Harandizadeh,Amara Tariq,Halim Abbas,Umar Ghaffar,Christopher J Warren,Segun O. Kolade,Haidar M. Abdul-Muhsin
Main category: cs.CL
TL;DR: This paper proposes using LLMs as 'empathy editors' to refine physicians' written responses—boosting empathy without sacrificing medical accuracy—and introduces new metrics (Empathy Ranking Score, MedFactChecking Score) to evaluate both emotional and factual quality.
Details
Motivation: Physicians struggle to balance empathy and factual precision under cognitive and emotional constraints; there is a need for AI tools that enhance empathetic communication without compromising medical accuracy. Method: The study uses LLMs as editors of physician-written responses rather than as autonomous generators, and develops two novel quantitative metrics, the Empathy Ranking Score and the MedFactChecking Score, to assess emotional tone and factual fidelity. Result: LLM-edited responses significantly increase perceived empathy while maintaining high factual accuracy, outperforming fully LLM-generated outputs on both empathy and medical fact-checking metrics. Conclusion: Using LLMs as editorial assistants, not standalone generators, provides a safer and more effective approach to integrating AI into empathetic, trustworthy clinical communication. Abstract: Clinical empathy is essential for patient care, but physicians must continually balance emotional warmth with factual precision under the cognitive and emotional constraints of clinical practice. This study investigates how large language models (LLMs) can function as empathy editors, refining physicians' written responses to enhance empathetic tone while preserving underlying medical information. More importantly, we introduce novel quantitative metrics, an Empathy Ranking Score and a MedFactChecking Score, to systematically assess both the emotional and factual quality of the responses. Experimental results show that LLM-edited responses significantly increase perceived empathy while preserving factual accuracy compared with fully LLM-generated outputs. These findings suggest that using LLMs as editorial assistants, rather than autonomous generators, offers a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare communication.
[22] YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models
Junyu Lin,Meizhen Liu,Xiufeng Huang,Jinfeng Li,Haiwen Hong,Xiaohan Yuan,Yuefeng Chen,Longtao Huang,Hui Xue,Ranjie Duan,Zhikai Chen,Yuchuan Fu,Defeng Li,Lingyao Gao,Yitong Yang
Main category: cs.CL
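TL;DR: YuFeng-XGuard is a reasoning-centric family of guardrail models that delivers fine-grained, interpretable, and configurable LLM safety protection through structured risk prediction, natural-language explanations, and a tiered inference mechanism.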
Details
Motivation: Existing guardrail methods mostly rely on coarse-grained filtering, rapid classification, or post-hoc rules, which leads to low transparency, rigid policies, or high inference cost, and falls short of the fine-grained, interpretable, and adaptable risk assessment needed in real deployments. Method: Proposes the YuFeng-XGuard model family: a reasoning-centric design that outputs risk categories with confidence scores plus natural-language explanations; a tiered inference paradigm (a fast decision from the first token, with explanation generated on demand); and a dynamic policy mechanism that decouples risk perception from policy enforcement. Result: Achieves state-of-the-art performance on multiple public safety benchmarks while balancing efficiency and efficacy; a full-capacity and a lightweight model are released openly to support diverse deployment scenarios. Conclusion: YuFeng-XGuard offers a more transparent, flexible, and efficient paradigm for LLM guardrails, moving safety mechanisms from "black-box filtering" toward "understandable, adjustable reasoning-based protection". Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
[23] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow
Yangyang Zhong,Yanmei Gu,Zhengqing Zang,Xiaomeng Li,Yuqi Ding,Xibei Jia,Yuting Shen,Zhenzhong Lan,Liwang Zhu,Weiping Liu,Junlin Zhou,Haisheng Liu,Zhong Xin Yu,Pengxin Luo,Donglian Qi,Yunfeng Yan,Junbo Zhao
Main category: cs.CL
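TL;DR: This paper studies the parallel generation capability and decoding-order behavior of masked diffusion language models (MDLMs). They still lag behind autoregressive models on most tasks but exhibit task-adaptive decoding behavior, and the paper proposes a "Generate-then-Edit" paradigm to balance efficiency with dependency modeling.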
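The abstract for this entry names Kendall's tau as its measure of generation order. As a rough illustration only, the sketch below computes a plain tau-a between token positions and the decoding step at which each position is finalized. The `finalize_step` lists, the function name, and the choice of tau-a (tied pairs count as neither concordant nor discordant) are assumptions made for this example; the digest does not state how the paper computes the statistic.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie (e.g. two positions finalized in the same
        # parallel step); tau-a leaves such pairs out of the numerator.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


positions = [0, 1, 2, 3]  # left-to-right token positions

# Hypothetical finalization steps: finalize_step[p] is the decoding step at
# which position p received its final token.
left_to_right = [0, 1, 2, 3]
right_to_left = [3, 2, 1, 0]
mostly_ordered = [0, 2, 1, 3]

for name, steps in [("left-to-right", left_to_right),
                    ("right-to-left", right_to_left),
                    ("mostly ordered", mostly_ordered)]:
    print(name, round(kendall_tau_a(positions, steps), 3))
```

Under this reading, a value near 1 indicates autoregressive-like left-to-right decoding, a value near -1 indicates reversed order, and values near 0 indicate an order largely unrelated to position.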
Details
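Motivation: To examine whether current masked diffusion language models (MDLMs) actually deliver the parallel generation and arbitrary-order decoding they promise, and to understand their real behavior and performance bottlenecks. Method: Introduces two quantitative measures, Average Finalization Parallelism (AFP) and Kendall's tau, to characterize the parallelism strength and generation order of MDLMs; evaluates eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming; combines empirical analysis with theoretical motivation to propose the Generate-then-Edit paradigm. Result: MDLMs still underperform comparably sized autoregressive models overall, mainly because parallel probabilistic modeling weakens inter-token dependencies; their parallelism and generation order vary with task domain, reasoning stage, and output correctness; on tasks requiring "backward information" (e.g., Sudoku) they tend to fill the easier blanks first, which highlights a distinctive advantage. Conclusion: MDLMs have not yet realized their theoretical potential but do show task-aware adaptive decoding; the Generate-then-Edit paradigm can mitigate dependency loss while retaining the efficiency of parallel decoding, pointing to a new design direction. Abstract: Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions -- parallelism strength and generation order -- using Average Finalization Parallelism (AFP) and Kendall's tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require "backward information" (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.
[24] ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms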
Baktash Ansari,Shiza Ali,Elias Martin,Maryna Sivachenko,Afra Mashhadi
Main category: cs.CL
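TL;DR: This paper proposes ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine-learning classifiers, reaching up to 80% accuracy for toxicity detection on Twitch, a 13% improvement over BERT.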
Details
Motivation: Chat on live-streaming platforms such as Twitch is fast-paced, high-volume, context-rich, and full of emotes; traditional human annotation and keyword filtering are hard to scale, and human moderators are themselves exposed to harassment. Method: Builds the ToxiTwitch hybrid model: LLMs such as DeepSeek-R1-Distill and Llama-3-8B-Instruct produce joint embeddings of text and emotes, which are fed to traditional classifiers such as Random Forest and SVM; training and evaluation are channel-specific. Result: Under channel-specific training, ToxiTwitch reaches up to 80% accuracy (a 13% improvement over BERT) and an F1-score of 76%; the experiments indicate that incorporating emote information improves toxicity detection. Conclusion: The work is an exploratory study that surfaces the potential and limits of emote-aware toxicity detection on Twitch and supports the feasibility of combining text and emote semantics with lightweight classifiers. Abstract: The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.
[25] Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation
Zhiyao Ren,Yibing Zhan,Siyuan Liang,Guozheng Ma,Baosheng Yu,Dacheng Tao
Main category: cs.CL
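TL;DR: This paper proposes the first benchmark for evaluating LLM confidence in multi-turn medical consultations, shows that medical data amplifies the limitations of existing confidence-estimation methods, and introduces MedConf, an evidence-grounded self-assessment framework that improves the reliability and interpretability of confidence modeling with respect to both diagnostic accuracy and information completeness.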
Details
Motivation: Existing studies mostly evaluate LLM clinical confidence in single-turn, static settings, overlooking how the coupling between confidence and correctness changes as evidence accumulates during a real consultation, which limits support for reliable clinical decision-making. Method: Builds the first confidence evaluation benchmark for realistic multi-turn medical consultations, unifying three types of medical data and introducing an information sufficiency gradient; proposes the MedConf framework, which constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them by weighted integration into an interpretable confidence estimate. Result: In a comparison of 27 methods, MedConf consistently outperforms state-of-the-art methods on AUROC and Pearson correlation coefficient, and remains stable under insufficient information and multimorbidity. Conclusion: Information sufficiency is a key determinant of credible medical confidence modeling, and MedConf offers a new path toward more reliable and interpretable large medical models. Abstract: Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.
[26] What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking
Raymond Xiong,Furong Jia,Lionel Wong,Monica Agrawal
Main category: cs.CL
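TL;DR: This paper builds a medical question-answering dataset from real patient questions and shows that current LLMs have notable deficiencies in identifying the incorrect assumptions and dangerous intentions implicit in those questions.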
Details
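Motivation: Existing LLM evaluations in the medical domain rely mostly on medical exam questions, which differ substantially in style and content from the everyday questions patients actually ask, so benchmarks close to real usage are lacking. Method: Uses Google's "People Also Ask" feature to collect real user questions about the top 200 prescribed medications in the United States, builds a patient-facing medical question dataset, and analyzes the distribution and causes of incorrect assumptions and dangerous intentions in it. Result: A considerable portion of patient questions contains incorrect assumptions and dangerous intentions, and their occurrence is not random but depends heavily on the degree of incorrectness in the preceding questions; mainstream LLMs that perform strongly on standard benchmarks struggle to identify incorrect assumptions in these everyday questions. Conclusion: Existing LLM benchmarks do not reflect model safety and reliability in real patient interactions, and evaluation datasets closer to real-world usage are needed. Abstract: Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life. To bridge this gap, we sourced data from Google's People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in everyday questions.
[27] Persona Switch: Mixing Distinct Perspectives in Decoding Time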
Junseok Kim,Nakyeong Yang,Kyomin Jung
Main category: cs.CL
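TL;DR: This paper proposes Persona Switch, a new decoding method that, at each step, dynamically selects whichever of the zero-shot and role-play prompting outputs has higher confidence, combining the strengths of both and improving the zero-shot reasoning of large language models by up to 5.13% accuracy.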
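The per-step selection described for Persona Switch can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract says only that confidence is "measured by the logit gap", so defining the gap as the difference between the top-1 and top-2 logits, the function names, and the toy logit vectors are all hypothetical, and the sketch selects a single next token where the paper's notion of a "step" may be coarser.

```python
from typing import Sequence, Tuple


def logit_gap(logits: Sequence[float]) -> float:
    """Assumed confidence score: largest logit minus second-largest logit."""
    if len(logits) < 2:
        raise ValueError("need at least two logits to compute a gap")
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def pick_step(zero_shot_logits: Sequence[float],
              role_play_logits: Sequence[float]) -> Tuple[str, int]:
    """Return (chosen prompt, token id) for one decoding step.

    Both logit vectors are assumed to come from the same model and vocabulary,
    conditioned on the zero-shot prompt and the role-play prompt respectively.
    """
    if logit_gap(zero_shot_logits) >= logit_gap(role_play_logits):
        source, logits = "zero-shot", zero_shot_logits
    else:
        source, logits = "role-play", role_play_logits
    # Greedy choice from the more confident distribution.
    token_id = max(range(len(logits)), key=lambda i: logits[i])
    return source, token_id


# Toy example: the zero-shot branch is torn between tokens 0 and 1,
# while the role-play branch clearly prefers token 1.
zero_shot = [2.0, 1.9, 0.1]
role_play = [0.5, 3.0, 0.2]
print(pick_step(zero_shot, role_play))
```

In a real decoder the selected output would be appended to the context of both branches before the next step, so that the two prompts keep sharing one generated prefix.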
Details
Motivation: Role-play prompting can improve the zero-shot reasoning of language models, but its gains are inconsistent; zero-shot and role-play prompting may therefore be complementary rather than one being universally superior. Method: Proposes the Persona Switch decoding method: at each generation step, both the zero-shot prompt and the role-play prompt are run, output confidence is measured by the logit gap, and the more confident output is selected. Result: Experiments on several widely used LLMs show that Persona Switch consistently outperforms strong baselines, with up to 5.13% accuracy improvement, and that output confidence (the logit gap) is a reliable selection signal. Conclusion: Zero-shot and role-play prompting are complementary; dynamically combining them yields stable gains in zero-shot reasoning, and confidence can serve as an effective guiding signal. Abstract: Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely-used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
[28] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He,Zongwei Lyu,Yi R Fung
Main category: cs.CL
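TL;DR: This paper proposes RebuttalAgent, the first academic rebuttal generation framework grounded in Theory of Mind (ToM). Its TSR pipeline models the reviewer's mental state, formulates a persuasion strategy, and generates a strategy-grounded response. The authors build a large-scale synthetic dataset, RebuttalBench, and a dedicated evaluator, Rebuttal-RM, and use two-stage training (supervised fine-tuning followed by self-reward reinforcement learning), significantly outperforming the base model and proprietary models.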
Details
Motivation: Current AI performs poorly at academic rebuttal because rebuttal is strategic communication under information asymmetry rather than a simple technical debate; existing methods imitate only surface-level language and lack the essential capacity for perspective-taking. Method: Proposes the three-stage ToM-Strategy-Response (TSR) framework; builds the RebuttalBench dataset via a critique-and-refine approach; trains in two stages, first supervised fine-tuning to acquire ToM-based analysis and strategic planning, then self-reward reinforcement learning for scalable self-improvement; develops a dedicated evaluator, Rebuttal-RM, trained on over 100K samples of multi-source rebuttal data. Result: RebuttalAgent outperforms the base model by an average of 18.3% on automated metrics and also outperforms advanced proprietary models in both automated and human evaluations; Rebuttal-RM's scoring consistency with human preferences surpasses that of GPT-4.1. Conclusion: Systematically bringing Theory of Mind into academic rebuttal is feasible and effective, and RebuttalAgent sets a new paradigm for AI-assisted scientific communication, while its output is intended only as a reference and does not replace the authors' own critical response. Abstract: Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author's own critical analysis and response.
[29] Hallucination Mitigating for Medical Report Generation
Ruoqing Zhao,Runze Xia,Piji Li
Main category: cs.CL
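TL;DR: This paper proposes the KERM framework, which uses knowledge enhancement and fine-grained reinforced rewards to reduce hallucinations of large vision language models in medical report generation and improve report quality.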
Details
Motivation: Large vision language models (LVLMs) are prone to hallucination in medical report generation, producing plausible but inaccurate content, which is especially dangerous in medicine. Method: First, MedCLIP is used for knowledge retrieval, extracting relevant lesion facts from a knowledge corpus; next, a purification module ensures the retrieved knowledge is relevant to the patient's clinical context; finally, fine-grained reinforced rewards guide the model to generate highly supportive, clinically relevant descriptions. Result: Experiments on the IU-Xray and MIMIC-CXR datasets show that the method mitigates hallucinations and improves medical report quality. Conclusion: Through knowledge enhancement and fine-grained reinforcement learning, the KERM framework improves the accuracy and reliability of LVLMs on medical report generation. Abstract: In the realm of medical report generation (MRG), the integration of natural language processing has emerged as a vital tool to alleviate the workload of radiologists. Despite the impressive capabilities demonstrated by large vision language models (LVLMs) in understanding natural language, their susceptibility to generating plausible yet inaccurate claims, known as "hallucinations", raises concerns, especially in the nuanced and critical field of medicine. In this work, we introduce a framework, Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM), to tackle the issue. Our approach refines the input to the LVLM by first utilizing MedCLIP for knowledge retrieval, incorporating relevant lesion fact sentences from a curated knowledge corpus. We then introduce a novel purification module to ensure the retrieved knowledge is contextually relevant to the patient's clinical context. Subsequently, we employ fine-grained rewards to guide these models in generating highly supportive and clinically relevant descriptions, ensuring the alignment of the model's outputs with desired behaviors. Experimental results on IU-Xray and MIMIC-CXR datasets validate the effectiveness of our approach in mitigating hallucinations and enhancing report quality.
[30] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs
Tristan Williams,Franziska Weeber,Sebastian Padó,Alan Akbik
Main category: cs.CL
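TL;DR: This paper proposes a new framework for evaluating the representativeness of aligned large language models that looks not only at marginal response distributions but also at multivariate correlation patterns. It finds that existing methods (persona prompting and demographic fine-tuning) fit marginal distributions reasonably well yet fail to capture the deeper correlation structure found in human values surveys.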
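The gap between marginal fit and correlation structure can be seen in a toy example. The sketch below is illustrative only: the two binary survey items, the four respondents per group, and the helper names are invented for this example and are not taken from the paper or from the World Values Survey.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Responses = List[Tuple[int, int]]  # one (item_0, item_1) answer pair per respondent


def marginal(responses: Responses, item: int) -> float:
    """Share of respondents answering 1 on the given item."""
    return mean(r[item] for r in responses)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Population Pearson correlation between two answer columns."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))


def item_correlation(responses: Responses) -> float:
    return pearson([r[0] for r in responses], [r[1] for r in responses])


# Hypothetical data: humans answer the two items identically, while the
# simulated "model population" answers them independently of each other.
human: Responses = [(1, 1), (1, 1), (0, 0), (0, 0)]
model: Responses = [(1, 0), (1, 1), (0, 1), (0, 0)]

for name, data in [("human", human), ("model", model)]:
    print(name,
          "marginals:", marginal(data, 0), marginal(data, 1),
          "correlation:", round(item_correlation(data), 3))
```

Both groups agree with each item at the same rate, so an evaluation based on marginals alone would score the model population as a perfect match, while the item-to-item correlation differs completely.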
Details
Motivation: Existing value-alignment work focuses mainly on aligning marginal response distributions, overlooking the latent structure present in real populations and the deeper association patterns on which cultural values theories rely. Method: Proposes a representativeness evaluation framework that combines marginal distributions with multivariate correlation patterns and, using World Values Survey (WVS) data as the gold standard, compares two steering techniques: persona prompting and demographic fine-tuning. Result: Demographic fine-tuning approximates marginal distributions better than persona prompting, but neither fully reproduces the correlation structure of human responses; relying on marginal evaluation alone masks structural failures and leads to overly optimistic conclusions about model capabilities. Conclusion: Representativeness is a distinct and important dimension of value alignment; evaluating marginals alone is not enough to ensure that a model reflects the structure of human values, so structural measures such as multivariate correlations are needed. Abstract: Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We show the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model better approximates marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.
[31] HumanLLM: Towards Personalized Understanding and Simulation of Human Nature
Yuxuan Lei,Tianfu Wang,Jianxun Lian,Zhengyu Hu,Defu Lian,Xing Xie
Main category: cs.CL
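TL;DR: This paper proposes HumanLLM, a foundation model designed for personalized understanding and simulation of individual behavior. By building a "Cognitive Genome Dataset" from platforms such as Reddit and Twitter (5.5 million user logs) and combining multi-stage data processing with supervised fine-tuning, it improves the prediction and simulation of user behavior, thinking, and writing style, and generalizes better on out-of-domain social intelligence benchmarks.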
Details
Motivation: Existing large language models are limited in simulating human behavior because their pretraining data lacks continuous, situated modeling of individuals' decisions, thoughts, and behaviors, which makes them a weak basis for social science research and personalized applications. Method: Builds the large-scale Cognitive Genome Dataset (5.5M user logs from platforms such as Reddit and Twitter) through multi-stage filtering, synthesis, and quality control; formulates diverse learning tasks and performs supervised fine-tuning so the model can predict individualized behaviors, thoughts, and experiences. Result: HumanLLM outperforms base models in predicting user actions and inner thoughts, mimicking writing styles and preferences, and generating authentic user profiles, and generalizes better on out-of-domain social intelligence benchmarks. Conclusion: By introducing a situated, individualized training paradigm, HumanLLM addresses a fundamental weakness of general-purpose LLMs in simulating human behavior and offers a new foundation-model paradigm for social science research and personalized services. Abstract: Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior--a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.
[32] SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics
Silvia Casola,Ryan Soh-Eun Shim,Felicia Körner,Yuchen Mao,Barbara Plank
Main category: cs.CL
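TL;DR: This paper investigates whether steering the activations of multilingual neural evaluation metrics toward English as an internal pivot language improves their correlation with human judgments.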
Details
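Motivation: Multilingual language models are widely used for natural language generation tasks, but accurate and robust multilingual evaluation metrics are lacking; studies also suggest that these models often use English as an internal pivot language and that misalignment with this pivot degrades downstream performance, so the authors hypothesize that the same mismatch may affect multilingual neural metrics. Method: Experiments use encoder-based and decoder-based multilingual evaluation metrics and apply test-time interventions that steer model activations toward the English pivot. Result: The test-time interventions are effective across languages and metric types, increasing the correlation between the metrics and human judgments. Conclusion: Aligning the internal representations of multilingual neural metrics with the English pivot broadly improves their evaluation quality and offers a new approach to assessing multilingual generation. Abstract: An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
[33] ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection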
Guoxuan Ding,Yuqing Li,Ziyan Zhou,Zheng Lin,Daren Zha,Jiangnan Li
Main category: cs.CL
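TL;DR: This paper proposes ExDR, a framework that improves multimodal fake news detection through explanation-driven dynamic retrieval-augmented generation and consistently outperforms existing methods.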
Details
Motivation: Multimodal fake news spreads quickly and hinges on timely factual details, which challenges existing detection methods; dynamic retrieval-augmented generation is promising but still suffers from redundant retrieval, coarse similarity, and irrelevant evidence. Method: Proposes the ExDR framework, which systematically uses model-generated explanations in both the retrieval triggering and evidence retrieval modules; assesses triggering confidence from three complementary dimensions; builds entity-aware indices by fusing deceptive entities; and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and strengthen the final prediction. Result: On the AMG and MR2 benchmark datasets, ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance. Conclusion: ExDR improves the accuracy and generalization of multimodal fake news detection and demonstrates the value of explanation-driven strategies in dynamic retrieval-augmented generation. Abstract: The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
[34] Can professional translators identify machine-generated text?
Michael Farrell
Main category: cs.CL
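TL;DR: This study examines whether professional translators without specialized training can reliably identify AI-generated Italian short stories. Sixty-nine translators assessed three anonymized stories (two generated by ChatGPT-4o and one written by a human author). Only 16.2% distinguished the AI texts from the human text at a statistically significant level. Low burstiness and narrative contradiction were the most reliable markers of AI authorship, whereas grammatical accuracy and emotional tone often led to misclassification.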
Details
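Motivation: To examine how well professional translators can identify AI-generated text without specialized training, and thereby assess the detectability of current AI text in professional settings and the need for editing it.
Method: In an in-person experiment, 69 professional translators rated the likelihood of AI authorship for three anonymized short stories (two generated by ChatGPT-4o, one written by a human) and justified their choices. Qualitative analysis and statistical tests were combined to identify effective discriminating features.
Result: A subset of 16.2% of translators distinguished AI from human text at a statistically significant level. Low burstiness and narrative contradiction were the most reliable AI markers; calques, semantic loans, and syntactic transfer from English were also indicative. Grammatical accuracy and emotional tone often led to misclassification. A nearly equal share of translators misclassified in the opposite direction, possibly reflecting a reader preference for AI-generated texts.
Conclusion: AI-generated text carries identifiable linguistic features, but most professional translators cannot detect it reliably. This points to a need for training in synthetic-text editing and for re-examining quality control and ethical boundaries for AI text in translation, publishing, and other professional contexts.
Abstract: This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.
[35] Determinants of Training Corpus Size for Clinical Text Classification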
Jaya Chaturvedi,Saniya Deshpande,Chenkai Ma,Robert Cobb,Angus Roberts,Robert Stewart,Daniel Stahl,Diana Shamsutdinova
Main category: cs.CL
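TL;DR: This paper studies how training set size and vocabulary properties affect model performance in clinical text classification. It finds that 600 documents reach 95% of the performance obtained with 10,000 documents, and it quantifies how the numbers of strong and noisy predictive words affect learning curves and accuracy.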
Details
Motivation: Clinical NLP text classification typically relies on 200-500 annotated documents, but this sample size lacks justification and has not been related to the vocabulary properties of the text.
Method: Using MIMIC-III discharge notes, the authors combine pre-trained BERT embeddings with Random Forest classifiers on 10 ICD-9 diagnosis tasks, vary the training size from 100 to 10,000 documents, and use Lasso logistic regression on bag-of-words embeddings to identify strong and noisy predictive words.
Result: 600 documents reach 95% of the performance obtained with 10,000 documents. Every 100 additional strong predictors raise maximum accuracy by about 0.04; every 100 additional noisy words lower accuracy by about 0.02.
Conclusion: Training data requirements are closely tied to vocabulary properties (the balance of strong and noisy words), which can guide efficient annotation strategies and avoid expanding annotation effort blindly.
Abstract: Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. For that, 200-500 documents are typically annotated. The number is constrained by time and costs and lacks justification of the sample size requirements and their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient to achieve 95% of the performance attainable with 10,000 documents for all tasks. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves, where every 100 additional noisy words decreased accuracy by approximately 0.02 while 100 additional strong predictors increased maximum accuracy by approximately 0.04.
[36] Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers
Francisco Portillo López
Main category: cs.CL
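TL;DR: This study uses the McGurk effect to test the perceptual bio-fidelity of AV-HuBERT. The model closely matches humans in auditory dominance rate but is overly deterministic in its tendency toward phonetic fusion, lacking the stochasticity and diversity of human perception.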
Details
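Motivation: To assess whether AV-HuBERT shows human-like bio-fidelity in multimodal speech perception, particularly in how it responds to incongruent audiovisual stimuli.
Method: Using the McGurk effect as the paradigm, the study quantitatively compares AV-HuBERT's responses with behavioral data from 44 human participants, analyzing auditory dominance rate, phonetic fusion rate, and error patterns.
Result: AV-HuBERT and humans show nearly identical auditory dominance rates (32.0% vs. 31.8%), but the model's phonetic fusion rate is significantly higher (68.0% vs. 47.7%), and it lacks the perceptual stochasticity and diverse error profiles seen in humans.
Conclusion: Current self-supervised audiovisual models can reproduce multisensory integration outcomes but do not model the neural variability of human perception, so their perceptual behavior remains deterministic rather than biologically realistic.
Abstract: This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
[37] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model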
Chenghao Fan,Wen Heng,Bo Li,Sichen Liu,Yuxuan Song,Jing Su,Xiaoye Qu,Kai Shen,Wei Wei
Main category: cs.CL
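TL;DR: This paper proposes Stable-DiffCoder, a block diffusion code language model that outperforms its autoregressive baseline under the same architecture and data. Continual pretraining with a tailored noise schedule gives stable, efficient training and improves code editing, reasoning, and modeling of low-resource programming languages.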
Details
Motivation: Existing diffusion-based code language models (DLLMs) still lag behind strong autoregressive (AR) baselines under comparable budgets, so their training recipe and modeling capability need to be revisited.
Method: Stable-DiffCoder reuses the Seed-Coder architecture, data, and training pipeline. It adds a block diffusion continual pretraining (CPT) stage with a tailored warmup and a block-wise clipped noise schedule to improve knowledge learning efficiency and training stability.
Result: Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart. With only CPT and supervised fine-tuning, it is stronger than a wide range of ~8B AR models and DLLMs, and it performs better on code editing, reasoning, and low-resource programming languages.
Conclusion: Diffusion modeling, especially any-order and block-wise modeling, can match or exceed AR training and can strengthen structured code understanding and generalization.
Abstract: Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. In addition, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.
[38] Transfer Learning from ImageNet for MEG-Based Decoding of Imagined Speech
Soufiane Jhilal,Stéphanie Martin,Anne-Lise Giraud
Main category: cs.CL
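TL;DR: This paper proposes an image-based approach that converts magnetoencephalography (MEG) signals into time-frequency image representations and decodes imagined speech with ImageNet-pretrained vision models, achieving good performance in cross-subject evaluation.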
Details
Motivation: Non-invasive decoding of imagined speech is difficult because the neural signals are weak and distributed and labeled data are limited.
Method: MEG signals from 21 participants are projected into three spatial scalogram mixtures by a learnable sensor-space convolution, producing image-like inputs that are classified by ImageNet-pretrained vision models.
Result: Balanced accuracy reaches up to 90.4% for imagery vs. silence, 81.0% for imagery vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation indicates that the models capture shared neural representations, and temporal analyses localize discriminative information to imagery-locked intervals.
Conclusion: Pretrained vision models applied to image-based MEG representations can capture the structure of imagined speech in non-invasive neural signals.
Abstract: Non-invasive decoding of imagined speech remains challenging due to weak, distributed signals and limited labeled data. Our paper introduces an image-based approach that transforms magnetoencephalography (MEG) signals into time-frequency representations compatible with pretrained vision models. MEG data from 21 participants performing imagined speech tasks were projected into three spatial scalogram mixtures via a learnable sensor-space convolution, producing compact image-like inputs for ImageNet-pretrained vision architectures. These models outperformed classical and non-pretrained models, achieving up to 90.4% balanced accuracy for imagery vs. silence, 81.0% vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation confirmed that pretrained models capture shared neural representations, and temporal analyses localized discriminative information to imagery-locked intervals. These findings show that pretrained vision models applied to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals.
[39] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
Özgür Uğur,Mahmut Göksu,Mahmut Çimen,Musa Yılmaz,Esra Şavirdi,Alp Talha Demir,Rumeysa Güllüce,İclal Çetin,Ömer Can Sağbaş
Main category: cs.CL
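TL;DR: This paper presents the Mecellem framework, which uses domain adaptation to build language models for the Turkish legal domain: efficient encoders pre-trained from scratch (ModernBERT architecture) and decoders adapted through continual pre-training (CPT) of Qwen3 models. It reports gains in retrieval performance, computational efficiency, and legal-text modeling.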
Details
Motivation: State-of-the-art models rely on multi-stage, computationally intensive training pipelines, which makes it hard to combine efficiency with specialization. The Turkish legal domain lacks high-quality, lightweight, dedicated language models.
Method: (1) Encoder: ModernBERT-based bidirectional encoders are pre-trained from scratch on a Turkish-dominant corpus of 112.7 billion tokens, with checkpoints selected by downstream retrieval performance. (2) Decoder: Qwen3-1.7B and Qwen3-4B undergo four-phase continual pre-training (CPT) with controlled curriculum learning, gradually introducing legal terminology and long-context reasoning.
Result: The encoders rank in the top 3 on the Turkish retrieval leaderboard; the 155M-parameter model performs comparably to 307M-567M reference models; production efficiency is 92.36% (fourth overall). The decoders reduce perplexity on Turkish legal text by 36.2%.
Conclusion: Single-stage pre-training followed by efficient post-training for the encoder, together with curriculum-driven CPT for the decoder, forms a compute-efficient recipe for building Turkish legal-domain language models.
Abstract: This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
[40] Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
Tony Cristofano
Main category: cs.CL
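TL;DR: This paper proposes a method for transferring refusal interventions across models and presents evidence that refusal behavior in aligned large language models stems from a universal, low-dimensional semantic circuit.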
Details
Motivation: Refusal behavior in aligned LLMs is often viewed as model-specific, but the author hypothesizes that it stems from a universal, low-dimensional semantic circuit shared across models.
Method: The Trajectory Replay via Concept-Basis Reconstruction framework aligns layers via concept fingerprints, reconstructs refusal directions from shared "concept atoms", and adds a weight-SVD stability guard to avoid damaging model capabilities.
Result: Across 8 model pairs (including GPT-OSS-20B and GLM-4), the method consistently attenuates refusal without degrading performance.
Conclusion: The experiments support the semantic universality of safety alignment: refusal behavior is driven by a low-dimensional semantic circuit shared across models.
Abstract: Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
[41] Adapter Fusion for Multilingual Text2Cypher with Linear and Learned Gating
Makbule Gulcin Ozsoy
Main category: cs.CL
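TL;DR: This paper proposes a scalable multilingual Text2Cypher approach that trains language-specific LoRA adapters and combines them with a learned fusion MLP. New languages can be supported without re-running full fine-tuning, improving data efficiency and scalability.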
Details
Motivation: Existing Text2SQL, Text2SPARQL, and Text2Cypher systems mainly target English and lack efficient, scalable multilingual support. The goal is to avoid repeated full fine-tuning and manual hyper-parameter tuning while staying close to the performance of joint multilingual fine-tuning.
Method: LoRA adapters are trained separately for English, Spanish, and Turkish and combined either by uniform linear merging or by a learned fusion MLP with dynamic gating. Adding a language requires only one new LoRA adapter plus lightweight retraining of the MLP.
Result: The learned fusion MLP recovers around 75% of the accuracy gains of joint multilingual fine-tuning while using a smaller subset of the data, and it outperforms linear merging in all three languages.
Conclusion: Learned adapter fusion is a practical alternative for multilingual Text2Cypher that balances performance, data efficiency, and scalability.
Abstract: Large Language Models enable users to access databases through natural language interfaces with tools like Text2SQL, Text2SPARQL, and Text2Cypher, which translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
[42] synthocr-gen: A synthetic ocr dataset generator for low-resource languages- breaking the data barrier
Haq Nawaz Malik,Kh Mohmad Shafi,Tanveer Ahmad Reshi
Main category: cs.CL
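TL;DR: This paper presents SynthOCR-Gen, an open-source synthetic OCR dataset generator for low-resource languages. It converts Unicode text corpora into image data with realistic degradations to address the scarcity of annotated data. As a demonstration, the authors generate and publicly release a 600,000-sample Kashmiri dataset.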
Details
Motivation: Low-resource languages such as Kashmiri, written in a Perso-Arabic script, lack large-scale annotated OCR training data and are unsupported by major OCR systems, while manual annotation is expensive, slow, and error-prone.
Method: SynthOCR-Gen provides text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques that simulate real-world document degradations (rotation, blur, noise, scanner artifacts, and others).
Result: The authors generate and publicly release on HuggingFace a 600,000-sample word-segmented Kashmiri synthetic OCR dataset as a demonstration of the approach.
Conclusion: SynthOCR-Gen offers a scalable, low-cost way to generate synthetic OCR data for low-resource languages and helps extend vision-language AI models to underserved writing systems.
Abstract: Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word by word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace.
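As a rough illustration of the first two pipeline stages named in this abstract (text segmentation and Unicode normalization with script purity enforcement), the sketch below uses only the Python standard library. The function names, the choice of NFC normalization, and the Arabic code-point ranges are assumptions made for this example; they are not taken from SynthOCR-Gen itself.

```python
import unicodedata
from typing import List

# Assumed code-point ranges for Perso-Arabic text (Arabic and Arabic Supplement
# blocks); the real tool may define script purity differently.
ARABIC_RANGES = ((0x0600, 0x06FF), (0x0750, 0x077F))


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization (an assumed choice of form)."""
    return unicodedata.normalize("NFC", text)


def is_script_pure(word: str) -> bool:
    """Return True if every character falls inside the assumed Arabic ranges."""
    return bool(word) and all(
        any(lo <= ord(ch) <= hi for lo, hi in ARABIC_RANGES) for ch in word
    )


def segment_words(text: str) -> List[str]:
    """Word-level segmentation: split on whitespace, keep script-pure words."""
    return [w for w in normalize(text).split() if is_script_pure(w)]


def word_ngrams(words: List[str], n: int) -> List[str]:
    """N-gram-level segmentation: join each run of n consecutive words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Under these assumptions, a mixed-script input keeps only its Perso-Arabic words, and the resulting word list can be passed to `word_ngrams` to build longer samples for rendering.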
This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
[43] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging
Alphaeus Dmonte,Vidhi Gupta,Daniel J Perry,Mark Arehart
Main category: cs.CL
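TL;DR: This paper gives the first efficiency-focused analysis of merging multilingual multitask models, showing substantially lower training time and maintenance cost at comparable quality.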
Details
Motivation: Fine-tuning multilingual large language models is computationally inefficient and creates a maintenance bottleneck, and prior work has not systematically evaluated the efficiency benefits of model merging.
Method: A focused efficiency analysis of a multilingual multitask merging strategy across three independent tasks, validated on both public and proprietary industry datasets.
Result: Merging reduces initial training time by up to 50%. Updating an individual language and re-merging reduces maintenance training cost by more than 60%.
Conclusion: Multilingual model merging substantially improves computational and maintenance efficiency while maintaining quality, in both academic and industrial settings.
Abstract: Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.
[44] Automatic Classification of Arabic Literature into Historical Eras
Zainab Alhathloul,Irfan Ahmad
Main category: cs.CL
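TL;DR: This paper uses neural networks and deep learning to automatically classify Arabic texts by historical period, covering eras from the pre-Islamic to the modern. It reaches F1-scores of 0.83 and 0.79 on binary classification, but performance drops sharply on multi-class tasks (for example, 15 classes).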
Details
Motivation: Arabic has changed considerably over time, yet little research addresses automatic classification of Arabic texts by time period, particularly outside poetry.
Method: Neural networks and deep learning models are evaluated on datasets derived from two publicly available corpora, OpenITI and APCD, in setups ranging from binary to 15-class classification.
Result: Binary classification reaches F1-scores of 0.83 on OpenITI and 0.79 on APCD. F1 falls to 0.20 on the 15-era task (OpenITI) and 0.18 on the 12-era task (APCD).
Conclusion: Deep learning models perform well on coarse-grained (binary) dating of Arabic texts, but fine-grained multi-period classification remains challenging and calls for better feature representations and modeling.
Abstract: The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of others, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated using two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
[45] LLM-in-Sandbox Elicits General Agentic Intelligence
Daixuan Cheng,Shaohan Huang,Yuxian Gu,Huatong Song,Guoxin Chen,Li Dong,Wayne Xin Zhao,Ji-Rong Wen,Furu Wei
Main category: cs.CL
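TL;DR: This paper proposes LLM-in-Sandbox, which lets large language models explore inside a code sandbox to elicit general intelligence on non-code tasks. It generalizes across domains without additional training and can be improved further through reinforcement learning that uses only non-agentic data.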
Details
Motivation: Large language models lack active exploration and tool use on non-code tasks and struggle with scenarios that need external knowledge, long contexts, or format control, which motivates a new way to elicit their general intelligence.
Method: LLM-in-Sandbox runs the LLM inside a virtual code sandbox. Training-free experiments examine whether models spontaneously use the file system, access external resources, and execute scripts. LLM-in-Sandbox-RL then trains models for sandbox exploration with reinforcement learning on non-agentic data only.
Result: LLM-in-Sandbox generalizes robustly across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following, in both training-free and post-trained settings. The authors also analyze its computational and system efficiency and release it as an open-source Python package.
Conclusion: Placing LLMs in an interactive sandbox is a viable way to elicit general intelligence and improves their autonomy and adaptability as agents without relying on agentic training data.
Abstract: We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
cs.CV [Back]
[46] AI-Based Culvert-Sewer Inspection
Christina Thrainer
Main category: cs.CV
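TL;DR: This thesis addresses automated defect segmentation in culverts and sewer pipes and proposes three ways to cope with scarce annotated data: improved data preprocessing strategies, a new lightweight network called FORTRESS, and few-shot semantic segmentation with a bidirectional prototypical network. It reports gains in both performance and efficiency.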
Details
Motivation: Annotated data for culvert and sewer pipe defect detection is difficult and costly to obtain and requires domain knowledge, so large annotated datasets are not feasible and efficient segmentation methods for small-data settings are needed.
Method: Three methods are proposed: (1) preprocessing strategies that combine traditional data augmentation with dynamic label injection; (2) FORTRESS, a lightweight architecture combining depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention; (3) few-shot semantic segmentation with an attention-equipped bidirectional prototypical network.
Result: On the culvert and sewer pipe defect dataset, the preprocessing strategies significantly improve IoU and F1 score; FORTRESS reaches state-of-the-art performance while significantly reducing trainable parameters and computational cost; the few-shot method achieves satisfactory results across evaluation metrics with limited annotations.
Conclusion: The thesis eases the data scarcity bottleneck in defect segmentation through data augmentation, lightweight model design, and few-shot learning, improving practicality, generalization, and deployability for engineering use.
Abstract: Culverts and sewer pipes are critical components of drainage systems, and their failure can lead to serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge. Having a large dataset for structural defect detection is therefore not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real-world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity. Data scarcity can be addressed either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms. FORTRESS achieves state-of-the-art performance on the culvert sewer pipe defect dataset, while significantly reducing the number of trainable parameters, as well as its computational cost. Finally, we investigate few-shot semantic segmentation and its applicability to defect detection. Few-shot learning aims to train models with only limited data available.
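For readers unfamiliar with the two segmentation scores named in this abstract, the following minimal sketch computes Intersection over Union (IoU) and F1 for a single pair of binary masks. The function name and the flattened-list mask format are hypothetical choices for illustration; the thesis's own evaluation protocol (for example, per-class averaging) is not described here.

```python
from typing import Sequence, Tuple


def iou_and_f1(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (IoU, F1) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Count true positives, false positives, and false negatives pixel by pixel.
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    iou = tp / union                    # |A ∩ B| / |A ∪ B|
    f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
    return iou, f1
```

For example, with `pred = [1, 1, 0, 0]` and `target = [1, 0, 1, 0]` there is one true positive, one false positive, and one false negative, so IoU is 1/3 and F1 is 0.5. F1 is never lower than IoU on the same masks.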
By employing a bidirectional prototypical network with attention mechanisms, the model learns richer feature representations and achieves satisfactory results across evaluation metrics.
[47] Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition
Hatef Otroshi Shahreza,Anjith George,Sébastien Marcel
Main category: cs.CV
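TL;DR: This paper systematically evaluates multimodal large language models (MLLMs) on heterogeneous face recognition (HFR) across VIS-NIR, VIS-SWIR, and VIS-THERMAL scenarios. Their performance remains well below that of classical methods, which highlights the limits of current MLLMs for biometric applications.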
Details
Motivation: To explore the suitability and potential of multimodal large language models (MLLMs) for heterogeneous face recognition (HFR), a challenging biometric task.
Method: Multiple open-source MLLMs are benchmarked across several cross-modality face matching scenarios (VIS-NIR, VIS-SWIR, VIS-THERMAL) using standard biometric protocols and metrics (Acquire Rate, EER, TAR).
Result: MLLMs perform far worse than classical face recognition systems on cross-spectral HFR, with the gap most pronounced under difficult conditions. Current MLLMs are not reliable enough for direct deployment in practical biometric systems.
Conclusion: Current MLLMs have clear limitations for heterogeneous face recognition. They need rigorous, biometrics-oriented evaluation and targeted improvement and cannot directly replace classical methods.
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images are from different sensing modalities, including visual (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and based on different metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, in spite of recent advances in MLLMs. Our findings highlight the limitations of current MLLMs for HFR and also the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
[48] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Pablo Messina,Andrés Villa,Juan León Alcázar,Karen Sánchez,Carlos Hinojosa,Denis Parra,Álvaro Soto,Bernard Ghanem
Main category: cs.CV
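TL;DR: CURE is an error-aware curriculum learning framework that needs no additional data. It dynamically adjusts its sampling strategy to improve the visual grounding accuracy and factual consistency of medical vision-language models for radiology report generation.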
Details
Motivation: Medical vision-language models for radiology report generation suffer from inaccurate visual grounding and factual inconsistency; textual findings are often misaligned with the image evidence, which makes predictions unreliable.
Method: CURE fine-tunes a multimodal instruction model on public datasets for phrase grounding, grounded report generation, and anatomy-grounded report generation. Its curriculum adjusts sampling according to model performance, emphasizing harder samples to strengthen spatial and textual alignment.
Result: CURE improves grounding accuracy by +0.37 IoU, raises report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%.
Conclusion: CURE is a data-efficient framework that improves both grounding accuracy and report reliability without additional annotated data.
Abstract: Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
[49] DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction
Cuong Tran Van,Trong-Thang Pham,Ngoc-Son Nguyen,Duy Minh Ho Nguyen,Ngan Le
Main category: cs.CV
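TL;DR: This paper proposes DuFal, a dual-path (frequency-domain plus spatial-domain) framework that addresses the loss of high-frequency detail in sparse-view cone-beam CT reconstruction. Its core components are a high-frequency-enhanced Fourier Neural Operator, spectral-channel factorization, and cross-attention frequency fusion, which together improve reconstruction of anatomical structures under sparse views.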
Details
Motivation: In sparse-view cone-beam CT reconstruction, severe undersampling of X-ray projections makes high-frequency anatomical detail (fine structures) hard to recover. Conventional CNN methods are biased toward low-frequency information and struggle with high-frequency components.
Method: DuFal (Dual-Frequency-Aware Learning) comprises: (1) a dual-path architecture combining frequency-domain and spatial-domain processing; (2) a High-Local Factorized Fourier Neural Operator with a Global High-Frequency Enhanced FNO that captures global frequency patterns and a Local High-Frequency Enhanced FNO that processes spatially partitioned patches to preserve locality; (3) Spectral-Channel Factorization to reduce the FNO parameter count; (4) a Cross-Attention Frequency Fusion module that integrates spatial and frequency features; (5) a Feature Decoder that produces projection representations, followed by Intensity Field Decoding to reconstruct the CT volume.
Result: On the LUNA16 and ToothFairy datasets, DuFal significantly outperforms existing state-of-the-art methods under extremely sparse-view settings, particularly in preserving high-frequency anatomical features such as edges and textures.
Conclusion: By jointly modeling spatial locality and global frequency structure, DuFal mitigates the loss of high-frequency information in sparse CT reconstruction and offers a high-fidelity approach for low-dose clinical imaging.
Abstract: Sparse-view Cone-Beam Computed Tomography reconstruction from limited X-ray projections remains a challenging problem in medical imaging due to the inherent undersampling of fine-grained anatomical details, which correspond to high-frequency components. Conventional CNN-based methods often struggle to recover these fine structures, as they are typically biased toward learning low-frequency information. To address this challenge, this paper presents DuFal (Dual-Frequency-Aware Learning), a novel framework that integrates frequency-domain and spatial-domain processing via a dual-path architecture. The core innovation lies in our High-Local Factorized Fourier Neural Operator, which comprises two complementary branches: a Global High-Frequency Enhanced Fourier Neural Operator that captures global frequency patterns and a Local High-Frequency Enhanced Fourier Neural Operator that processes spatially partitioned patches to preserve spatial locality that might be lost in global frequency analysis. To improve efficiency, we design a Spectral-Channel Factorization scheme that reduces the Fourier Neural Operator parameter count. We also design a Cross-Attention Frequency Fusion module to integrate spatial and frequency features effectively. The fused features are then decoded through a Feature Decoder to produce projection representations, which are subsequently processed through an Intensity Field Decoding pipeline to reconstruct a final Computed Tomography volume. Experimental results on the LUNA16 and ToothFairy datasets demonstrate that DuFal significantly outperforms existing state-of-the-art methods in preserving high-frequency anatomical features, particularly under extremely sparse-view settings.
[50] DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection
Morteza Poudineh,Marc Lalonde
Main category: cs.CV
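TL;DR: This paper proposes a deviation-guided prompt learning framework for few-normal shot anomaly detection (FNSAD), using learnable prompts and deviation-based scoring to improve pixel-level anomaly localization.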
Details
Motivation: Existing few-shot anomaly detection methods discriminate weakly between normal and abnormal prompts and lack a principled scoring mechanism for patch-level anomalies.
Method: Fixed prompt prefixes are replaced with learnable context vectors, and anomaly-specific suffix tokens are added. A deviation loss with Top-K Multiple Instance Learning (MIL) models patch-level features as Gaussian deviations from the normal distribution.
Result: On the MVTecAD and VISA datasets, pixel-level detection performance exceeds PromptAD and other baselines. Ablation studies validate the learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
Conclusion: The framework combines the semantic power of vision-language models with the statistical reliability of deviation-based scoring, improving discriminability, localization, and interpretability in the few-shot setting.
Abstract: Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
[51] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
Yunshan Qi,Lin Zhu,Nan Bao,Yifan Zhao,Jia Li
Main category: cs.CV
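TL;DR: This paper proposes a sensor-physics grounded NeRF framework that uses single-exposure, low dynamic range (LDR) blurry images and their corresponding event data to achieve sharp, high dynamic range (HDR) novel view synthesis.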
Details
Motivation: When using event data for novel view synthesis from blurry LDR images, existing methods ignore the sensor-physics mismatch between camera output and real-world radiance, which leads to poor HDR reconstruction and deblurring.
Method: Proposes a unified sensor-physics grounded NeRF framework: NeRF directly represents HDR scene radiance; the physical interaction between HDR rays and sensor pixels is modeled; a pixel-wise RGB mapping field aligns rendered values with the input LDR images; an event mapping field links scene dynamics to the event sensor output; the two mapping fields are optimized jointly with the NeRF.
Result: Experiments on collected and public datasets show that, using only single-exposure blurry LDR images and the corresponding event data, the method achieves state-of-the-art deblurring HDR novel view synthesis.
Conclusion: Sensor-physics modeling is essential for improving event-assisted HDR novel view synthesis; the framework effectively combines the temporal dynamics captured by events with NeRF's spatial geometric modeling.
Abstract: Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.
[52] Hybrid Vision Transformer_GAN Attribute Neutralizer for Mitigating Bias in Chest X_Ray Diagnosis
Jobeal Solomon,Ali Mohammed Mansoor Alsahag,Seyed Sahand Mohammadi Ziabari
Main category: cs.CV
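TL;DR: This paper replaces the U-Net encoder of the Attribute-Neutral Framework with a Vision Transformer to remove sex- and age-related bias from chest X-ray classifiers more effectively while preserving diagnostic accuracy. Experiments show that, at a moderate edit strength, the ViT neutralizer markedly lowers the sex-recognition AUC (to about 0.80) with a controlled drop in disease-prediction performance.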
Details
Motivation: Chest X-ray AI models are often biased because they exploit shortcut features such as sex and age, leading to systematic underdiagnosis of minority subgroups; existing attribute neutralizers based on convolutional encoders do not fully remove this attribute leakage.
Method: Replaces the U-Net convolutional encoder in the Attribute-Neutral Framework with a data-efficient Image Transformer Small (DeiT-S) Vision Transformer trained on the ChestX-ray14 dataset; generates neutralized images at 11 edit-intensity levels, evaluates attribute leakage (e.g., sex-recognition AUC) with an independent AI judge, and evaluates diagnostic performance on 15 findings with a CNN (macro ROC AUC and subgroup AUC).
Result: At a moderate edit strength (alpha = 0.5), the ViT neutralizer lowers the sex-recognition AUC to about 0.80 (roughly 10 percentage points below the original U-Net framework) while training for only half as many epochs; macro ROC AUC across the 15 findings drops by no more than 5 percentage points, and the worst-case subgroup AUC stays near 0.70.
Conclusion: With global self-attention, Vision Transformers can further suppress demographic attribute leakage without noticeably harming clinical diagnostic performance, offering a feasible path toward fairer chest X-ray AI.
Abstract: Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework's convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70.
These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.
[53] Controllable Layered Image Generation for Real-World Editing
Jinrui Yang,Qing Liu,Yijun Li,Mengwei Ren,Letian Zhang,Zhe Lin,Cihang Xie,Yuyin Zhou
Main category: cs.CV
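TL;DR: This paper proposes LASAGNA, a framework that jointly generates layered images consisting of a realistic background and a high-quality transparent foreground (with visual effects such as shadows and reflections). It supports multiple conditioning inputs (text, foreground, background, and location masks) and introduces a new dataset, LASAGNA-48K, and the first layer-editing benchmark, LASAGNABENCH.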
Details
Motivation: Existing image generation models lack controllability and consistency when editing specific elements of an image; layered representations are flexible, but it remains difficult to generate object layers with coherent compositing relationships and realistic visual effects such as shadows and reflections.
Method: Proposes LASAGNA, a unified framework that jointly generates a background and an RGBA foreground layer with physically grounded visual effects; builds the LASAGNA-48K dataset (clean backgrounds and RGBA foregrounds with realistic effects); designs LASAGNABENCH, the first benchmark for layer editing; supports multimodal conditioning inputs (text, masks, etc.) for greater controllability.
Result: LASAGNA generates highly consistent, coherent images across multiple layers simultaneously, markedly improving identity preservation and visual-effect fidelity and supporting diverse post-editing applications; LASAGNA-48K and LASAGNABENCH will be released publicly.
Conclusion: LASAGNA offers a new paradigm for controllable, editable layered image generation, advancing the area through data, method, and evaluation and encouraging open research in the community.
Abstract: Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers--a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs--text prompts, foreground, background, and location masks--offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community.
The project page is https://rayjryang.github.io/LASAGNA-Page/.
[54] DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views
William Huang,Siyou Pei,Leyi Zou,Eric J. Gonzalez,Ishan Chatterjee,Yang Zhang
Main category: cs.CV
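TL;DR: This paper proposes a dual-stream delta encoder that exploits dorsal hand skin deformation, significantly improving egocentric hand pose estimation accuracy under self-occlusion while reducing model size and enabling new interaction paradigms.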
Details
Motivation: The spread of XR devices has made egocentric hand pose estimation important, but frequent finger occlusion is a challenge; existing methods rely on full-hand geometry and large models and struggle with severe occlusion.
Method: Proposes a dual-stream delta encoder operating on dorsal hand images that learns hand pose by contrasting the visual features of a dynamic hand with those of a relaxed baseline pose.
Result: In self-occluded scenarios with fingers >=50% occluded, MPJAE is 18% lower than the state of the art; accurate estimation is achieved using only cropped dorsal images, and the method supports new interactions such as index-finger pinch, tap, and isometric press detection.
Conclusion: The method improves robustness to occlusion while reducing model size, extending the reach of low-latency, highly reliable hand interaction in XR.
Abstract: The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >=50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" without visible movement while minimizing model size.
[55] VIOLA: Towards Video In-Context Learning with Minimal Annotations
Ryo Fujii,Hideo Saito,Ryo Hachiuma
Main category: cs.CV
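TL;DR: This paper proposes VIOLA, a framework that uses density-uncertainty-weighted sampling together with confidence-aware retrieval and prompting to adapt multimodal large language models to new video domains efficiently with very few expert annotations.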
Details
Motivation: The generalization of multimodal large language models to video domains is limited by scarce labeled data, especially in specialized settings such as industrial or surgical environments where large amounts of expert annotation are hard to obtain; conventional in-context learning relies on large annotated pools, which is impractical.
Method: Proposes the VIOLA framework: (1) density-uncertainty-weighted sampling, which selects samples that are diverse, representative, and informative under a very small annotation budget; (2) a hybrid demonstration pool combined with confidence-aware retrieval (scoring by both similarity and confidence) and confidence-aware prompting (letting the model distinguish verified labels from noisy pseudo-labels).
Result: Experiments on 9 video benchmarks with 4 MLLMs show that VIOLA significantly outperforms a range of baselines in low-resource settings, achieving robust adaptation at very low annotation cost.
Conclusion: VIOLA offers an efficient, practical, training-free adaptation paradigm for multimodal large language models in specialized video settings where annotations are scarce.
Abstract: Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require expert annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels.
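The composite retrieval score mentioned above can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: the convex combination of cosine similarity and confidence, the `weight` parameter, the dictionary fields, and the function names `retrieve` and `format_demo` are hypothetical and are not taken from VIOLA.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pool, k=2, weight=0.5):
    """Return the k pool items with the highest composite score.

    ASSUMPTION: score = (1 - weight) * similarity + weight * confidence.
    Each pool item is a dict with 'embedding', 'label', 'confidence' (in [0, 1])
    and 'verified' (True for expert labels, False for pseudo-labels).
    """
    scored = []
    for item in pool:
        similarity = cosine(query, item["embedding"])
        score = (1 - weight) * similarity + weight * item["confidence"]
        scored.append((score, item))
    # Sort on the score only, so dictionaries are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


def format_demo(item):
    """Render a demonstration so the prompt states how reliable its label is."""
    if item["verified"]:
        tag = "verified label"
    else:
        tag = f"pseudo-label, confidence {item['confidence']:.2f}"
    return f"Answer: {item['label']} ({tag})"
```

With `weight=0.5`, an expert-labelled item and a low-confidence pseudo-labelled item that are equally similar to the query are ranked in that order, and `format_demo` exposes the difference in the prompt text so the model can weigh the two demonstrations differently.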
Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
[56] Relative Classification Accuracy: A Calibrated Metric for Identity Consistency in Fine-Grained K-pop Face Generation
Sylvey Lin,Eranki Vasistha
Main category: cs.CV
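TL;DR: This paper proposes a new evaluation metric, RCA, for measuring the semantic controllability of class-conditional DDPMs on K-pop idol face generation. It finds that although the model achieves high visual quality (FID = 8.93), it suffers from severe semantic mode collapse (RCA = 0.27), especially for visually ambiguous identities, and attributes this to resolution constraints and intra-gender ambiguity.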
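A minimal sketch of how a metric like RCA might be computed, assuming that "normalizing against an oracle classifier's baseline" means dividing the classifier's accuracy on generated samples by the same classifier's accuracy on real held-out data. That ratio form and the function names are assumptions made for illustration; the paper's exact definition may differ.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the label."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


def relative_classification_accuracy(gen_intended, gen_predicted,
                                     real_true, real_predicted):
    """ASSUMED form: accuracy on generated samples / oracle accuracy on real data.

    gen_intended:   class each generated sample was conditioned on
    gen_predicted:  class the oracle classifier assigns to each generated sample
    real_true:      ground-truth classes of real held-out samples
    real_predicted: oracle classifier predictions on those real samples
    """
    oracle_accuracy = accuracy(real_true, real_predicted)
    if oracle_accuracy == 0:
        raise ValueError("oracle accuracy is zero; the ratio is undefined")
    return accuracy(gen_intended, gen_predicted) / oracle_accuracy
```

Under this reading, a value near 1 would mean generated identities are recognized about as reliably as real ones, while a low value such as the reported 0.27 would mean most generated faces are not recognized as the intended identity even after accounting for classifier error.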
Details
Motivation: Standard metrics (e.g., FID, IS) struggle to detect identity misalignment in fine-grained single-domain tasks such as K-pop idol face generation, so a more semantically sensitive evaluation method is needed.
Method: Proposes Relative Classification Accuracy (RCA), which normalizes the classification accuracy of generated samples against an oracle classifier's baseline; evaluates a class-conditional DDPM on a 32x32 K-pop idol face dataset and analyzes failure modes with confusion matrices.
Result: The model's FID is 8.93 (high visual quality), but its RCA is only 0.27, indicating severe semantic mode collapse; errors are concentrated among visually similar idols and are strongly affected by resolution and intra-gender identity ambiguity.
Conclusion: RCA provides a rigorous, interpretable standard for verifying identity consistency in conditional generative models and reveals a key trade-off between high-fidelity generation and fine-grained semantic control.
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in high-fidelity image generation. However, evaluating their semantic controllability, specifically for fine-grained, single-domain tasks, remains challenging. Standard metrics like FID and Inception Score (IS) often fail to detect identity misalignment in such specialized contexts. In this work, we investigate Class-Conditional DDPMs for K-pop idol face generation (32x32), a domain characterized by high inter-class similarity. We propose a calibrated metric, Relative Classification Accuracy (RCA), which normalizes generative performance against an oracle classifier's baseline. Our evaluation reveals a critical trade-off: while the model achieves high visual quality (FID 8.93), it suffers from severe semantic mode collapse (RCA 0.27), particularly for visually ambiguous identities. We analyze these failure modes through confusion matrices and attribute them to resolution constraints and intra-gender ambiguity. Our framework provides a rigorous standard for verifying identity consistency in conditional generative models.
[57] Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization for Cross-Subject EEG Emotion Recognition
Weiwei Wu,Yueyang Li,Yuhu Shi,Weiming Zeng,Lang Qin,Yang Yang,Ke Zhou,Zhiguo Zhang,Wai Ting Siok,Nizhuan Wang
Main category: cs.CV
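TL;DR: This paper proposes RSM-CoDG, a framework that combines brain-region priors, multi-scale temporal modeling, and a collaborative domain generalization strategy to improve the robustness and generalization of cross-subject EEG emotion recognition.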
Details
Motivation: Cross-subject EEG emotion recognition faces large inter-subject variability, severe distribution shift, and high spatiotemporal complexity of emotion-related neural representations; existing methods struggle to combine cross-subject alignment, multi-scale dynamic modeling, and debiased generalization in a unified framework.
Method: Proposes Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization (RSM-CoDG): (1) builds region-level spatial representations from functional brain-region partitioning; (2) uses multi-scale temporal modeling to capture the dynamic evolution of emotion-related neural activity; (3) introduces a collaborative domain generalization strategy with multidimensional constraints to suppress subject-specific bias.
Result: Significantly outperforms existing methods on the SEED series datasets, confirming strong generalization and robustness on unseen subjects.
Conclusion: By combining neuroscience priors with collaborative domain generalization, RSM-CoDG effectively mitigates cross-subject differences within a unified framework, offering a generalizable paradigm for EEG emotion recognition.
Abstract: Cross-subject EEG-based emotion recognition (EER) remains challenging due to strong inter-subject variability, which induces substantial distribution shifts in EEG signals, as well as the high complexity of emotion-related neural representations in both spatial organization and temporal evolution. Existing approaches typically improve spatial modeling, temporal modeling, or generalization strategies in isolation, which limits their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework. To address these gaps, we propose a Region-aware Spatiotemporal Modeling framework with Collaborative Domain Generalization (RSM-CoDG) for cross-subject EEG emotion recognition. RSM-CoDG incorporates neuroscience priors derived from functional brain region partitioning to construct region-level spatial representations, thereby improving cross-subject comparability. It also employs multi-scale temporal modeling to characterize the dynamic evolution of emotion-evoked neural activity. In addition, the framework employs a collaborative domain generalization strategy, incorporating multidimensional constraints to reduce subject-specific bias in a fully unseen target subject setting, which enhances the generalization to unknown individuals. Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness.
The source code is available at https://github.com/RyanLi-X/RSM-CoDG.
[58] Explainable Deepfake Detection with RL Enhanced Self-Blended Images
Ning Jiang,Dingheng Zeng,Yanhong Liu,Haiyang Yi,Shijie Yu,Minghe Weng,Haifeng Shen,Ying Li
Main category: cs.CV
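TL;DR: This paper proposes an automated chain-of-thought (CoT) data generation framework based on Self-Blended Images and a reinforcement learning (RL)-enhanced deepfake detection framework, addressing the scarcity of high-quality annotated data that limits multimodal large language models in explainable deepfake detection.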
Details
Motivation: Existing deepfake detection methods lack explainability; multimodal large language models (MLLMs) are promising but are limited by costly, hard-to-obtain fine-grained forgery attribution annotations; reinforcement learning has shown advantages in visual tasks, especially cross-domain generalization, and is worth exploring.
Method: Proposes an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, combined with an RL-enhanced detection framework that includes a tailored reward mechanism and feedback-driven synthetic data generation.
Result: Extensive experiments validate the CoT data construction pipeline, the reward mechanism, and the feedback-driven synthetic data generation; performance is competitive with state-of-the-art (SOTA) methods on multiple cross-dataset benchmarks.
Conclusion: The method lowers annotation cost while improving the explainability and generalization of MLLMs for deepfake detection, confirming the practical value of RL for this task.
Abstract: Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging, particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at https://github.com/deon1219/rlsbi.
[59] Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception
Bo Yuan,Danpei Zhao,Wentao Li,Tian Li,Zhiguo Jiang
Main category: cs.CV
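TL;DR: This paper proposes a continual panoptic perception (CPP) framework that combines multimodal and multi-task continual learning. Using a collaborative cross-modal encoder, a malleable knowledge inheritance module, a cross-modal consistency constraint, and an asymmetric pseudo-labeling strategy, it mitigates catastrophic forgetting and semantic obfuscation and improves joint pixel-level, instance-level, and image-level perception.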
Details
Motivation: Existing continual learning focuses mainly on single-task settings and copes poorly with the semantic obfuscation and catastrophic forgetting that arise in multi-task, multimodal settings, which limits how comprehensive intelligent perception systems can be.
Method: Proposes an end-to-end CPP model comprising a collaborative cross-modal encoder (CCE), a malleable knowledge inheritance module based on contrastive feature distillation and instance distillation, a cross-modal consistency constraint, and an asymmetric pseudo-labeling strategy that requires no exemplar replay.
Result: Experiments on multimodal datasets and diverse continual learning tasks show superior performance, particularly on fine-grained CL tasks.
Conclusion: CPP unifies multimodal and multi-task continual learning and improves panoptic perception, offering a new paradigm for building robust, evolvable intelligent vision systems.
Abstract: Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. However, pioneering research has predominantly focused on single-task CL, which restricts the potential in multi-task and multimodal scenarios. Beyond the well-known issue of catastrophic forgetting, multi-task CL also brings semantic obfuscation across multimodal alignment, leading to severe model degradation during incremental training steps. In this paper, we extend CL to continual panoptic perception (CPP), integrating multimodal and multi-task CL to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. We formalize the CL task in multimodal scenarios and propose an end-to-end continual panoptic perception model. Concretely, the CPP model features a collaborative cross-modal encoder (CCE) for multimodal embedding. We also propose a malleable knowledge inheritance module via contrastive feature distillation and instance distillation, addressing catastrophic forgetting in a task-interactive boosting manner. Furthermore, we propose a cross-modal consistency constraint and develop CPP+, ensuring multimodal semantic alignment for model updating under multi-task incremental scenarios. Additionally, our proposed model incorporates an asymmetric pseudo-labeling scheme, enabling the model to evolve without exemplar replay. Extensive experiments on multimodal datasets and diverse CL tasks demonstrate the superiority of the proposed model, particularly in fine-grained CL tasks.
[60] SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based Occupancy Prediction
Zichen Yu,Quanli Liu,Wei Wang,Liyong Zhang,Xiaoguang Zhao
Main category: cs.CV
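TL;DR: This paper proposes SuperOcc, a framework for superquadric-based 3D occupancy prediction. Through cohesive temporal modeling, multi-superquadric decoding, and efficient superquadric-to-voxel splatting, it addresses the shortcomings of existing methods in temporal modeling, the trade-off between sparsity and geometric expressiveness, and computational efficiency, achieving SOTA performance on the SurroundOcc and Occ3D benchmarks.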
Details
Motivation: Most existing 3D occupancy prediction methods use dense scene representations and overlook the inherent sparsity of real driving scenes; the emerging 3D superquadric representation is sparse and geometrically expressive but still suffers from insufficient temporal modeling, a difficult trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting.
Method: Proposes SuperOcc with three core designs: (1) a cohesive temporal modeling mechanism that exploits both view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy that increases geometric expressiveness without sacrificing query sparsity; (3) an efficient superquadric-to-voxel splatting scheme that improves computational efficiency.
Result: Achieves SOTA performance on the SurroundOcc and Occ3D benchmarks while maintaining higher computational efficiency.
Conclusion: SuperOcc overcomes key weaknesses of existing superquadric methods and strikes a better balance between accuracy and efficiency, confirming the potential of sparse geometric representations for 3D occupancy prediction.
Abstract: 3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at https://github.com/Yzichen/SuperOcc.
[61] Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams
Zhenghui Guo,Yuanbin Man,Junyuan Sheng,Bowen Lin,Ahmed Ahmed,Bo Jiang,Boyuan Zhang,Miao Yin,Sian Jin,Omprakash Gnawal,Chengming Zhang
Main category: cs.CV
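TL;DR: This paper proposes Event-VStream, an event-aware framework for video stream understanding that triggers language generation at detected, semantically coherent event boundaries and stores event embeddings in a persistent memory bank, enabling long-horizon reasoning at low latency.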
Details
Motivation: Existing video stream understanding methods process redundant frames and quickly forget past context; fixed-interval decoding or cache pruning tends to produce repetitive outputs or lose crucial temporal information.
Method: Event-VStream detects meaningful state transitions by integrating motion, semantic, and predictive cues and represents continuous video as a sequence of discrete, semantically coherent events; it triggers language generation only at event boundaries and consolidates each event embedding into a persistent memory bank to support long-range reasoning.
Result: Improves over the VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime; performs close to the specialized Flash-VStream-7B while using only a general-purpose LLaMA-3-8B text backbone; maintains about a 70% GPT-5 win rate on 2-hour Ego4D streams.
Conclusion: Event-VStream effectively alleviates redundancy and forgetting in real-time long-video understanding, improving long-horizon modeling and practical performance while keeping latency low.
Abstract: Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.
[62] Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling
Hongyang Wei,Hongbo Liu,Zidong Wang,Yi Peng,Baixin Xu,Size Wu,Xuying Zhang,Xianglong He,Zexiang Liu,Peiyu Wang,Xuchen Song,Yangguang Li,Yang Liu,Yahui Zhou
Main category: cs.CV
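TL;DR: This paper presents Skywork UniPic 3.0, a unified multimodal framework that supports single-image editing and multi-image composition (with a focus on human-object interaction, HOI, tasks). Through a new data pipeline, a sequence-modeling paradigm, and an accelerated inference strategy, it reaches SOTA in both performance and speed.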
Details
Motivation: Community interest in multi-image composition (e.g., Nano-Banana, Seedream 4.0) is surging, but existing models have not disclosed specific technical details for high-quality fusion; statistical analysis shows that HOI is the most sought-after category, which calls for a targeted solution.
Method: Builds the unified multimodal framework Skywork UniPic 3.0; designs an HOI-oriented data collection, filtering, and synthesis pipeline for multi-image composition; formulates multi-image composition as a sequence-modeling problem; introduces trajectory mapping and distribution matching for high-fidelity generation in 8 steps.
Result: Reaches SOTA on the single-image editing benchmark and surpasses Nano-Banana and Seedream 4.0 on the multi-image composition benchmark; achieves strong performance with only 700K high-quality samples; speeds up inference by 12.5x (8-step generation).
Conclusion: The unified framework, HOI-oriented data strategy, and sequence-based training paradigm address the consistency and quality challenges of multi-image composition, providing a scalable, efficient, and open-source baseline for the task.
Abstract: The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary (1~6) number and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling.
Skywork UniPic 3.0 achieves state-of-the-art performance on the single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on the multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models and dataset are publicly available.
[63] Consistency-Regularized GAN for Few-Shot SAR Target Recognition
Yikui Zhai,Shikuang Liu,Wenlve Zhou,Hongsheng Zhang,Zhiheng Zhou,Xiaolin Tian,C. L. Philip Chen
Main category: cs.CV
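TL;DR: This paper proposes a consistency-regularized generative adversarial network (Cr-GAN) that synthesizes high-quality, diverse data from very few SAR image samples to support few-shot recognition. Its core is a dual-branch discriminator and a dual-domain cycle consistency mechanism, which enable stable training under data scarcity and improve downstream self-supervised learning.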
Details
Motivation: Few-shot SAR image recognition is limited by extreme data scarcity; conventional GANs need large amounts of training data, which contradicts the few-shot premise, so a generative framework that can stably produce high-quality data from very few samples is needed.
Method: Proposes Cr-GAN, comprising a dual-branch discriminator (decoupling adversarial training from representation learning), channel-wise feature interpolation (creating novel latent features), and a dual-domain cycle consistency mechanism (ensuring semantic integrity); it adapts to various GAN architectures and is used to boost multiple self-supervised learning algorithms.
Result: Reaches 71.21% and 51.64% accuracy on the MSTAR and SRSDD datasets, respectively, in the 8-shot setting, significantly surpassing mainstream baselines; its parameter count is only about 1/5 that of advanced diffusion models.
Conclusion: Cr-GAN eases the paradox that generative models for few-shot SAR recognition themselves depend on large datasets, offering a new paradigm for generative modeling and downstream learning in data-scarce settings.
Abstract: Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5 of the parameters of state-of-the-art diffusion models. Code is available at: https://github.com/yikuizhai/Cr-GAN.
[64] Performance-guided Reinforced Active Learning for Object Detection
Zhixuan Liang,Xingyu Zeng,Rui Zhao,Ping Luo
Main category: cs.CV
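TL;DR: This paper proposes MGRAL, a performance-guided reinforced active learning method for object detection. A reinforcement learning agent selects the most informative samples using mAP improvement as its reward signal, and an unsupervised fast look-up table reduces computational overhead; the method achieves the best active learning results on PASCAL VOC and COCO.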
Details
Motivation: Existing active learning methods judge data informativeness mostly by data distribution or intrinsic information content, without tying it directly to downstream task performance (such as mAP in object detection), so sample selection is decoupled from actual performance gains.
Method: Proposes the MGRAL framework: uses expected model output change as the informativeness measure; designs a policy-gradient reinforcement learning sampling agent that optimizes batch selection with mAP improvement as the reward; introduces an unsupervised fast look-up table to approximate mAP and cut computational cost.
Result: On object detection with the PASCAL VOC and COCO datasets, MGRAL achieves the highest active learning curve and provides convincing visualizations.
Conclusion: MGRAL is the first to explicitly model mAP improvement as the optimization objective of active learning, establishing a new paradigm for reinforcement learning-driven active object detection.
Abstract: Active learning (AL) strategies aim to train high-performance models with minimal labeling efforts, only selecting the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data's distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e., mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness. To address the combinatorial explosion challenge of batch sample selection and the non-differentiable correlation between model performance and selected batches, MGRAL skillfully employs a reinforcement learning-based sampling agent that optimizes selection using policy gradient with mAP improvement as reward. Moreover, to reduce the computational overhead of mAP estimation with unlabeled samples, MGRAL uses an unsupervised approach with fast look-up tables, ensuring feasible deployment. We evaluate MGRAL's active learning performance on detection tasks over PASCAL VOC and COCO benchmarks. Our approach demonstrates the highest AL curve with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.
[65] Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs
Mingyu Yu,Lana Liu,Zhehao Zhao,Wei Wang,Sujuan Qin
Main category: cs.CV
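TL;DR: This paper proposes Beyond Visual Safety (BVS), an image-text pair jailbreaking framework for probing the visual safety boundaries of multimodal large language models (MLLMs). Using a 'reconstruction-then-generation' strategy, it reaches a jailbreak success rate of up to 98.21%, exposing critical vulnerabilities in the visual safety alignment of current MLLMs.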
Details
Motivation: Prior work has explored the security vulnerabilities of MLLMs, but their visual safety boundaries remain insufficiently studied, and a systematic method for assessing visual content safety is needed.
Method: Proposes the BVS framework, which uses a 'reconstruction-then-generation' strategy with neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs and induce MLLMs to generate harmful images.
Result: BVS achieves a 98.21% jailbreak success rate against GPT-5 (12 January 2026 release), significantly higher than existing methods.
Conclusion: Current MLLMs have serious flaws in visual safety alignment; BVS exposes their fragility when handling joint image-text inputs and provides an important basis for subsequent safety hardening.
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby inducing MLLMs to generate harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
[66] Enhanced LULC Segmentation via Lightweight Model Refinements on ALOS-2 SAR Data
Ali Caglayan,Nevrez Imamoglu,Toru Kouyama
Main category: cs.CV
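TL;DR: This paper proposes a nationwide land-use/land-cover semantic segmentation method for Japan based on ALOS-2 SAR data. Three lightweight refinements mitigate boundary blurring, missed thin structures, and degraded performance on long-tailed classes in dense SAR prediction, yielding consistent gains on both LULC segmentation and water detection.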
Details
Motivation: To address common problems of SAR imagery in dense prediction (over-smoothed boundaries, missed thin structures, and degraded rare-class performance under long-tailed labels) without increasing pipeline complexity.
Method: Building on SAR-W-MixMAE self-supervised pretraining, introduces three lightweight refinements: (i) injecting high-resolution features into the multi-scale decoder; (ii) a progressive refine-up decoding head that alternates convolutional refinement and stepwise upsampling; (iii) an α-scale factor that tempers class reweighting in the focal+dice loss.
Result: Delivers consistent improvements on the Japan-wide ALOS-2 LULC benchmark, especially for under-represented classes, and improves water detection accuracy under standard evaluation metrics.
Conclusion: Without increasing model complexity, the method mitigates key challenges of SAR semantic segmentation, offering a practical and scalable approach to fine-grained, national-scale remote-sensing land-cover mapping.
Abstract: This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes (boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels) without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an α-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
[67] Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework
Shubham Shukla,Kunal Sonalkar
Main category: cs.CV
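TL;DR: This paper proposes a three-tier evaluation framework for systematically assessing vision-language models (VLMs) on fine-grained fashion attribute prediction, with a focus on disentangling attribute applicability detection from classification. Experiments show that zero-shot VLMs substantially outperform the conventional embedding-plus-classifier approach but hit a clear bottleneck in applicability detection (recognizing the NA class), while efficient VLMs come close to flagship performance.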
Details
Motivation: Fashion attributes are conditional (e.g., "outer fabric" is meaningful only when an outer garment is visible); existing VLM evaluations lack a systematic assessment of whether an attribute applies (NA), so applicability detection needs to be disentangled from fine-grained classification. Method: Builds a three-tier evaluation framework: (1) overall task performance (including the NA class), (2) attribute applicability detection (NA or not), and (3) fine-grained classification on determinable attributes; on the DeepFashion-MultiModal dataset, compares nine VLMs (including the GPT-5 and Gemini 2.5 families) against supervised classifiers built on Fashion-CLIP embeddings. Result: Zero-shot VLMs reach 64.0% macro-F1 (three times the baseline); fine-grained classification (Tier 3) reaches 70.8% F1, but applicability detection (Tier 2) reaches only 34.1% NA-F1; efficient models (e.g., GPT-5 Mini) achieve over 90% of flagship performance. Conclusion: Attribute applicability detection is the key bottleneck for current VLMs in multi-attribute fashion prediction; the three-tier framework pinpoints the error source (visibility judgment or classification error) and provides a diagnostic basis for optimizing real systems; efficient VLMs offer cost-effective deployment potential. Abstract: Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including the NA class, which indicates that an attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning the attribute does not exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes.
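The three tiers described above can be sketched for a single attribute as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the `"NA"` label string, and the choice to compute Tier 3 over instances whose gold label is not NA are all assumptions.

```python
NA = "NA"  # assumed label meaning "attribute not applicable / not visible"

def f1_for_class(gold, pred, cls):
    """One-vs-rest F1, treating `cls` as the positive label."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)

def three_tier_report(gold, pred):
    """Hypothetical three-tier breakdown for one attribute."""
    # Tier 1: macro-F1 over every gold class, NA included.
    tier1 = macro_f1(gold, pred, sorted(set(gold)))
    # Tier 2: applicability detection, scored as F1 on the NA class.
    tier2 = f1_for_class(gold, pred, NA)
    # Tier 3 (assumption): macro-F1 restricted to instances whose gold label is determinable.
    kept = [(g, p) for g, p in zip(gold, pred) if g != NA]
    gold3 = [g for g, _ in kept]
    pred3 = [p for _, p in kept]
    tier3 = macro_f1(gold3, pred3, sorted(set(gold3)))
    return {"tier1_macro_f1": tier1, "tier2_na_f1": tier2, "tier3_macro_f1": tier3}

report = three_tier_report(
    gold=["NA", "cotton", "denim", "NA", "cotton"],
    pred=["cotton", "cotton", "denim", "NA", "denim"],
)
```

Separating the tiers this way makes it visible when a model classifies determinable attributes well yet misses NA cases, which is the failure pattern the abstract reports.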
Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
[68] VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
Chenglin Li,Qianglong Chen,Feng Han,Yikun Wang,Xingxi Yin,Yan Gong,Ruilin Li,Yin Zhang,Jiaqi Wang
Main category: cs.CV
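TL;DR: This paper proposes VideoThinker, an agentic video large language model trained entirely on synthetic tool-interaction trajectories. Multi-step tool-use sequences are generated in caption space and mapped back to video frames to build a large-scale video-tool reasoning dataset, overcoming the information loss caused by static sampling in long-video understanding and markedly improving temporal localization and dynamic reasoning on long videos.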
Details
Motivation: Existing video LLMs rely on static reasoning over uniformly sampled frames, which makes precise temporal localization difficult and loses substantial information in long videos; yet building data for agentic video understanding (e.g., temporal retrieval, spatial/temporal zoom) requires a model that already has strong long-video understanding, creating a circular dependency. Method: VideoThinker first converts videos into rich captions, uses a strong agentic language model to generate multi-step tool-use sequences in caption space, and then replaces the captions with the corresponding video frames to build a large-scale interleaved video-tool reasoning dataset; training data is synthesized without requiring any real long-video understanding capability. Result: VideoThinker significantly outperforms caption-only language model agents and strong video model baselines on long-video benchmarks, exhibiting dynamic reasoning, adaptive temporal exploration, and multi-step tool use. Conclusion: Tool-augmented synthetic data together with an adaptive retrieval-and-zoom reasoning paradigm can break the long-video understanding bottleneck and opens a new path for agentic video understanding. Abstract: Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use.
Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool-augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.
[69] FAIR-ESI: Feature Adaptive Importance Refinement for Electrophysiological Source Imaging
Linyong Zou,Liang Zhang,Xiongfei Wang,Jia-Hong Gao,Yi Sun,Shurong Sheng,Kuntao Xiao,Wanli Yang,Pengfei Teng,Guoming Luan,Zhao Lv,Zikang Xu
Main category: cs.CV
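TL;DR: This paper proposes FAIR-ESI, a framework that improves electrophysiological source imaging accuracy through multi-view adaptive refinement of feature importance (spectral, temporal, and patch-wise), validated on simulated and clinical data.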
Details
Motivation: Electrophysiological source imaging (ESI) is a key technique for diagnosing brain disorders, but accurate feature selection and refinement remains the central challenge for high-precision ESI. Method: Proposes the FAIR-ESI framework, comprising FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-driven patch-wise feature refinement, to adaptively optimize feature importance across views. Result: Experiments on two simulation datasets (with different configurations) and two real clinical datasets show significantly improved ESI accuracy and practical value for brain disorder diagnosis and brain function research. Conclusion: FAIR-ESI offers an interpretable, adaptive multi-view feature refinement paradigm for ESI that may advance precise brain disorder diagnosis and the study of neural mechanisms. Abstract: An essential technique for diagnosing brain disorders is electrophysiological source imaging (ESI). While model-based optimization and deep learning methods have achieved promising results in this field, the accurate selection and refinement of features remains a central challenge for precise ESI. This paper proposes FAIR-ESI, a novel framework that adaptively refines feature importance across different views, including FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement. Extensive experiments on two simulation datasets with diverse configurations and two real-world clinical datasets validate our framework's efficacy, highlighting its potential to advance brain disorder diagnosis and offer new insights into brain function.
[70] Sub-Region-Aware Modality Fusion and Adaptive Prompting for Multi-Modal Brain Tumor Segmentation
Shadi Alijani,Fereshteh Aghaee Meibodi,Homayoun Najjaran
Main category: cs.CV
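TL;DR: This paper proposes a new framework for adapting foundation models to multi-modal medical imaging, combining sub-region-aware modality attention with adaptive prompt engineering, and significantly improves segmentation accuracy on the BraTS 2020 dataset, especially in the necrotic core region.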
Details
Motivation: Existing foundation models struggle to fuse multi-source information effectively and to adapt to the heterogeneity of pathological tissue in multi-modal medical imaging. Method: Proposes a sub-region-aware modality attention mechanism and adaptive prompt engineering, letting the model dynamically choose the best modality combination for each tumor sub-region and leverage the foundation model's inherent capabilities to improve segmentation accuracy. Result: Significantly outperforms baseline methods on the BraTS 2020 dataset, with notable gains in the necrotic core sub-region. Conclusion: The framework offers a principled and effective new paradigm for adapting foundation models to multi-modal medical imaging. Abstract: The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.
[71] Breaking the Resolution Barrier: Arbitrary-resolution Deep Image Steganography Framework
Xinjue Hu,Chi Wang,Boyu Wang,Xiang Zhang,Zhenshan Tan,Zhangjie Fu
Main category: cs.CV
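TL;DR: This paper proposes ARDIS, the first arbitrary-resolution deep image steganography framework. Using a frequency decoupling architecture and an implicit resolution coding strategy, it addresses the detail loss and blind-recovery difficulties that resolution mismatch causes in conventional methods, and markedly improves invisibility and cross-resolution recovery fidelity.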
Details
Motivation: Existing deep image steganography methods require the secret image and the cover image to share the same resolution, so secret images of other resolutions must be resampled beforehand (losing detail) and cannot be accurately recovered when the original resolution is unknown. Method: Proposes the ARDIS framework: 1) in the hiding stage, a frequency decoupling architecture decomposes the secret image into a resolution-aligned global basis and a resolution-agnostic high-frequency latent; 2) in the recovery stage, a latent-guided implicit reconstructor uses the high-frequency latent code to modulate a continuous implicit function that renders high-frequency residuals; 3) an implicit resolution coding strategy maps discrete resolution values to dense feature maps and embeds them in the redundant space of the feature domain, enabling blind recovery. Result: ARDIS significantly outperforms state-of-the-art methods in invisibility and cross-resolution recovery fidelity. Conclusion: ARDIS shifts the deep image steganography paradigm from discrete mapping to reference-guided continuous signal reconstruction, achieving for the first time high-quality, blind hiding and recovery of secret images at arbitrary resolution. Abstract: Deep image steganography (DIS) has achieved significant results in capacity and invisibility. However, current paradigms enforce the secret image to maintain the same resolution as the cover image during hiding and revealing. This leads to two challenges: secret images with inconsistent resolutions must undergo resampling beforehand, which results in detail loss during recovery, and the secret image cannot be recovered to its original resolution when the resolution value is unknown. To address these, we propose ARDIS, the first Arbitrary Resolution DIS framework, which shifts the paradigm from discrete mapping to reference-guided continuous signal reconstruction. Specifically, to minimize the detail loss caused by resolution mismatch, we first design a Frequency Decoupling Architecture in the hiding stage. It disentangles the secret into a resolution-aligned global basis and a resolution-agnostic high-frequency latent to hide in a fixed-resolution cover. Second, for recovery, we propose a Latent-Guided Implicit Reconstructor to perform deterministic restoration. The recovered detail latent code modulates a continuous implicit function to accurately query and render high-frequency residuals onto the recovered global basis, ensuring faithful restoration of original details. Furthermore, to achieve blind recovery, we introduce an Implicit Resolution Coding strategy. By transforming discrete resolution values into dense feature maps and hiding them in the redundant space of the feature domain, the reconstructor can correctly decode the secret's resolution directly from the steganographic representation.
Experimental results demonstrate that ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity.
[72] White-Box mHC: Electromagnetic Spectrum-Aware and Interpretable Stream Interactions for Hyperspectral Image Classification
Yimin Zhu,Lincoln Linlin Xu,Zhengsen Xu,Zack Dewis,Mabel Heffring,Saeid Taleghanidoozdoozan,Motasem Alkayid,Quinn Ledingham,Megan Greenwood
Main category: cs.CV
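TL;DR: This paper proposes ES-mHC, a physical-spectrum-aware white-box hyper-connection framework for hyperspectral image classification that explicitly models interactions among electromagnetic spectrum groupings with structured, directional matrices, improving interpretability and understanding of the model's internal mechanisms.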
Details
Motivation: Existing deep learning models for hyperspectral image classification rely on opaque spectral-spatial feature mixing, which leads to poor interpretability and makes internal decision mechanisms hard to understand. Method: Proposes the ES-mHC framework, which separates feature representation from interaction structure, uses hyper-connection matrices in the residual streams to explicitly model directional interactions among electromagnetic spectrum groupings, and supports visualization and spatial analysis. Result: Experiments show that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors; a higher expansion rate accelerates the emergence of structured interaction patterns. Conclusion: ES-mHC turns hyperspectral image classification from pure black-box prediction into a structurally transparent, partially white-box learning process, improving interpretability and physical understandability. Abstract: In hyperspectral image classification (HSIC), most deep learning models rely on opaque spectral-spatial feature mixing, limiting their interpretability and hindering understanding of internal decision mechanisms. We present physical spectrum-aware white-box mHC, named ES-mHC, a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (the residual streams in mHC) using structured, directional matrices. By separating feature representation from interaction structure, ES-mHC promotes electromagnetic spectrum grouping specialization, reduces redundancy, and exposes internal information flow that can be directly visualized and spatially analyzed. Using hyperspectral image classification as a representative testbed, we demonstrate that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model's internal dynamics. Furthermore, we find that increasing the expansion rate accelerates the emergence of structured interaction patterns. These results suggest that ES-mHC transforms HSIC from a purely black-box prediction task into a structurally transparent, partially white-box learning process.
[73] Atlas-Assisted Segment Anything Model for Fetal Brain MRI (FeTal-SAM)
Qi Zeng,Weide Liu,Bo Li,Ryne Didier,P. Ellen Grant,Davood Karimi
Main category: cs.CV
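TL;DR: FeTal-SAM is a new SAM-based method that combines atlas-guided dense prompts with bounding-box prompts for flexible, retraining-free segmentation of fetal brain MRI structures, balancing accuracy and generalization.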
Details
Motivation: To address the problems that conventional deep learning methods for fetal brain MRI segmentation depend on large annotated datasets, adapt poorly to changing label definitions, and cannot tell whether a segmentation is driven by image contrast or by spatial priors. Method: Spatially aligned label templates generated by multi-atlas registration serve as dense prompts and, together with bounding-box prompts, are fed to the SAM decoder for per-structure binary segmentation; the results are then fused to reconstruct the full 3D segmentation volume. Result: Robust performance on dHCP and an in-house dataset; for high-contrast structures such as the cortical plate and cerebellum, Dice scores are comparable to specifically trained models; any user-specified anatomical structure can be segmented, though accuracy is slightly lower for low-contrast structures such as the hippocampus and amygdala. Conclusion: FeTal-SAM is a general-purpose fetal brain MRI segmentation framework that avoids frequent retraining and has potential for clinical adaptation, advancing the practical use of foundation models in fetal image analysis. Abstract: This paper presents FeTal-SAM, a novel adaptation of the Segment Anything Model (SAM) tailored for fetal brain MRI segmentation. Traditional deep learning methods often require large annotated datasets for a fixed set of labels, making them inflexible when clinical or research needs change. By integrating atlas-based prompts and foundation-model principles, FeTal-SAM addresses two key limitations in fetal brain MRI segmentation: (1) the need to retrain models for varying label definitions, and (2) the lack of insight into whether segmentations are driven by genuine image contrast or by learned spatial priors. We leverage multi-atlas registration to generate spatially aligned label templates that serve as dense prompts, alongside a bounding-box prompt, for SAM's segmentation decoder. This strategy enables binary segmentation on a per-structure basis, which is subsequently fused to reconstruct the full 3D segmentation volumes. Evaluations on two datasets, the dHCP dataset and an in-house dataset, demonstrate FeTal-SAM's robust performance across gestational ages. Notably, for well-contrasted structures such as the cortical plate and cerebellum, it achieves Dice scores comparable to state-of-the-art baselines that were trained for each dataset and label definition, while maintaining the flexibility to segment any user-specified anatomy. Although slightly lower accuracy is observed for subtle, low-contrast structures (e.g., hippocampus, amygdala), our results highlight FeTal-SAM's potential to serve as a general-purpose segmentation model without exhaustive retraining.
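The fusion of per-structure binary segmentations into one label map can be sketched as below. The abstract does not say how the fusion is done, so the rule used here (take the structure with the highest foreground probability at each voxel and fall back to background below a threshold) is an assumption, as are the function name and the flattened-volume layout.

```python
def fuse_binary_masks(prob_maps, threshold=0.5):
    """Fuse per-structure foreground probabilities into a single label map.

    prob_maps: dict mapping a positive integer label to a flat list of
               per-voxel foreground probabilities (all lists the same length).
    Returns a flat list of labels, with 0 meaning background.
    """
    labels = sorted(prob_maps)
    n_voxels = len(prob_maps[labels[0]])
    fused = []
    for i in range(n_voxels):
        # Pick the structure that is most confident at this voxel.
        best = max(labels, key=lambda lab: prob_maps[lab][i])
        fused.append(best if prob_maps[best][i] >= threshold else 0)
    return fused

# Two hypothetical structures over a three-voxel volume.
fused = fuse_binary_masks({1: [0.9, 0.2, 0.4], 2: [0.3, 0.8, 0.45]})
```

Because each structure is segmented independently, a new user-specified structure only adds one more entry to `prob_maps`; nothing else in the fusion step changes.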
This method thus constitutes a promising step toward clinically adaptable fetal brain MRI analysis tools.
[74] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps
Yuhan Chen,Ying Fang,Guofa Li,Wenxuan Yu,Yicui Shi,Jingrui Zhang,Kefei Qian,Wenbo Chu,Keqiang Li
Main category: cs.CV
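TL;DR: This paper proposes LL-GaussianMap, the first unsupervised framework to bring 2D Gaussian Splatting (2DGS) to low-light image enhancement. It generates gain maps through explicit structural modeling, preserving edges and suppressing artifacts while avoiding any dependence on paired data.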
Details
Motivation: Most existing low-light enhancement methods operate in the pixel domain or an implicit feature space and neglect the intrinsic geometric structural priors of images; 2DGS offers strong structural fitting and efficient rendering but has not yet been applied to low-level vision tasks. Method: Proposes a two-stage unsupervised framework: the first stage performs high-fidelity structural reconstruction with 2DGS; the second stage renders data-driven enhancement dictionary coefficients through the Gaussian rasterization mechanism to generate structure-aware gain maps. Result: Maintains strong enhancement performance while significantly reducing storage overhead; experiments confirm advantages in edge preservation and artifact suppression and demonstrate the effectiveness of explicit Gaussian representations for image enhancement. Conclusion: LL-GaussianMap is the first to successfully bring 2DGS to low-light enhancement, showing that explicit geometric structure modeling can improve low-level vision performance without paired training data. Abstract: Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning.
Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.
[75] LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting
Yuhan Chen,Wenxuan Yu,Guofa Li,Yijun Xu,Ying Fang,Yicui Shi,Long Cao,Wenbo Chu,Keqiang Li
Main category: cs.CV
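TL;DR: This paper proposes LL-GaussianImage, the first framework to perform zero-shot unsupervised low-light enhancement directly in the compressed representation domain of 2D Gaussian Splatting (2DGS), avoiding the conventional decompress-enhance-recompress pipeline and achieving "compression-as-enhancement" and "reconstruction-as-enhancement".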
Details
Motivation: Existing low-light enhancement methods operate mainly in the pixel domain, so handling 2DGS-compressed images requires a cumbersome decompression-enhancement-recompression pipeline that is inefficient and introduces secondary degradation; efficient, quality-preserving enhancement directly in the compressed representation domain is needed. Method: 1) A semantic-guided MoE enhancement framework applies dynamic adaptive transformations to the sparse attribute space of 2DGS under the guidance of rendered images; 2) a multi-objective collaborative loss constrains smoothness and fidelity to suppress artifacts; 3) a two-stage optimization uses single-scale reconstruction to secure the accuracy of the base representation and improve network robustness. Result: Achieves high-quality low-light enhancement while maintaining high compression ratios; experiments validate the feasibility and superiority of processing directly in the compressed representation domain. Conclusion: LL-GaussianImage establishes a new paradigm of end-to-end low-light enhancement within an explicit compressed representation (2DGS), balancing efficiency, quality, and compression ratio, and offers a new approach to jointly optimizing image compression and enhancement. Abstract: 2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image compression due to high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly within the pixel domain. Processing 2DGS-compressed images necessitates a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework designed for low-light enhancement directly within the 2DGS compressed representation domain. Three primary advantages are offered by this framework. First, a semantic-guided Mixture-of-Experts enhancement framework is designed. Dynamic adaptive transformations are applied to the sparse attribute space of 2DGS using rendered images as guidance to enable compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system is established to strictly constrain smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process is utilized to achieve reconstruction-as-enhancement. The accuracy of the base representation is ensured through single-scale reconstruction and network robustness is enhanced. High-quality enhancement of low-light images is achieved while high compression ratios are maintained.
The feasibility and superiority of the paradigm for direct processing within the compressed representation domain are validated through experimental results.
[76] Diffusion Model-Based Data Augmentation for Enhanced Neuron Segmentation
Liuyun Jiang,Yanchao Zhang,Jinyue Guo,Yizhuo Lu,Ruining Zhou,Hua Han
Main category: cs.CV
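TL;DR: This paper proposes NeuroDiff, a diffusion-based data augmentation framework for neuron segmentation in electron microscopy. A resolution-aware conditional diffusion model and a biology-guided mask remodeling module generate structurally diverse, realistic image-label pairs, significantly improving segmentation in low-annotation settings.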
Details
Motivation: Existing deep learning methods depend on large amounts of manually annotated data, and conventional augmentation produces samples with too little structural diversity to improve neuron segmentation. Method: Proposes a diffusion-based data augmentation framework: 1) a resolution-aware conditional diffusion model that combines multi-scale conditioning with EM resolution priors to synthesize voxel-level images from 3D masks; 2) a biology-guided mask remodeling module that makes mask structures more realistic. Result: In low-annotation settings on the AC3 and AC4 datasets, combined with two post-processing methods, the ARAND metric improves by 32.1% and 30.7%, respectively. Conclusion: The diffusion-based augmentation framework effectively alleviates annotation scarcity and improves neuron segmentation accuracy, with biological plausibility and structure-generation capability. Abstract: Neuron segmentation in electron microscopy (EM) aims to reconstruct the complete neuronal connectome; however, current deep learning-based methods are limited by their reliance on large-scale training data and extensive, time-consuming manual annotations. Traditional methods augment the training set through geometric and photometric transformations; however, the generated samples remain highly correlated with the original images and lack structural diversity. To address this limitation, we propose a diffusion-based data augmentation framework capable of generating diverse and structurally plausible image-label pairs for neuron segmentation. Specifically, the framework employs a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors to enable voxel-level image synthesis from 3D masks. It further incorporates a biology-guided mask remodeling module that produces augmented masks with enhanced structural realism. Together, these components effectively enrich the training set and improve segmentation performance. On the AC3 and AC4 datasets under low-annotation regimes, our method improves the ARAND metric by 32.1% and 30.7%, respectively, when combined with two different post-processing methods. Our code is available at https://github.com/HeadLiuYun/NeuroDiff.
[77] Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
Pascal Benschop,Justin Dauwels,Jan van Gemert
Main category: cs.CV
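TL;DR: This paper presents a synthetic video benchmark for assessing the fragility of vision-language models (VLMs) in situational awareness (e.g., recognizing violent versus non-violent activity) and spatial awareness (e.g., role binding, trajectory alignment). Experiments show that existing VLMs perform only slightly above chance and that simple color cues only partly alleviate the problem; the authors release data and code to encourage research on lightweight spatial priors.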
Details
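Motivation: Spatial reasoning in vision-language models remains fragile in scenes that depend on subtle temporal or geometric cues, and a reproducible, fine-grained diagnostic benchmark is needed to expose these limitations. Method: Builds a synthetic benchmark from minimal video pairs covering three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and fine-grained trajectory alignment; evaluates several VLMs in a zero-shot setting and introduces stable color cues as an auxiliary analysis tool. Result: Current mainstream VLMs perform only slightly above chance on every task; stable color cues partly reduce assailant role confusion but do not fix the underlying spatial reasoning deficiency. Conclusion: Existing VLMs are seriously lacking in joint spatial and situational reasoning, and lightweight spatial prior mechanisms should be explored to complement large-scale pretraining; this work provides an open-source benchmark for reproducible diagnosis and follow-up method development. Abstract: Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
[78] A Mobile Application for Flower Recognition System Based on Convolutional Neural Networks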
Mustafa Yurdakul,Enes Ayan,Fahrettin Horasan,Sakir Tasdemir
Main category: cs.CV
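TL;DR: This paper presents a CNN-based mobile application that lets non-specialists quickly identify flower species. Comparing three models (MobileNet, DenseNet121, and Xception) across seven optimization algorithms, it finds that DenseNet-121 trained with SGD performs best, with 95.84% accuracy and 96.00% precision, recall, and F1-score.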
Details
Motivation: Flower identification usually requires expert knowledge, but experts are not always available; a convenient, efficient mobile tool for automatic recognition is therefore needed. Method: Builds a CNN-based mobile application, compares MobileNet, DenseNet121, and Xception, and trains each with seven optimization algorithms to evaluate performance. Result: DenseNet-121 with the SGD optimizer achieves the best performance: 95.84% accuracy and 96.00% precision, recall, and F1-score. Conclusion: CNN models (DenseNet-121 in particular) are suitable for flower classification on mobile devices and can effectively help non-specialists identify flowers. Abstract: A convolutional neural network (CNN) is a deep learning algorithm that has been specifically designed for computer vision applications. CNNs have proved successful in handling the increasing amount of data in many computer vision problems, where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decorating to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge. However, accessing experts at any time and in any location may not always be feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers to provide non-specialists with quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet121, and Xception, to determine the most suitable model for the mobile application. The classification performances of the models were evaluated by training them with seven different optimization algorithms. The DenseNet-121 architecture, which uses the stochastic gradient descent (SGD) optimization algorithm, was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.
[79] Beyond Off-the-Shelf Models: A Lightweight and Accessible Machine Learning Pipeline for Ecologists Working with Image Data
Clare Chemery,Hendrik Edelhoff,Ludwig Bothmann
Main category: cs.CV
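TL;DR: This paper presents a lightweight machine learning experimentation pipeline that lets ecologists build task-specific image classifiers on their own without deep ML expertise. On red deer age and sex classification, it reaches 90.77% and 96.15% accuracy with only a few thousand images, demonstrating that narrow ecological questions can be answered with small datasets.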
Details
Motivation: To lower the barrier for ecologists applying machine learning to image classification, freeing them from reliance on off-the-shelf models so they can build custom models for local data and specific research questions (such as wildlife population demographics). Method: Designs a lightweight ML experimentation pipeline combining a command-line interface (preprocessing, training, evaluation) with a graphical interface (annotation, error analysis, model comparison); systematically tests multiple backbones, hyperparameters, and data augmentation strategies on a red deer camera-trap image dataset. Result: With 3392 original images and 4352 expert-labeled cropped images, the best models reach 90.77% accuracy for age classification and 96.15% for sex classification, showing that small, high-quality datasets suffice for reliable demographic classification. Conclusion: The framework gives ecologists a low-barrier, highly adaptable ML modeling tool and promotes the practical uptake of machine learning in wildlife monitoring and population analysis. Abstract: We introduce a lightweight experimentation pipeline designed to lower the barrier for applying machine learning (ML) methods for classifying images in ecological research. We enable ecologists to experiment with ML models independently, so that they can move beyond off-the-shelf models and generate insights tailored to local datasets and specific classification tasks and target variables. Our tool combines a simple command-line interface for preprocessing, training, and evaluation with a graphical interface for annotation, error analysis, and model comparison. This design enables ecologists to build and iterate on compact, task-specific classifiers without requiring advanced ML expertise. As a proof of concept, we apply the pipeline to classify red deer (Cervus elaphus) by age and sex from 3392 camera trap images collected in the Veldenstein Forest, Germany. Using 4352 cropped images containing individual deer labeled by experts, we trained and evaluated multiple backbone architectures with a wide variety of parameters and data augmentation strategies. Our best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification. These results demonstrate that reliable demographic classification is feasible even with limited data to answer narrow, well-defined ecological problems.
More broadly, the framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, paving the way for broader adoption of ML in wildlife monitoring and demographic analysis.
[80] Towards Realistic Remote Sensing Dataset Distillation with Discriminative Prototype-guided Diffusion
Yonghao Xu,Pedram Ghamisi,Qihao Weng
Main category: cs.CV
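TL;DR: This paper is the first to bring dataset distillation to remote sensing image interpretation. It uses a text-to-image diffusion model to condense large-scale remote sensing datasets and improves the discriminability and diversity of synthesized samples through classifier guidance and latent-space clustering.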
Details
Motivation: To address the high storage and computational costs and the risk of sensitive data leakage that come with deep learning's reliance on large-scale annotated remote sensing data. Method: Proposes a dataset distillation method based on a text-to-image diffusion model; introduces a classification consistency loss from a pre-trained classifier as classifier-driven guidance; clusters in latent space to select representative prototypes as visual style guidance, and uses a vision-language model to generate aggregated text descriptions. Result: Experiments on three high-resolution remote sensing scene classification benchmarks confirm the realism and diversity of the distilled samples, which significantly improve downstream model training. Conclusion: Dataset distillation can effectively reduce the dependence of remote sensing image analysis on large-scale data, lowering storage and computation costs and privacy risk while preserving performance. Abstract: Recent years have witnessed the remarkable success of deep learning in remote sensing image interpretation, driven by the availability of large-scale benchmark datasets. However, this reliance on massive training data also brings two major challenges: (1) high storage and computational costs, and (2) the risk of data leakage, especially when sensitive categories are involved. To address these challenges, this study introduces the concept of dataset distillation into the field of remote sensing image interpretation for the first time. Specifically, we train a text-to-image diffusion model to condense a large-scale remote sensing dataset into a compact and representative distilled dataset. To improve the discriminative quality of the synthesized samples, we propose a classifier-driven guidance by injecting a classification consistency loss from a pre-trained model into the diffusion training process. Besides, considering the rich semantic complexity of remote sensing imagery, we further perform latent space clustering on training samples to select representative and diverse prototypes as visual style guidance, while using a visual language model to provide aggregated text descriptions. Experiments on three high-resolution remote sensing scene classification benchmarks show that the proposed method can distill realistic and diverse samples for downstream model training. Code and pre-trained models are available online (https://github.com/YonghaoXu/DPD).
[81] An IoT-Based Smart Plant Monitoring and Irrigation System with Real-Time Environmental Sensing, Automated Alerts, and Cloud Analytics
Abdul Hasib,A. S. M. Ahsanul Sarkar Akib
Main category: cs.CV
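TL;DR: This paper presents an IoT-based smart plant monitoring system that uses an ESP32 and multiple sensors for real-time environmental sensing, automated irrigation, and cloud analytics. It markedly improves water-use efficiency (about 40% water savings) and soil moisture control accuracy (92%) at a cost of only $45.20, and suits both small-scale gardening and commercial agriculture.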
Details
Motivation: Global demand for sustainable agriculture is growing, while traditional farming relies on manual observation and periodic watering, which tends to waste water, produce uneven plant growth, and respond slowly to environmental changes. Method: An ESP32 microcontroller integrates DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, gives local feedback through an OLED display and a buzzer alarm, and wirelessly uploads data to the ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alerts; a companion web dashboard provides visualization. Result: The system maintains optimal soil moisture with 92% accuracy, provides real-time environmental monitoring, saves about 40% of water, and costs $45.20 in total. Conclusion: The system is a low-cost, scalable precision agriculture solution that is practical and widely applicable, from small-scale gardening to commercial agriculture. Abstract: The increasing global demand for sustainable agriculture necessitates intelligent monitoring systems that optimize resource utilization and plant health management. Traditional farming methods rely on manual observation and periodic watering, often leading to water wastage, inconsistent plant growth, and delayed response to environmental changes. This paper presents a comprehensive IoT-based smart plant monitoring system that integrates multiple environmental sensors with automated irrigation and cloud analytics. The proposed system utilizes an ESP32 microcontroller to collect real-time data from DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, with visual feedback through an OLED display and auditory alerts via a buzzer. All sensor data is wirelessly transmitted to the ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alert generation. Experimental results demonstrate the system's effectiveness in maintaining optimal soil moisture levels (with 92% accuracy), providing real-time environmental monitoring, and reducing water consumption by approximately 40% compared to conventional irrigation methods. The integrated web dashboard offers comprehensive visualization of plant health parameters, making it suitable for both small-scale gardening and commercial agriculture applications. With a total implementation cost of $45.20, this system provides an affordable, scalable solution for precision agriculture and smart farming.
[82] TinySense: Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing
Toan Gian,Dung T. Tran,Viet Quoc Pham,Francesco Restuccia,Van-Dinh Nguyen
Main category: cs.CV
TL;DR: TinySense is a novel compression framework for Wi-Fi-based human pose estimation that uses a VQGAN-learned codebook, K-means-based bitrate adaptation, and a Transformer to enhance robustness, achieving higher accuracy, lower latency, and reduced network overhead.
Details
Motivation: To address the high networking resource consumption caused by processing large amounts of raw CSI data in Wi-Fi sensing for human pose estimation, especially under device-free and privacy-preserving requirements. Method: TinySense employs a vector quantization-based generative adversarial network (VQGAN) to compress CSI data; uses K-means to dynamically cluster and adapt the pre-trained codebook for flexible bitrate control; and integrates a Transformer model to compensate for bitrate-induced information loss and improve robustness in unreliable networks. Result: TinySense achieves up to 1.5× higher HPE accuracy (PCK20), up to 5× lower latency, and up to 2.5× reduction in networking overhead compared to state-of-the-art compression schemes, validated on a Jetson Nano and Raspberry Pi testbed. Conclusion: TinySense enables scalable, efficient, and robust Wi-Fi sensing for human pose estimation by jointly optimizing compression, reconstruction fidelity, and network adaptability. Abstract: With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to cluster a large-scale pre-trained codebook into smaller subsets, which allows compression bitrates to be adjusted dynamically.
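The link between codebook size and bitrate can be sketched as follows. This is an illustrative toy under stated assumptions, not TinySense's implementation: the helper names, the deterministic initialization from the first k entries, and the two-dimensional toy codebook are all hypothetical. The idea shown is only that clustering a large codebook into k centroids lets each quantized vector be sent as an index of about log2(k) bits.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(codebook, k, iters=20):
    """Lloyd's algorithm; centroids start from the first k codebook entries."""
    centroids = [list(v) for v in codebook[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in codebook:
            nearest = min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            groups[nearest].append(v)
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def bits_per_index(codebook_size):
    """Bits needed to transmit one codebook index."""
    return max(1, math.ceil(math.log2(codebook_size)))

def quantize(vector, centroids):
    """Index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: sq_dist(vector, centroids[j]))

# Toy 8-entry codebook with two obvious groups, reduced to a 2-entry subset.
full = [[0.0, 0.0], [10.0, 10.0], [0.2, 0.0], [0.0, 0.2],
        [10.2, 10.0], [10.0, 10.2], [0.1, 0.1], [10.1, 10.1]]
small = kmeans(full, k=2)
index = quantize([9.0, 9.5], small)
```

In this toy setting the index cost drops from `bits_per_index(8)` to `bits_per_index(2)` per vector, at the price of coarser reconstruction; a sender could choose k according to the available bandwidth.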
Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead by up to 5x and 2.5x, respectively. The code repository is available online.
[83] A Lightweight Brain-Inspired Machine Learning Framework for Coronary Angiography: Hybrid Neural Representation and Robust Learning Strategies
Jingsong Xia,Siqi Wang
Main category: cs.CV
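TL;DR: This paper proposes a lightweight, brain-inspired deep learning framework for binary classification of coronary angiography (CAG) images that balances high accuracy with high computational efficiency under real-world challenges such as complex lesions, label uncertainty, and class imbalance.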
Details
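Motivation: In real-world clinical practice, coronary angiography images involve complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, which leave conventional deep learning methods short on robustness and generalization. Method: Builds a lightweight hybrid neural representation on a pretrained CNN; introduces a selective neural plasticity training strategy for efficient parameter adaptation; designs a brain-inspired attention-modulated loss function that combines Focal Loss with label smoothing; and pairs class-imbalance-aware sampling with cosine annealing with warm restarts. Result: Achieves competitive accuracy, recall, F1-score, and AUC on the binary classification task while maintaining high computational efficiency. Conclusion: Validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically interpretable, deployable solution for intelligent clinical decision support in resource-constrained settings. Abstract: Background: Coronary angiography (CAG) is a cornerstone imaging modality for assessing coronary artery disease and guiding interventional treatment decisions. However, in real-world clinical settings, angiographic images are often characterized by complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, posing substantial challenges to conventional deep learning approaches in terms of robustness and generalization. Methods: The proposed framework is built upon a pretrained convolutional neural network to construct a lightweight hybrid neural representation. A selective neural plasticity training strategy is introduced to enable efficient parameter adaptation. Furthermore, a brain-inspired attention-modulated loss function, combining Focal Loss with label smoothing, is employed to enhance sensitivity to hard samples and uncertain annotations.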
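One common way to combine Focal Loss with label smoothing is sketched below. The abstract does not give the exact formulation, so the function name, the default `gamma` and `eps` values, and the choice to apply the focal factor to every class term are assumptions for illustration only.

```python
import math

def focal_loss_with_label_smoothing(probs, target, gamma=2.0, eps=0.1):
    """Loss for one sample.

    probs  : predicted class probabilities (each strictly > 0, summing to 1)
    target : index of the annotated class
    gamma  : focusing parameter; larger values down-weight confident predictions
    eps    : label-smoothing mass spread uniformly over all classes
    """
    n = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        # Smoothed target: 1 - eps on the annotated class, plus eps / n everywhere.
        q = (1.0 - eps if c == target else 0.0) + eps / n
        # Focal factor (1 - p)^gamma scales the usual cross-entropy term.
        loss += -q * (1.0 - p) ** gamma * math.log(p)
    return loss

# With gamma = 0 and eps = 0 the loss reduces to plain cross-entropy.
ce = focal_loss_with_label_smoothing([0.9, 0.1], target=0, gamma=0.0, eps=0.0)
```

The focal factor shifts weight toward hard samples, while the smoothed targets keep the model from treating a possibly wrong annotation as certain.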
Class-imbalance-aware sampling and cosine annealing with warm restarts are adopted to mimic rhythmic regulation and attention allocation mechanisms observed in biological neural systems. Results: Experimental results demonstrate that the proposed lightweight brain-inspired model achieves strong and stable performance in binary coronary angiography classification, yielding competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency. Conclusion: This study validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible and deployable solution for intelligent clinical decision support under limited computational resources.
[84] Out-of-Distribution Detection Based on Total Variation Estimation
Dabiao Ma,Zhiba Su,Jian Yang,Haojun Fei
Main category: cs.CV
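TL;DR: This paper proposes TV-OOD, a new out-of-distribution detection method that uses the Total Variation Network Estimator to compute each input's contribution to the total variation, thereby separating in-distribution from out-of-distribution data, and performs strongly on image classification tasks.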
Details
Motivation: Existing out-of-distribution detection methods are effective but still leave room for improvement under the distribution shifts encountered in practice, which calls for more robust and interpretable detection mechanisms. Method: Proposes TV-OOD, which uses the Total Variation Network Estimator to compute a total variation score for each input and uses that score to discriminate between in- and out-of-distribution data. Result: Experiments across a range of models and datasets show that TV-OOD is comparable or superior to leading OOD detection methods on all evaluation metrics. Conclusion: TV-OOD is an effective, general, and high-performing out-of-distribution detection method, suited to safeguarding machine learning model deployments under distribution shift. Abstract: This paper introduces a novel approach to securing machine learning model deployments against potential distribution shifts in practical applications, the Total Variation Out-of-Distribution (TV-OOD) detection method. Existing methods have produced satisfactory results, but TV-OOD improves upon these by leveraging the Total Variation Network Estimator to calculate each input's contribution to the overall total variation. By defining this as the total variation score, TV-OOD discriminates between in- and out-of-distribution data. The method's efficacy was tested across a range of models and datasets, consistently yielding results in image classification tasks that were either comparable or superior to those achieved by leading-edge out-of-distribution detection techniques across all evaluation metrics.
[85] PMPBench: A Paired Multi-Modal Pan-Cancer Benchmark for Medical Image Synthesis
Yifan Chen,Fei Yin,Hao Chen,Jia Wu,Chao Li
Main category: cs.CV
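TL;DR: This paper presents PMPBench, the first public, fully paired, pan-cancer medical imaging dataset covering 11 human organs. It supports multi-phase pairing of dynamic contrast-enhanced MR (DCE) and non-contrast/contrast-enhanced CT (CT/CTC) for AI-driven contrast-free image synthesis, and it comes with a benchmark platform.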
Details
Motivation: Research on AI synthesis of contrast-enhanced images is limited by data scarcity: public datasets are mostly restricted to paired brain MR data, while other collections suffer from incomplete pairing, temporal or spatial misalignment, missing modality labels, and large amounts of private, unreleased resources. Method: Builds the first public pan-cancer dataset that is fully paired, cross-organ, multi-modal (MR DCE1-DCE3, CT/CTC), and anatomically aligned; on top of it, establishes a comprehensive benchmark covering 1-to-1, N-to-1, and N-to-N translation tasks and evaluates mainstream image-to-image translation models. Result: Releases the PMPBench dataset and benchmark, which include MR dynamic contrast-enhanced sequences and paired non-contrast/contrast-enhanced CT data and support multi-organ, multi-phase contrast synthesis tasks; provides benchmark results for representative models; code and dataset are fully open source. Conclusion: PMPBench fills the gap in pan-cancer, fully paired medical imaging data, provides key infrastructure for research on safe and efficient AI-assisted contrast synthesis, and advances the clinical translation of multi-organ oncology imaging workflows. Abstract: Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient's health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation.
We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at https://github.com/YifanChen02/PMPBench.
[86] Understanding the Transfer Limits of Vision Foundation Models
Shiqi Huang,Yipei Wang,Natasha Thorley,Alexander Ng,Shaheer Saeed,Mark Emberton,Shonit Punwani,Veeru Kasivisvanathan,Dean Barratt,Daniel Alexander,Yipeng Hu
Main category: cs.CV
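TL;DR: This paper examines why vision foundation models (VFMs) perform unevenly across downstream tasks, argues that the main cause is a mismatch between pretraining objectives and downstream task requirements, and confirms the importance of aligning pretraining with downstream tasks on prostate multiparametric MRI tasks.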
Details
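Motivation: Vision foundation models (VFMs) are pretrained with substantial compute yet perform unevenly on downstream tasks; the authors attribute this to a mismatch between pretraining objectives (such as masked image reconstruction or contrastive learning) and the specific requirements of downstream tasks (such as segmentation, classification, and image synthesis). Method: The authors evaluate two VFMs, the MAE-based reconstruction model ProFound and the contrastive-learning-based ProViCNet, on five prostate multiparametric MRI tasks; they use simple divergence metrics such as maximum mean discrepancy (MMD) to quantify the alignment between pretraining and downstream tasks, and analyze its relationship to transfer performance (fine-tuning results and convergence speed). Result: Experiments show that better alignment between pretraining and downstream tasks, as measured by MMD, is associated with markedly better fine-tuning performance and faster convergence; task alignment can therefore serve as a useful indicator for predicting and guiding VFM transfer. Conclusion: Pretraining objectives should be designed with more attention to their fit with downstream tasks; task alignment is both a key factor in transfer performance and a useful basis for model selection and for optimizing pretraining strategies. Abstract: Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning.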
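To make the MMD measure named in the Method summary concrete, here is a minimal sketch of the standard biased estimator of squared MMD with an RBF kernel. The kernel choice, the fixed bandwidth `sigma`, and the toy feature vectors are assumptions; the paper's own feature extraction and kernel settings are not stated in this digest.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Hypothetical features of the same inputs before and after fine-tuning.
before = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
after_small_shift = [(x + 0.1, y + 0.1) for x, y in before]
after_large_shift = [(x + 3.0, y + 3.0) for x, y in before]
```

Under this reading, a small value means fine-tuning barely moved the features, which is one plausible interpretation of good pretraining-to-task alignment.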
Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
[87] RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture
Anas Anwarul Haq Khan,Mariam Husain,Kshitij Jadhav
Main category: cs.CV
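TL;DR: RadJEPA is a self-supervised framework for medical visual representation learning that needs no language supervision. Built on a Joint Embedding Predictive Architecture, it is pretrained on unlabeled chest X-ray images by predicting latent representations of masked image regions, and it substantially improves downstream performance on disease classification, semantic segmentation, and report generation.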
Details
Motivation: Existing medical vision-language models depend on paired image-text data, which is scarce; this paper explores whether a robust radiology image encoder can be learned without language supervision. Method: Proposes the RadJEPA framework, based on a Joint Embedding Predictive Architecture (JEPA), which is pretrained in a self-supervised manner on unlabeled chest X-ray images only, with the objective of predicting latent-space representations of masked image regions, in contrast to image-text alignment or DINO-style self-distillation. Result: On disease classification, semantic segmentation, and report generation, RadJEPA outperforms current state-of-the-art methods, including Rad-DINO. Conclusion: Self-supervised latent-space prediction without language supervision can effectively improve radiology image encoders, offering a new paradigm for medical visual representation learning. Abstract: Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
[88] ThermoSplat: Cross-Modal 3D Gaussian Splatting with Feature Modulation and Geometry Decoupling
Zhaoqi Su,Shihai Chen,Xinyan Lin,Liqin Huang,Zhipeng Su,Xiaoqiang Lu
Main category: cs.CV
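TL;DR: This paper proposes ThermoSplat, a framework that uses cross-modal FiLM modulation and modality-adaptive geometry decoupling to achieve deep spectral-aware 3D Gaussian Splatting reconstruction from RGB and thermal infrared data, reaching state-of-the-art rendering quality in both the visible and thermal spectrums on the RGBT-Scenes dataset.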
Details
Motivation: Existing 3D Gaussian Splatting methods struggle to fuse RGB and thermal infrared data effectively, often neglecting cross-modal correlations or failing to adaptively handle the structural correlations and physical discrepancies between spectrums. Method: Proposes the ThermoSplat framework: 1) a Cross-Modal FiLM Modulation mechanism that uses thermal structural priors to dynamically condition shared latent features and guide visible texture synthesis; 2) a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets for the thermal branch and runs an independent rasterization pass; 3) a hybrid rendering pipeline that combines explicit Spherical Harmonics with implicit neural decoding. Result: On the RGBT-Scenes dataset, ThermoSplat achieves state-of-the-art rendering quality in both the visible and thermal infrared spectrums. Conclusion: Through spectral-aware feature modulation and geometry decoupling, ThermoSplat markedly improves the robustness and fidelity of multi-spectral scene reconstruction, offering a new paradigm for multi-modal 3D perception in complex environments. Abstract: Multi-modal scene reconstruction integrating RGB and thermal infrared data is essential for robust environmental perception across diverse lighting and weather conditions. However, extending 3D Gaussian Splatting (3DGS) to multi-spectral scenarios remains challenging. Current approaches often struggle to fully leverage the complementary information of multi-modal data, typically relying on mechanisms that either tend to neglect cross-modal correlations or leverage shared representations that fail to adaptively handle the complex structural correlations and physical discrepancies between spectrums. To address these limitations, we propose ThermoSplat, a novel framework that enables deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling. First, we introduce a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors, effectively guiding visible texture synthesis with reliable cross-modal geometric cues. Second, to accommodate modality-specific geometric inconsistencies, we propose a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and executes an independent rasterization pass for the thermal branch. Additionally, a hybrid rendering pipeline is employed to integrate explicit Spherical Harmonics with implicit neural decoding, ensuring both semantic consistency and high-frequency detail preservation.
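For readers unfamiliar with FiLM (feature-wise linear modulation), the generic operation is a per-channel scale and shift predicted from a conditioning signal. The sketch below shows only that generic operation; the layer sizes, weights, and the idea of passing a thermal prior as `cond` are illustrative assumptions, not ThermoSplat's actual architecture.

```python
def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each feature channel using the conditioning vector.

    gamma and beta are predicted from `cond` (here by single linear layers),
    then applied channel-wise: out[i] = gamma[i] * features[i] + beta[i].
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

# Hypothetical example: 2 shared feature channels, 1-dimensional thermal prior.
shared = [3.0, 4.0]
thermal_prior = [2.0]
out = film(shared, thermal_prior,
           w_gamma=[[1.0], [0.5]], b_gamma=[0.0, 0.0],
           w_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0])
```

With zero weights, `b_gamma` set to ones, and `b_beta` set to zeros, the layer is the identity, which is a common way to initialize such a modulation so that training starts from the unconditioned features.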
Extensive experiments on the RGBT-Scenes dataset demonstrate that ThermoSplat achieves state-of-the-art rendering quality across both visible and thermal spectrums.
[89] Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models
Zhen Zhang,Runhao Zeng,Sicheng Zhao,Xiping Hu
Main category: cs.CV
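TL;DR: Through a systematic mechanistic study, this paper finds that the core structure for affective modeling in multimodal foundation models lies in the gating projection of the feed-forward network (gate_proj) rather than in the attention module; fine-tuning only this module yields affective-task performance close to that of full-parameter fine-tuning, substantially improving parameter efficiency.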
Details
Motivation: Although current affective models perform well, the internal architectural mechanisms that support affective understanding and generation remain unclear, and how emotions are represented inside the model is still an open question, particularly in multimodal affective modeling. Method: Across multiple architectures, training strategies, and affective tasks, analyzes how emotion-oriented supervision reshapes internal model parameters; verifies the role of gate_proj through controlled module transfer, targeted single-module adaptation, and destructive ablation. Result: Affective adaptation localizes mainly to the feed-forward gating projection (gate_proj) rather than the attention module; tuning only about 24.5% of the parameters tuned by AffectGPT reaches 96.6% of its average task performance; gate_proj is shown to be a sufficient, efficient, and necessary component for affective understanding and generation. Conclusion: Affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms, and gate_proj is a key architectural locus of affective modeling. Abstract: Understanding where and how emotions are represented in large-scale foundation models remains an open problem, particularly in multimodal affective settings. Despite the strong empirical performance of recent affective models, the internal architectural mechanisms that support affective understanding and generation are still poorly understood. In this work, we present a systematic mechanistic study of affective modeling in multimodal foundation models. Across multiple architectures, training strategies, and affective tasks, we analyze how emotion-oriented supervision reshapes internal model parameters. Our results consistently reveal a clear and robust pattern: affective adaptation does not primarily focus on the attention module, but instead localizes to the feed-forward gating projection (gate_proj). Through controlled module transfer, targeted single-module adaptation, and destructive ablation, we further demonstrate that gate_proj is sufficient, efficient, and necessary for affective understanding and generation. Notably, by tuning only approximately 24.5% of the parameters tuned by AffectGPT, our approach achieves 96.6% of its average performance across eight affective tasks, highlighting substantial parameter efficiency. Together, these findings provide empirical evidence that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identify gate_proj as a central architectural locus of affective modeling.
[90] The Latency Wall: Benchmarking Off-the-Shelf Emotion Recognition for Real-Time Virtual Avatars
Yarin Benyamin
Main category: cs.CV
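TL;DR: This paper studies the feasibility of real-time emotion recognition support for individuals with Autism Spectrum Disorder (ASD) in VR and finds that existing general-purpose deep learning models struggle to meet the dual requirements of low latency (<140 ms) and high accuracy, with a "Latency Wall" in the classification stage in particular. YOLOv11n performs best in the detection stage, while general-purpose Vision Transformers such as CLIP fall short in both accuracy and speed, so lightweight, domain-specific architectures are needed.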
Details
Motivation: Building accessible, real-time VR-assisted therapy systems for individuals with ASD requires meeting a strict motion-to-photon (MTP) latency constraint (<140 ms), but mainstream deep learning models favor accuracy over real-time performance. Method: Benchmarks zero-shot facial expression recognition (FER) on virtual characters using the UIBVFED dataset, evaluating the YOLO family (Medium and Nano variants of v8, v11, and v12) for face detection and general-purpose Vision Transformers such as CLIP, SigLIP, and ViT-FER for expression classification; all experiments use CPU-only inference. Result: Face detection on stylized avatars reaches 100% accuracy, with YOLOv11n taking about 54 ms for detection; the classification stage, however, hits a "Latency Wall", with CLIP and SigLIP below 23% accuracy and above 150 ms inference time, which rules them out for real-time closed-loop use. Conclusion: General-purpose large models struggle to deliver both low latency and usable accuracy in VR therapy scenarios; lightweight architectures dedicated to facial expression recognition on virtual characters are needed to bring accessible, real-time AI into clinical rehabilitation. Abstract: In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and ViT-FER. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
[91] A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery
Valery Fischer,Alan Magdaleno,Anna-Katharina Calek,Nicola Cavalcanti,Nathan Hoffman,Christoph Germann,Joschua Wüthrich,Max Krähenmann,Mazda Farshad,Philipp Fürnstahl,Lilian Calvet
Main category: cs.CV
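TL;DR: This paper proposes a multi-view 3D hand pose estimation method that requires no domain-specific fine-tuning and relies only on off-the-shelf pretrained models, and it builds the first large-scale annotated hand dataset for surgical scenes, substantially improving 2D and 3D pose estimation accuracy.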
Details
Motivation: Surgical environments have intense, uneven lighting, frequently occluded hands, and uniform hand appearance due to gloves; together with the lack of high-quality annotated data, this makes existing 3D hand pose estimation algorithms hard to apply. Method: Designs an end-to-end multi-view pipeline that combines person detection, whole-body pose estimation, and 2D keypoint prediction on cropped hand regions, followed by constrained 3D optimization to obtain the final result; also builds a surgery-specific benchmark dataset with 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth. Result: Compared with baselines, 2D mean joint error is reduced by 31% and 3D mean per-joint position error by 76%. Conclusion: This work provides a practical, training-free solution and the first large-scale, high-quality annotated dataset for 3D hand pose estimation in surgical scenes, establishing a new baseline for research in this direction. Abstract: Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
[92] Class Confidence Aware Reweighting for Long Tailed Learning
Brainard Philemon Jagati,Jitendra Tembhurne,Harsh Goud,Rudra Pratap Singh,Chandrashekhar Meshram
Main category: cs.CV
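TL;DR: This paper proposes a loss-level, class- and confidence-aware reweighting scheme to counter the performance degradation of deep neural networks under long-tailed data distributions; the scheme is complementary to existing logit-adjustment methods, and its effectiveness is validated on several long-tailed datasets.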
Details
Motivation: Deep neural networks degrade significantly under long-tailed data distributions; existing methods focus mainly on adjustments in the decision space (for example at the logit level) to compensate for class-prior bias, and overlook the effect that differences in sample confidence have on optimization. Method: Designs a class- and confidence-aware reweighting scheme that operates purely at the loss level, using an Ω(p_t, f_c) function to dynamically modulate each sample's contribution to training according to the prediction confidence and the relative frequency of its class. Result: On several long-tailed datasets, including CIFAR-100-LT, ImageNet-LT, and iNaturalist2018, the scheme yields significant improvements under different imbalance factors; the experiments confirm the method's effectiveness and agree with the theoretical analysis. Conclusion: The proposed reweighting scheme is an effective long-tailed learning strategy that complements logit-adjustment methods, and it underlines the importance of modeling class frequency and sample confidence jointly during optimization. Abstract: Deep neural network models degrade significantly under long-tailed data distributions, where the training data is dominated by a small set of head classes and the tail classes receive fewer training examples. To address class imbalance, the related literature has focused mainly on adjustments in the decision space, such as logit-level corrections that compensate for class-prior bias, and has paid little attention to how the optimization process is affected by differences in confidence among samples. In the current study, we present the design of a class- and confidence-aware re-weighting scheme for long-tailed learning. This scheme operates purely at the loss level and is complementary to existing methods that adjust the logits. In practice, the scheme uses an Ω(p_t, f_c) function, which modulates each sample's contribution to training based on the confidence of the prediction and the relative frequency of the corresponding class. Experiments on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors yield significant results that corroborate our observations and support the theoretical discussion.
[93] NeuroMamba: Multi-Perspective Feature Interaction with Visual Mamba for Neuron Segmentation
Liuyun Jiang,Yizhuo Lu,Yanchao Zhang,Jiazheng Liu,Hua Han
Main category: cs.CV
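TL;DR: This paper proposes NeuroMamba, a Mamba-based multi-perspective neuron segmentation framework that combines global modeling with local detail preservation and achieves SOTA performance on multiple electron microscopy datasets.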
Details
Motivation: Existing CNN methods lack long-range context, which leads to ambiguous boundaries, while Transformer methods lose voxel-level detail during patch partitioning, which leads to imprecise boundaries. Method: Proposes the NeuroMamba framework: 1) a channel-gated Boundary Discriminative Feature Extractor (BDFE) that enhances local morphological cues; 2) a Spatial Continuous Feature Extractor (SCFE) with a resolution-aware scanning mechanism that adapts global modeling to different resolutions; 3) a cross-modulation mechanism that fuses the multi-perspective features. Result: Achieves SOTA performance on four public EM datasets, confirming strong adaptability to both anisotropic and isotropic resolutions. Conclusion: NeuroMamba uses Mamba for patch-free global modeling in concert with fine-grained local modeling, effectively balancing long-range dependency capture with voxel-level detail preservation and markedly improving neuron segmentation accuracy. Abstract: Neuron segmentation is the cornerstone of reconstructing comprehensive neuronal connectomes, which is essential for deciphering the functional organization of the brain. The irregular morphology and densely intertwined structures of neurons make this task particularly challenging. Prevailing CNN-based methods often fail to resolve ambiguous boundaries due to the lack of long-range context, whereas Transformer-based methods suffer from boundary imprecision caused by the loss of voxel-level details during patch partitioning. To address these limitations, we propose NeuroMamba, a multi-perspective framework that exploits the linear complexity of Mamba to enable patch-free global modeling and synergizes this with complementary local feature modeling, thereby efficiently capturing long-range dependencies while meticulously preserving fine-grained voxel details. Specifically, we design a channel-gated Boundary Discriminative Feature Extractor (BDFE) to enhance local morphological cues. Complementing this, we introduce the Spatial Continuous Feature Extractor (SCFE), which integrates a resolution-aware scanning mechanism into the Visual Mamba architecture to adaptively model global dependencies across varying data resolutions. Finally, a cross-modulation mechanism synergistically fuses these multi-perspective features. Our method demonstrates state-of-the-art performance across four public EM datasets, validating its exceptional adaptability to both anisotropic and isotropic resolutions. The source code will be made publicly available.
[94] EVolSplat4D: Efficient Volume-based Gaussian Splatting for 4D Urban Scene Synthesis
Sheng Miao,Sijin Li,Pan Wang,Dongfeng Bai,Bingbing Liu,Yue Wang,Andreas Geiger,Yiyi Liao
Main category: cs.CV
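TL;DR: EvolSplat4D is a feed-forward novel view synthesis method for dynamic and static urban scenes that uses a three-branch design to handle close-range static regions, dynamic actors, and far-field scenery in a unified way, balancing efficiency and quality and outperforming existing optimization-based and feed-forward methods on multiple datasets.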
Details
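Motivation: Existing neural radiance field and 3D Gaussian Splatting methods require time-consuming per-scene optimization, while emerging feed-forward methods mostly adopt per-pixel Gaussian representations, which cause 3D inconsistencies in complex dynamic environments. Method: Proposes the EvolSplat4D feed-forward framework with three specialized branches: 1) predicts multi-frame-consistent 3D Gaussian geometry for close-range static regions from a 3D feature volume, with a semantically-enhanced image-based rendering module predicting appearance; 2) aggregates temporal features in object-centric canonical spaces with a motion-adjusted rendering module for robust 4D dynamic reconstruction; 3) handles far-field scenery with an efficient per-pixel Gaussian branch to ensure full-scene coverage. Result: On the KITTI-360, KITTI, Waymo, and PandaSet datasets, EvolSplat4D surpasses both per-scene optimization methods and state-of-the-art feed-forward baselines in the accuracy and consistency of static and dynamic environment reconstruction. Conclusion: EvolSplat4D overcomes the efficiency-quality trade-off in feed-forward novel view synthesis and provides a more efficient and reliable way to model dynamic urban scenes for autonomous driving simulation. Abstract: Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage.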
Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
[95] HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models
Xin Xie,Jiaxian Guo,Dong Gong
Main category: cs.CV
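TL;DR: This paper proposes HyperAlign, a framework that trains a hypernetwork to dynamically generate low-rank adaptation weights at test time to modulate a diffusion model's generation process, improving the semantic consistency and visual appeal of generated images without sacrificing diversity or adding excessive computational overhead.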
Details
Motivation: Diffusion models perform strongly but often generate images that do not match human preferences and intentions, with poor aesthetic quality and semantic inconsistencies; existing alignment methods struggle to avoid both diversity loss and high computational cost. Method: Proposes the HyperAlign framework, in which a hypernetwork dynamically generates low-rank adaptation weights at test time to modulate the diffusion model's denoising operations; the adaptation weights depend on the input latents, timesteps, and prompts; several variants differ in how frequently the hypernetwork is applied, and the hypernetwork is optimized with a reward score objective regularized with preference data. Result: Evaluated on models including Stable Diffusion and FLUX, HyperAlign significantly outperforms existing fine-tuning and test-time scaling baselines, improving both semantic consistency and visual appeal. Conclusion: HyperAlign achieves efficient, flexible, and robust test-time alignment, effectively easing the trade-off between reward over-optimization and computational cost, and offers a new paradigm for aligning diffusion models with human intent. Abstract: Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model's generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
[96] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
Tingyu Song,Yanzhao Zhang,Mingxin Li,Zhuoning Guo,Dingkun Long,Pengjun Xie,Siyue Zhang,Yilun Zhao,Shu Wu
Main category: cs.CV
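TL;DR: This paper proposes EDIR, a fine-grained composed image retrieval benchmark generated through image editing, with 5,000 high-quality queries; it reveals performance gaps of existing models across diverse categories and analyzes the limitations of current benchmarks.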
Details
Motivation: Existing CIR benchmarks cover a limited set of query categories and cannot reflect the diverse requirements of real-world scenarios; a more comprehensive, controllable, and fine-grained evaluation benchmark is needed. Method: Uses image editing to precisely control modification types and content, builds a query synthesis pipeline covering a broad range of categories, and uses it to create the EDIR benchmark; systematically evaluates 13 multimodal embedding models and runs an in-domain training experiment to analyze where the task is difficult. Result: EDIR contains 5,000 queries in 5 main categories and 15 subcategories; SOTA models (e.g., RzenEmbed and GME) perform inconsistently across subcategories; existing benchmarks are found to suffer from modality biases and insufficient categorical coverage; in-domain training shows that some subcategories can be improved with targeted data while others expose intrinsic limitations of model architectures. Conclusion: EDIR is a more challenging and representative CIR benchmark that exposes the limits of model capability and can drive the development of more robust and balanced multimodal understanding models. Abstract: Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.
[97] PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
Chak-Wing Mak,Guanyu Zhu,Boyi Zhang,Hongji Li,Xiaowei Chi,Kevin Zhang,Yichen Wu,Yangfan He,Chun-Kai Fan,Wentao Lu,Kuangzhi Ge,Xinyu Fang,Hongyang He,Kuan Lu,Tianxiang Xu,Li Zhang,Yongxin Ni,Youhua Li,Shanghang Zhang
Main category: cs.CV
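TL;DR: This paper proposes the PhysicsMind benchmark for evaluating how well multimodal large models and video world models understand physical laws (center of mass, lever equilibrium, Newton's first law). It covers both visual question answering and video generation tasks, and shows that current models still rely on appearance heuristics and violate basic mechanics.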
Details
Motivation: Existing benchmarks cannot effectively measure a model's understanding of physical laws; they mostly rely on synthetic data or perceptual quality and lack a unified evaluation of physically consistent reasoning and generation. Method: Builds the PhysicsMind benchmark with both real-world scenes and simulation environments, and designs two task types, VQA (reasoning about physical quantities) and VG (physically constrained video generation), covering three classic principles: center of mass, lever equilibrium, and Newton's first law. Result: Evaluating a range of recent MLLMs and video generation models on PhysicsMind shows that they generally rely on appearance heuristics and frequently violate basic mechanical constraints, indicating that current scaling and training strategies are not yet sufficient for robust physical understanding. Conclusion: PhysicsMind provides a focused, extensible testbed for physics-aware multimodal models and underscores the need to improve models' intrinsic physical reasoning. Abstract: Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure this rely on synthetic Visual Question Answering templates or focus on perceptual video quality, which is tangential to measuring how well a video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent MLLMs and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
[98] Keyframe-Based Feed-Forward Visual Odometry
Weichen Dai,Wenhan Su,Da Kong,Yuhang Ming,Wanzeng Kong
Main category: cs.CV
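TL;DR: This paper proposes a reinforcement-learning-based adaptive keyframe selection policy that improves feed-forward visual odometry (VO) driven by visual foundation models, raising efficiency and accuracy while keeping the pipeline end-to-end.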
Details
Motivation: Existing VO/SLAM methods built on visual foundation models (such as VGGT-Long) process raw image sequences directly and lack a keyframe mechanism, which causes computational redundancy and degraded performance from low inter-frame parallax; traditional geometric heuristics are also hard to integrate into foundation models that depend on high-dimensional latent representations. Method: A keyframe-based feed-forward VO framework that uses reinforcement learning to learn an adaptive keyframe selection policy in a data-driven way, matching the policy to the intrinsic representation characteristics of the foundation model rather than relying on hand-crafted rules; it is trained on the TartanAir dataset and evaluated on several real-world datasets. Result: Experiments show consistent and substantial improvements over state-of-the-art feed-forward VO methods on several real-world datasets. Conclusion: Bringing reinforcement learning into keyframe selection effectively bridges geometric priors and visual foundation models, yielding efficient, accurate feed-forward VO and a new paradigm for next-generation end-to-end visual navigation systems. Abstract: The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
[99] PAINT: Pathology-Aware Integrated Next-Scale Transformation for Virtual Immunohistochemistry
Rongze Ma,Mengkang Lu,Zhenyu Xiang,Yongsheng Pan,Yicheng Wu,Qingjie Zeng,Yong Xia
Main category: cs.CV
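TL;DR: This paper proposes the PAINT framework, a structure-first conditional generation approach that uses a Spatial Structural Start Map (3S-Map) to guide visual autoregressive synthesis of virtual immunohistochemistry images, substantially improving structural fidelity and performance on clinical tasks.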
Details
Motivation: Conventional virtual IHC methods lack sufficient structural priors, which leads to semantic inconsistencies; the morphological cues in H&E images are ambiguous indicators of protein expression, and similar structures may correspond to different molecular states. Method: Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that models synthesis as a structure-first conditional generation task and introduces the Spatial Structural Start Map (3S-Map) as the basis for autoregressive initialization, ensuring spatially aligned, deterministic synthesis. Result: On the IHC4BC and MIST datasets, PAINT outperforms existing state-of-the-art methods in structural fidelity and in clinical downstream tasks (such as molecular subtype classification and prognosis prediction). Conclusion: Structure-guided autoregressive modeling is an effective paradigm for improving the quality and clinical applicability of virtual IHC synthesis, and the 3S-Map provides a reliable structural anchor for the morphology-to-molecule mapping. Abstract: Virtual immunohistochemistry (IHC) aims to computationally synthesize molecular staining patterns from routine Hematoxylin and Eosin (H&E) images, offering a cost-effective and tissue-efficient alternative to traditional physical staining. However, this task is particularly challenging: H&E morphology provides ambiguous cues about protein expression, and similar tissue structures may correspond to distinct molecular states. Most existing methods focus on direct appearance synthesis to implicitly achieve cross-modal generation, often resulting in semantic inconsistencies due to insufficient structural priors. In this paper, we propose Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that reformulates the synthesis process as a structure-first conditional generation task. Unlike direct image translation, PAINT enforces a causal order by resolving molecular details conditioned on a global structural layout. Central to this approach is the introduction of a Spatial Structural Start Map (3S-Map), which grounds the autoregressive initialization in observed morphology, ensuring deterministic, spatially aligned synthesis. Experiments on the IHC4BC and MIST datasets demonstrate that PAINT outperforms state-of-the-art methods in structural fidelity and clinical downstream tasks, validating the potential of structure-guided autoregressive modeling.
[100] ProGiDiff: Prompt-Guided Diffusion-Based Medical Image Segmentation
Yuan Lin,Murong Xu,Marc Hölle,Chinmay Prabhakar,Andreas Maier,Vasileios Belagiannis,Bjoern Menze,Suprosanna Shit
Main category: cs.CV
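TL;DR: This paper proposes the ProGiDiff framework, which combines a pre-trained diffusion model with a ControlNet-style conditioning mechanism to perform multi-class medical image segmentation, with support for natural language prompts and cross-modality transfer.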
Details
Motivation: Existing medical image segmentation methods lack support for natural language prompts, multi-proposal generation, and cross-modality adaptation; text-to-image diffusion models show promise but need large amounts of data and are hard to apply to multi-class segmentation and language guidance. Method: The ProGiDiff framework designs a ControlNet-style image conditioning mechanism with a custom encoder that steers a pre-trained diffusion model to generate segmentation masks; it lets users specify the target organ with a natural language prompt, which enables multi-class segmentation, and it uses low-rank, few-shot adaptation for cross-modality (CT to MR) transfer. Result: The method outperforms previous approaches on CT organ segmentation, supports multi-proposal selection with an expert in the loop, and transfers effectively to MR image segmentation after fine-tuning on a small amount of MR data. Conclusion: ProGiDiff offers a promptable, multi-class, cross-modality, and interactive paradigm for medical image segmentation, substantially improving clinical practicality and generalization. Abstract: Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they lack the capability to estimate multiple proposals, human interaction, and cross-modality adaptation. Recently, text-to-image diffusion models have shown potential to bridge the gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation purposes. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting the target organ. Our experiment on organ segmentation from CT images demonstrates strong performance compared to previous methods and could greatly benefit from an expert-in-the-loop setting to leverage multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
[101] DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Chenyang Li,Jieyuan Liu,Bin Li,Bo Gao,Yilin Yuan,Yangfan He,Yuchen Li,Jingqun Tang
Main category: cs.CV
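TL;DR: This paper proposes a plug-and-play Distracting Token Pruning (DTP) framework that dynamically detects and prunes distracting image tokens from task-irrelevant regions in Vision-Language Action (VLA) models, improving task success rates without changing the model architecture or adding extra inputs.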
Details
Motivation: By default, VLA models may attend excessively to image tokens in task-irrelevant regions ("distracting tokens"), which harms action generation and task success rates. Method: The Distracting Token Pruning (DTP) framework dynamically detects and prunes distracting image tokens, correcting the model's visual attention patterns. Result: On the SIMPLER benchmark, DTP yields consistent relative improvements in success rate across several novel VLA models; the analysis finds that task success rate is negatively correlated with the amount of attention in task-irrelevant regions. Conclusion: DTP is a simple, effective, and general plug-and-play method that exposes an attention bias common to VLA models and points to new directions for future research. Abstract: Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can disturb the model's generation of the desired action tokens at each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the performance upper bounds of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
[102] DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models
Hanwen Zhang,Qiaojin Shen,Yuxi Liu,Yuesheng Zhu,Guibo Luo
Main category: cs.CV
TL;DR: DSFedMed is a dual-scale federated framework for medical image segmentation that enables mutual knowledge distillation between a centralized foundation model and lightweight client models, improving performance while drastically reducing communication and inference costs.
Details
Motivation: Foundation Models (FMs) face challenges in federated settings due to high computation, communication overhead, and inference cost—especially critical in resource-limited medical applications. Method: DSFedMed introduces mutual knowledge distillation between a central foundation model and lightweight client models; uses synthetically generated high-quality medical images and a learnability-guided sample selection strategy to replace real public data and improve distillation efficiency. Result: On five medical segmentation datasets, DSFedMed achieves ~2% Dice score improvement and reduces communication costs and inference time by ~90% compared to existing federated FM baselines. Conclusion: DSFedMed significantly enhances efficiency and scalability of foundation models in federated medical imaging, enabling practical deployment under resource constraints. Abstract: Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. 
Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
[103] Masked Modeling for Human Motion Recovery Under Occlusions
Zhiyin Qian,Siwei Zhang,Bharat Lal Bhatnagar,Federica Bogo,Siyu Tang
Main category: cs.CV
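TL;DR: MoRo is an end-to-end generative framework based on masked modeling that robustly reconstructs human motion from monocular video. It performs especially well under frequent occlusions, combining high accuracy, high realism, and real-time inference (70 FPS).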
Details
Motivation: Existing methods are either fragile under occlusion (regression-based) or inefficient (optimization- and diffusion-based), and paired video-motion data is scarce. Method: The MoRo framework comprises 1) a video-conditioned masked modeling task and 2) a cross-modality learning scheme that fuses three priors: a trajectory-aware motion prior (MoCap), an image-conditioned pose prior (image-pose datasets), and a video-conditioned masked transformer (fusion and fine-tuning). Result: MoRo substantially outperforms SOTA methods on the EgoBody and RICH datasets, with better accuracy and motion realism under occlusion and comparable performance without occlusion; it reaches real-time inference at 70 FPS on a single H200 GPU. Conclusion: By combining masked modeling with cross-modality priors, MoRo achieves occlusion-robust, efficient, end-to-end human motion reconstruction and offers a new paradigm for real-world applications. Abstract: Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. 
Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
[104] SAMTok: Representing Any Mask with Two Words
Yikang Zhou,Tao Zhang,Dengxian Gong,Yuanzheng Wu,Ye Tian,Haochen Wang,Haobo Yuan,Jiacong Wang,Lu Qi,Hao Fei,Anran Wang,Zhuochen Wang,Yujing Wang,Cheng Chen,Shunping Ji,Xiangtai Li
Main category: cs.CV
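TL;DR: This paper proposes SAMTok, a discrete mask tokenizer that converts a region mask into two special tokens, so that base multimodal large language models (such as QwenVL) can acquire pixel-level understanding through standard language modeling and simple reinforcement learning without architectural changes.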
Details
Motivation: Existing pixel-wise multimodal large language models (MLLMs) are hard to scale because they depend on complex region encoders, specialized segmentation decoders, and incompatible training objectives. Method: SAMTok, a discrete mask tokenizer built on SAM2, uses a mask encoder and a residual vector quantizer to compress any region mask into two high-fidelity, information-rich discrete tokens; masks are treated as new language tokens and plugged into base MLLMs (such as the QwenVL series), which are trained with standard next-token prediction and reinforcement learning driven by a textual answer-matching reward. Result: The approach reaches SOTA or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation; it delivers substantial improvements on the GRES and GCG benchmarks; and it trains efficiently with only 5M samples. Conclusion: SAMTok provides a scalable, simple, and general paradigm that gives MLLMs strong pixel-wise capabilities without architectural modification, advancing interactive intelligent systems. Abstract: Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
[105] Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification
Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Quinn Ledingham,Lincoln Linlin Xu
Main category: cs.CV
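TL;DR: This paper proposes the CSSMamba framework, which improves hyperspectral image classification through a clustering-guided spatial-spectral Mamba structure and an attention-driven token selection mechanism.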
Details
Motivation: Mamba models for hyperspectral image classification struggle to define efficient and adaptive token sequences. Method: The CSSMamba framework comprises a clustering-guided spatial Mamba module (CSpaMamba), a spectral Mamba module (SpeMamba), an attention-driven token selection mechanism, and a learnable clustering module. Result: On the Pavia University, Indian Pines, and Liao-Ning 01 datasets, CSSMamba outperforms CNN, Transformer, and existing Mamba methods in classification accuracy and boundary preservation. Conclusion: By combining clustering with the Mamba architecture, CSSMamba effectively improves hyperspectral image classification performance and feature representation. Abstract: Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to better address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate the clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves Mamba feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to integrate clustering into the Mamba model coherently, we design a Learnable Clustering Module that learns the cluster memberships in an adaptive manner. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.
[106] Learning to Watermark in the Latent Space of Generative Models
Sylvestre-Alvise Rebuffi,Tuan Tran,Valeriu Lacatusu,Pierre Fernandez,Tomáš Souček,Nikola Jovanović,Tom Sander,Hady Elsahar,Alexandre Mourachko
Main category: cs.CV
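TL;DR: This paper proposes DistSeal, a unified method for embedding watermarks in the latent space of generative models. Distillation integrates the watermarking model into the generative model or its latent decoder, substantially improving efficiency (up to 20x speedup) and robustness while keeping the watermark imperceptible.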
Details
Motivation: Most existing watermarking methods for AI-generated images are post-hoc and operate in pixel space, which incurs high computational overhead and tends to introduce visual artifacts. Method: DistSeal, a latent-space watermarking method, first trains a post-hoc watermarking model in the latent space of a generative model (diffusion or autoregressive), then distills it into the generative model itself or into the latent decoder to obtain in-model watermarking. Result: Latent watermarks match pixel-space baselines in robustness, offer comparable imperceptibility, and run up to 20x faster at inference; distilling latent watermarkers works better than distilling pixel-space ones. Conclusion: Latent-space watermarking is a more efficient, more robust, and practical paradigm for protecting the provenance of AI-generated images, and DistSeal provides a unified, workable watermarking solution across generative model architectures. Abstract: Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.
[107] ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
Remy Sabathier,David Novotny,Niloy J. Mitra,Tom Monnier
Main category: cs.CV
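TL;DR: ActionMesh is a new generative model that extends 3D diffusion models with a temporal axis to produce animated 3D meshes quickly and with high quality. The results are topology-consistent and rig-free, and the model accepts several kinds of input: video, text, or a 3D mesh plus text.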
Details
Motivation: Existing methods for generating animated 3D objects suffer from restricted setups, long runtimes, or low quality, which makes them hard to use in practice. Method: A "temporal 3D diffusion" framework: 1) modify a 3D diffusion model to generate a sequence of temporally synchronized 3D shape latents; 2) design a temporal 3D autoencoder that maps a sequence of independent shapes to a sequence of deformations of a reference shape, from which the animation is built. Result: The model reaches SOTA geometric accuracy and temporal consistency on standard benchmarks such as Consistent4D and Objaverse; generation is fast, and the results are rig-free and topology-consistent, which simplifies texturing and retargeting. Conclusion: ActionMesh delivers high-quality, efficient, easy-to-integrate animated 3D mesh generation and makes generative 4D content creation considerably more practical. Abstract: Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dub "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
[108] HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval
Zequn Xie,Xin Liu,Boyun Zhang,Yuxiao Lin,Sihang Cai,Tao Jin
Main category: cs.CV
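TL;DR: This paper proposes HVD, a text-video retrieval model inspired by human vision. A coarse-to-fine alignment mechanism (key frame selection and patch feature compression) improves the model's focus on key visual information and reaches SOTA performance on several benchmarks.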
Details
Motivation: Existing text-video retrieval methods suffer from "blind" feature interaction: because text queries are sparse, alignment is difficult and the model struggles to pick out key visual information from background noise. Method: The Human Vision-Driven (HVD) model has two modules: the Frame Features Selection Module (FFSM) for macro-perception, which selects key frames, and the Patch Features Compression Module (PFCM) for micro-perception, which uses an advanced attention mechanism to aggregate patch features into salient visual entities. Result: Extensive experiments on five benchmark datasets show that HVD effectively mimics human visual focus and achieves state-of-the-art performance. Conclusion: By imitating human cognitive behavior, HVD substantially improves the modeling of key visual information in text-video retrieval and offers a new approach to visual redundancy under sparse text guidance. Abstract: The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
[109] 360Anything: Geometry-Free Lifting of Images and Videos to 360°
Ziyi Wu,Daniel Watson,Andrea Tagliasacchi,David J. Fleet,Marcus A. Brubaker,Saurabh Saxena
Main category: cs.CV
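TL;DR: This paper proposes 360Anything, a diffusion transformer framework that needs no geometric priors or camera parameters and generates 360° panoramas end-to-end from a single image or video. It also resolves the seam artifacts at ERP boundaries and supports zero-shot camera parameter estimation.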
Details
Motivation: Existing methods rely on known camera parameters for geometric alignment, which makes them hard to apply to uncalibrated in-the-wild data; the goal is to remove the dependence on explicit geometric modeling and camera metadata. Method: Building on pre-trained diffusion transformers, the approach treats both the perspective input and the panorama as token sequences and learns the mapping in a purely data-driven way; Circular Latent Encoding is introduced to fix the ERP boundary seams caused by zero-padding in the VAE. Result: The method reaches SOTA on perspective-to-360° generation for both images and videos, outperforming prior methods that use ground-truth camera parameters, and is competitive on zero-shot FoV and orientation estimation benchmarks. Conclusion: A purely data-driven, geometry-free diffusion model can effectively model viewpoint transformation, combining high-quality generation with implicit geometric understanding and broadening its use in computer vision. Abstract: Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, hindering application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
[110] Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Shengbang Tong,Boyang Zheng,Ziteng Wang,Bingda Tang,Nanye Ma,Ellis Brown,Jihan Yang,Rob Fergus,Yann LeCun,Saining Xie
Main category: cs.CV
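TL;DR: This paper studies how Representation Autoencoders (RAEs) scale to large-scale text-to-image (T2I) generation. RAEs outperform VAEs (such as the FLUX VAE) in both pretraining and finetuning, with greater stability, faster convergence, and better generation quality, and they let visual understanding and generation share one representation space.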
Details
Motivation: To explore whether the RAE framework can scale from ImageNet to large-scale, freeform text-to-image generation, and to test its effectiveness and potential for simplification on larger, more complex data. Method: 1) Scale the RAE decoder on top of a frozen SigLIP-2 encoder, training on web, synthetic, and text-rendering data; 2) systematically evaluate whether the original RAE design choices (such as noise scheduling, diffusion head width, and noise-augmented decoding) remain necessary at large model scale; 3) run controlled comparisons against the FLUX VAE across 0.5B to 9.8B parameters, covering pretraining and finetuning on high-quality data. Result: RAEs outperform VAEs in pretraining at every model scale; in finetuning, VAE-based models catastrophically overfit after 64 epochs, whereas RAE models train stably through 256 epochs with better performance; RAEs converge faster and generate higher-quality images; and they support a shared representation space for understanding and generation, which benefits unified multimodal modeling. Conclusion: RAEs are a simpler and stronger foundation than VAEs for large-scale T2I generation; their advantage stems from the semantic nature of the representation space and from training stability, offering a new paradigm for unifying visual understanding and generation. Abstract: Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. 
Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
[111] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Onkar Susladkar,Tushar Prakash,Adheesh Juvekar,Kiet A. Nguyen,Dong-Hwan Jang,Inderjit S Dhillon,Ismini Lourentzou
Main category: cs.CV
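TL;DR: PyraTok is a language-aligned pyramidal video tokenizer that uses multi-scale discretization with a shared large binary codebook to improve cross-modal alignment and zero-shot transfer, reaching SOTA on a range of video tasks.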
Details
Motivation: Existing discrete video VAE tokenizers are typically single-scale, have limited vocabularies, and use shallow language supervision, which leads to poor cross-modal alignment and weak zero-shot transfer. Method: Building on a pretrained video VAE, the authors propose a Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, and jointly optimize multi-scale text-guided quantization with a hierarchical autoregressive objective. Result: PyraTok achieves SOTA video reconstruction across ten benchmarks, consistently improves text-to-video generation quality, sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, and scales to 4K/8K resolutions. Conclusion: Through semantically structured, multi-scale, language-aligned discrete representations, PyraTok substantially strengthens cross-modal modeling and generalization in video generation and understanding systems. Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K/8K resolutions.
[112] Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition
Geo Ahn,Inwoong Lee,Taeoh Kim,Minho Shim,Dongyoon Wee,Jinwoo Choi
Main category: cs.CV
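TL;DR: This paper studies Compositional Video Understanding (CVU) and finds that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail because of "object-driven verb shortcuts". The authors propose the RCORE framework, which mitigates the problem with composition-aware augmentation and a temporal order regularization loss, and validate it on several benchmarks.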
Details
Motivation: Existing ZS-CAR models generalize poorly to unseen verb-object compositions, mainly because they rely on co-occurrence statistics rather than genuine compositional reasoning, an overlooked failure mode the authors call "object-driven verb shortcuts". Method: The RCORE framework uses (i) composition-aware data augmentation that diversifies verb-object combinations while preserving motion cues, and (ii) a temporal order regularization loss that explicitly models temporal structure to suppress shortcut behavior. Result: On two benchmarks, Sth-com and the newly constructed EK100-com, RCORE significantly improves accuracy on unseen compositions, reduces reliance on co-occurrence bias, and consistently yields positive compositional gaps. Conclusion: Object-driven shortcuts are a key bottleneck in ZS-CAR, and explicitly modeling temporal structure while increasing compositional diversity is necessary for robust compositional video understanding. Abstract: We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
[113] CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback
Wenhang Ge,Guibao Shen,Jiawei Feng,Luozhou Wang,Hao Lu,Xingye Tian,Xin Tao,Ying-Cong Chen
Main category: cs.CV
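TL;DR: This paper proposes CamPilot, which introduces a camera-aware 3D decoder that jointly decodes the video latent and the camera pose into 3D Gaussians, models a video-camera alignment reward from the rendering blur caused by geometric distortion, and combines it with visibility-weighted consistency optimization, substantially improving the camera controllability of diffusion models.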