cs.CL

[1] Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration

Longxuan Wei,Yubo Zhang,Zijiao Zhang,Zhihu Wang,Shiwan Zhao,Tianyu Huang,Huiting Zhao,Chenfei Liu,Shenao Zhang,Junchi Yan

Main category: cs.CL

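TL;DR: This paper proposes Entropy-Tree, a decoding method that uses the model's predictive entropy to guide tree search, branching where uncertainty is high, and improves accuracy and calibration on reasoning tasks.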

Details

Motivation: Existing decoding strategies, such as random sampling or independent multi-chain sampling, explore either blindly or redundantly and make little use of the model's own uncertainty.

Method: Entropy-Tree is an entropy-based tree decoding method that branches only at positions where the model's predictive entropy is high, giving a structured, efficient, and principled search.

Result: Across multiple models and datasets, Entropy-Tree achieves better pass@k than Multi-chain, and its predictive entropy achieves better AUROC than several traditional uncertainty metrics.

Conclusion: Entropy-Tree unifies efficient structured search and reliable uncertainty estimation in a single decoding procedure, offering a stronger decoding paradigm for LLM reasoning.

Abstract: Large language models achieve strong reasoning performance, yet existing decoding strategies either explore blindly (random sampling) or redundantly (independent multi-sampling). We propose Entropy-Tree, a tree-based decoding method that exploits entropy as a signal for branching decisions, expanding the search tree only at positions where the model exhibits genuine uncertainty. Entropy-Tree shows superior accuracy and calibration in reasoning tasks: it achieves better pass@k than Multi-chain across multiple models and datasets, and its predictive entropy demonstrates better AUROC compared to several traditional metrics. Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure.
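The branching rule described for Entropy-Tree can be illustrated with a minimal sketch. It assumes a fixed entropy threshold and top-k branching; the function names, the threshold value, and the toy distributions are hypothetical and not taken from the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(probs, threshold=1.0):
    """Branch only where the distribution is uncertain (threshold is an assumption)."""
    return shannon_entropy(probs) > threshold

def branch_candidates(probs, k=2):
    """Token indices to expand when branching: the k most probable tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Toy next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # low entropy: continue a single chain
uncertain = [0.25, 0.25, 0.25, 0.25]  # high entropy: expand the tree here

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(shannon_entropy(dist), 3), should_branch(dist))
```

In a real decoder the distribution would come from the model's softmax output at each position, and the branching criterion may differ from a fixed threshold.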

[2] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

Edward Ajayi

Main category: cs.CL

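TL;DR: This paper presents AfriEconQA, the first question-answering benchmark focused on African economic analysis. It contains 8,937 QA instances built from World Bank reports that require high-precision numerical and temporal reasoning. The benchmark exposes a severe knowledge gap in current LLMs: zero-shot models fail on over 90% of queries and RAG systems also struggle, which makes it a useful testbed for domain-specific information retrieval and retrieval-augmented generation (RAG) systems.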

Details

Motivation: Specialized documents on African economics are largely missing from LLM pretraining corpora, so models perform poorly on tasks in this domain. A high-quality, challenging benchmark is needed to drive progress in domain-specific IR and RAG systems.

Method: The authors build the AfriEconQA dataset from 236 World Bank reports, generating and filtering 8,937 QA pairs that require numerical and temporal reasoning. They run 11 experiments comparing a GPT-5 Mini zero-shot baseline with RAG configurations based on GPT-4o and Qwen 32B across five embedding and ranking strategies.

Result: Zero-shot models fail on over 90% of queries, and even state-of-the-art RAG pipelines struggle to reach high precision, confirming that AfriEconQA is challenging and robust.

Conclusion: AfriEconQA is the first benchmark dedicated to African economic analysis. It exposes the knowledge gaps of current models and provides an evaluation platform for more precise, traceable domain-specific IR and RAG systems.

Abstract: We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.

[3] Embedding Retrofitting: Data Engineering for better RAG

Anantha Sharma

Main category: cs.CL

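TL;DR: This paper proposes a data engineering framework that addresses the drop in knowledge graph quality caused by annotation artifacts (such as hashtags) in real-world corpora, and thereby improves embedding retrofitting for domain-specific retrieval. Experiments show that preprocessing quality affects retrofitting far more than the choice of algorithm.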

Details

Motivation: Retrofitting pre-trained word vectors with knowledge graph constraints can improve domain-specific retrieval, but its performance depends heavily on knowledge graph quality, which is in turn sensitive to annotation artifacts (such as hashtags) left in the text by preprocessing.

Method: The paper proposes a data engineering framework that identifies and mitigates the spurious inflation of knowledge graph density caused by annotation artifacts, hashtags in particular. It compares several retrofitting methods (including EWMA) on retrieval tasks before and after cleaning, with statistical significance tests.

Result: On noisy graphs, all retrofitting methods degrade significantly (-3.5% to -5.2%, p < 0.05). After preprocessing, EWMA retrofitting improves by 6.2% (p = 0.0348), with gains of 33.8% on quantitative synthesis questions. The performance swing due to preprocessing quality (over 10%) far exceeds the difference between algorithms (about 3%).

Conclusion: Knowledge graph preprocessing quality is the primary determinant of retrofitting success and matters far more than the choice of retrofitting algorithm. Data engineering deserves the same attention as model design.

Abstract: Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation (-3.5% to -5.2%, p < 0.05). After preprocessing, EWMA retrofitting achieves a +6.2% improvement (p = 0.0348), with benefits concentrated in quantitative synthesis questions (+33.8% on average). The gap between clean and noisy preprocessing (a swing of over 10%) exceeds the gap between algorithms (3%), establishing preprocessing quality as the primary determinant of retrofitting success.

[4] MALTopic: Multi-Agent LLM Topic Modeling Framework

Yash Sharma

Main category: cs.CL

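TL;DR: This paper proposes MALTopic, a multi-agent LLM topic modeling framework that combines structured survey data with free-text responses and uses several specialized LLM agents (enrichment, topic modeling, deduplication) to improve topic coherence, diversity, and interpretability, outperforming LDA and BERTopic.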

Details

Motivation: Traditional topic modeling methods handle only free text, ignore structured or categorical survey data, and produce abstract topics that require extensive human interpretation.

Method: The paper proposes the Multi-Agent LLM Topic Modeling Framework (MALTopic), which comprises three specialized LLM agents: an enrichment agent that uses structured data to enhance the text, a topic modeling agent that extracts latent themes, and a deduplication agent that refines the results.

Result: Comparative experiments on a survey dataset show that MALTopic significantly improves topic coherence, diversity, and interpretability, producing more readable and contextually relevant topics.

Conclusion: By integrating structured data with multi-agent collaboration, MALTopic offers a more effective and more interpretable approach to topic modeling for complex survey data.

Abstract: Topic modeling is a crucial technique for extracting latent themes from unstructured text data, particularly valuable in analyzing survey responses. However, traditional methods often consider only free-text responses and do not natively incorporate structured or categorical survey responses for topic modeling. They also produce abstract topics that require extensive human interpretation. To address these limitations, we propose the Multi-Agent LLM Topic Modeling Framework (MALTopic). This framework decomposes topic modeling into specialized tasks executed by individual LLM agents: an enrichment agent leverages structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the results. Comparative analysis on a survey dataset demonstrates that MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic. By integrating structured data and employing a multi-agent approach, MALTopic generates human-readable topics with enhanced contextual relevance, offering a more effective solution for analyzing complex survey data.

[5] Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis

Weiwei Wang,Jiyong Min,Weijie Zou

Main category: cs.CL

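TL;DR: This paper systematically studies "intelligence degradation" in long-context LLMs, where performance drops by more than 30% as the context approaches a critical length. Using natural length analysis, critical threshold experiments (40-50% of maximum context length for Qwen2.5-7B), and a unified framework, it gives the first empirical characterization and explanation of the problem for open-source Qwen models.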

Details

Motivation: LLMs suffer catastrophic performance drops (over 30%) when processing long contexts near a critical length, which severely limits practical use. Existing studies mostly rely on truncation or padding and provide weak causal evidence, so a systematic analysis based on natural lengths is needed.

Method: (1) Natural length distribution analysis: use each sample's original token length directly, without truncation or padding. (2) Critical threshold determination: cross-validate with five methods on a mixed dataset of 1,000 samples covering 5%-95% of the context length. (3) A unified framework built around the concept of "shallow long-context adaptation" to explain the degradation patterns.

Result: The critical threshold for Qwen2.5-7B lies at 40-50% of the maximum context length, where F1 drops from 0.55-0.56 to 0.3 (a 45.5% decline). The degradation is attributed to context length itself rather than to data distribution bias.

Conclusion: Intelligence degradation is an inherent shallow long-context adaptation phenomenon in LLMs. This work is the first to characterize it systematically for the Qwen family and provides an empirical basis for long-context deployment and for mitigation strategies.

Abstract: Large Language Models (LLMs) exhibit catastrophic performance degradation when processing contexts approaching certain critical thresholds, even when information remains relevant. This intelligence degradation, defined as a drop of over 30% in task performance, severely limits long-context applications. The degradation shows a common pattern: models maintain strong performance up to a critical threshold, then collapse catastrophically. We term this shallow long-context adaptation: models adapt for short to medium contexts but fail beyond critical thresholds. This paper presents three contributions: (1) Natural Length Distribution Analysis: We use each sample's natural token length without truncation or padding, providing stronger causal evidence that degradation results from context length itself. (2) Critical Threshold Determination: Through experiments on a mixed dataset (1,000 samples covering 5%-95% of context length), we identify the critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), using five-method cross-validation. (3) Unified Framework: We consolidate shallow adaptation, explaining degradation patterns and providing a foundation for mitigation strategies. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models, offering practical guidance for deploying LLMs in long-context scenarios.
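The reported 45.5% figure is a relative drop, which can be checked with a short sketch. It assumes the baseline is the lower end of the reported F1 range (0.55); the function names are illustrative only.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative performance drop, as a fraction of the starting score."""
    return (before - after) / before

def is_intelligence_degradation(before: float, after: float, cutoff: float = 0.30) -> bool:
    """The abstract defines degradation as a drop of over 30% in task performance."""
    return relative_drop(before, after) > cutoff

# F1 values quoted in the abstract, taking 0.55 as the pre-threshold score (assumption).
drop = relative_drop(0.55, 0.30)
print(f"{drop:.1%}")  # about 45.5%
print(is_intelligence_degradation(0.55, 0.30))
```

Using 0.56 as the baseline instead gives a slightly larger drop (about 46.4%), so the quoted figure appears to be computed from 0.55.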

[6] Can We Trust LLM Detectors?

Jivnesh Sandhan,Harshit Jaiswal,Fei Cheng,Yugo Murawaki

Main category: cs.CL

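TL;DR: This paper systematically evaluates the robustness of existing AI text detectors and finds that both dominant families, training-free and supervised, are brittle under distribution shift, unseen generators, and stylistic perturbations. It proposes a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show strong in-domain performance but limited cross-domain generalization, exposing fundamental challenges in building domain-agnostic detectors.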

Details

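Motivation: Existing AI text detectors are not robust in real-world settings and fail in particular under distribution shift, unseen generators, and stylistic perturbations, so more reliable detection methods are needed.

Method: The paper proposes a supervised contrastive learning (SCL) framework that learns discriminative style embeddings to improve robustness, and systematically evaluates the two dominant paradigms, training-free and supervised.

Result: Supervised detectors perform well in-domain but degrade sharply out-of-domain, and training-free methods are highly sensitive to proxy choice. The SCL framework improves style representations but still falls short of truly domain-agnostic detection.

Conclusion: Current AI text detection methods are strongly domain-dependent and generalize poorly. Building a truly robust, domain-agnostic detector remains a fundamental challenge.

Abstract: The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI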

[7] ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation

Zhebo Wang,Xiaohu Mu,Zijie Zhou,Mohan Li,Wenpeng Xing,Dezhang Kong,Meng Han

Main category: cs.CL

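TL;DR: This paper proposes Illocution-Calibrated Policy Optimization (ICPO), a training framework that adds ambiguous instructions to the training data and ties the reward to the user's illocutionary intent. It mitigates the "lost-in-conversation" problem, where LLMs fail to recover from early wrong assumptions in multi-turn dialogue, yielding an average improvement of 75% in multi-turn conversation while preserving single-turn performance.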

Details

Motivation: In multi-turn conversations, LLMs tend to make wrong assumptions when the user's initial instruction is ambiguous, and standard post-training methods such as RLVR make the model overconfident and suppress its tendency to ask for clarification.

Method: ICPO adds underspecified prompts to the training data and conditions the reward signal on the user's illocutionary intent, encouraging the model to express uncertainty or ask a question when it faces ambiguity.

Result: ICPO leads the model to show more appropriate humility, with an average improvement of 75% on multi-turn conversation tasks and robust performance retained on single-turn benchmarks.

Conclusion: ICPO offers a practical path toward more robust and collaborative conversational AI that copes better with the complexity and ambiguity of human interaction.

Abstract: Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.

[8] RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models

Rishit Chugh

Main category: cs.CL

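TL;DR: This paper proposes a resource-efficient adversarial prompting method that retrieves from a database of pre-trained adversarial prompts instead of running expensive online optimization. It greatly reduces compute cost while keeping a competitive attack success rate, and is suited to large-scale red-teaming and security evaluation of aligned LLMs.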

Details

Motivation: Existing gradient-search attack methods (such as GCG, PEZ, and GBDA) are effective but computationally expensive, which makes them impractical in resource-constrained settings. A lightweight alternative that needs no retraining is required.

Method: The authors build a dataset of 1,000 prompts in seven harm categories and evaluate GCG, PEZ, and GBDA on Llama 3 8B to identify the best algorithm per category. They then build a database of pre-trained adversarial prompts and retrieve matching prompts by semantic similarity, avoiding retraining.

Result: Prompt type correlates with attack algorithm effectiveness. The retrieval-based method achieves attack success rates comparable to GCG and similar methods at much lower compute cost.

Conclusion: A lightweight, prompt-matching approach provides a scalable, low-cost, and black-box-friendly way to evaluate the security of aligned LLMs.

Abstract: The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient-based search. Although effective, these methods, particularly GCG, are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource-efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre-trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm-related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red-teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.

[9] No Reliable Evidence of Self-Reported Sentience in Small Large Language Models

Caspar Kaiser,Sean Enderby

Main category: cs.CL

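TL;DR: This paper asks several open-weights language models about their own consciousness and checks their answers with classifiers trained on internal activations. The models consistently deny being sentient, and the classifiers find no clear evidence that these denials are untruthful. Within the Qwen family, larger models deny sentience more confidently.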

Details

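Motivation: The paper examines whether language models believe themselves to be sentient, rather than whether they actually are, since the latter currently has no empirical answer.

Method: Open-weights models from three families (Qwen, Llama, GPT-OSS) are asked approximately 50 questions about consciousness and subjective experience, and classifiers based on three interpretability methods are trained on their internal activations to infer latent beliefs.

Result: (1) Models consistently deny being sentient while attributing consciousness to humans. (2) Classifiers based on internal activations find no clear evidence that the denials are untruthful. (3) Within the Qwen family, models with more parameters deny sentience more confidently.

Conclusion: Current open-weights language models show no evidence, at either the behavioral or the latent-representation level, of believing themselves to be sentient, which contrasts with recent claims that models harbour latent beliefs in their own consciousness.

Abstract: Whether language models possess sentience has no empirical answer. But whether they believe themselves to be sentient can, in principle, be tested. We do so by querying several open-weights models about their own consciousness, and then verifying their responses using classifiers trained on internal activations. We draw upon three model families (Qwen, Llama, GPT-OSS) ranging from 0.6 billion to 70 billion parameters, approximately 50 questions about consciousness and subjective experience, and three classification methods from the interpretability literature. First, we find that models consistently deny being sentient: they attribute consciousness to humans but not to themselves. Second, classifiers trained to detect underlying beliefs, rather than mere outputs, provide no clear evidence that these denials are untruthful. Third, within the Qwen family, larger models deny sentience more confidently than smaller ones. These findings contrast with recent work suggesting that models harbour latent beliefs in their own consciousness.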

[10] From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs

Angelina Parfenova,David Graus,Juergen Pfeffer

Main category: cs.CL

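TL;DR: This paper proposes a method for performing axial coding with large language models (LLMs): open codes are grouped into higher-order categories either by clustering or directly by an LLM. Applied to Dutch parliamentary debates and evaluated on coverage, alignment, brevity, and other dimensions, the method reveals a trade-off between clustering and direct LLM grouping.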

Details

Motivation: Axial coding is a key step in qualitative analysis, but doing it by hand is slow and laborious. The paper aims to automate it with LLMs to improve structured understanding of long debate transcripts.

Method: Two axial coding strategies are proposed: (i) cluster embeddings of code-utterance pairs with density-based and partitioning algorithms, then have an LLM label the categories; (ii) have an LLM group open codes and utterances into categories directly. Both are implemented and evaluated on Dutch parliamentary debate data.

Result: Density-based clustering is better on coverage and cluster separation, while direct LLM grouping is better on fine-grained semantic alignment, brevity, and interpretability, although its coverage is 20% lower. Results are assessed with extrinsic and intrinsic metrics such as ROUGE-L, BERTScore, and JSD.

Conclusion: LLMs can effectively support axial coding, but the strategies involve a clear trade-off: clustering suits broad coverage, while direct LLM grouping suits high-quality, interpretable higher-level categories. The authors release the full dataset to support follow-up research.

Abstract: Axial coding is a commonly used qualitative analysis method that enhances document understanding by organizing sentence-level open codes into broader categories. In this paper, we operationalize axial coding with large language models (LLMs). Extending an ensemble-based open coding approach with an LLM moderator, we add an axial coding step that groups open codes into higher-order categories, transforming raw debate transcripts into concise, hierarchical representations. We compare two strategies: (i) clustering embeddings of code-utterance pairs using density-based and partitioning algorithms followed by LLM labeling, and (ii) direct LLM-based grouping of codes and utterances into categories. We apply our method to Dutch parliamentary debates, converting lengthy transcripts into compact, hierarchically structured codes and categories. We evaluate our method using extrinsic metrics aligned with human-assigned topic labels (ROUGE-L, cosine, BERTScore), and intrinsic metrics describing code groups (coverage, brevity, coherence, novelty, JSD divergence). Our results reveal a trade-off: density-based clustering achieves high coverage and strong cluster alignment, while direct LLM grouping results in higher fine-grained alignment but lower coverage (20%). Overall, clustering maximizes coverage and structural separation, whereas LLM grouping produces more concise, interpretable, and semantically aligned categories. To support future research, we publicly release the full dataset of utterances and codes, enabling reproducibility and comparative studies.

[11] Memorization Dynamics in Knowledge Distillation for Language Models

Jaydeep Borkar,Karan Chadha,Niloofar Mireshghallah,Yuchen Zhang,Irina-Elena Veliche,Archi Mitra,David A. Smith,Zheng Xu,Diego Garcia-Olano

Main category: cs.CL

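TL;DR: This paper studies training data memorization under knowledge distillation (KD) in large language models. Distilled models memorize far less training data than standard fine-tuning (a reduction of more than 50%), a subset of examples accounts for most memorization, and memorization can be predicted from features such as zlib entropy, KL divergence, and perplexity. Hard distillation inherits more teacher-specific examples than soft distillation and therefore carries higher risk.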

Details

Motivation: Knowledge distillation is used to improve efficiency and to protect privacy, yet how it affects training data memorization is not well understood, especially in comparison with standard pre-training and fine-tuning.

Method: Using three LLM families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2), the authors evaluate memorization across the KD pipeline, predict student memorization from features such as zlib entropy, KL divergence, and perplexity, and compare soft and hard distillation.

Result: (1) Distilled models memorize over 50% less than standard fine-tuning. (2) About 95% of memorization comes from a small set of easy-to-memorize examples. (3) Memorization can be predicted in advance. (4) Hard distillation inherits 2.7 times as many teacher-specific examples as soft distillation.

Conclusion: Knowledge distillation can improve generalization and also reduce training data memorization risk, making it a compression approach that balances performance and privacy. The distillation method should be chosen with care (for example, avoiding hard distillation) to limit privacy leakage.

Abstract: Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over ~95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits 2.7x more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
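One of the predictive features named above, zlib entropy, can be sketched as follows. This assumes the common definition used in the memorization literature (the size in bits of the zlib-compressed text); the paper's exact feature definition is not given in the abstract, and the function name is hypothetical.

```python
import random
import string
import zlib

def zlib_entropy_bits(text: str) -> int:
    """Size in bits of the zlib-compressed UTF-8 text (assumed definition)."""
    return 8 * len(zlib.compress(text.encode("utf-8")))

# Highly repetitive text compresses well, so its zlib entropy is low.
repetitive = "abc" * 100

# Pseudo-random letters of the same length compress poorly (seeded for repeatability).
rng = random.Random(0)
varied = "".join(rng.choice(string.ascii_letters) for _ in range(300))

print(zlib_entropy_bits(repetitive), zlib_entropy_bits(varied))
```

In a feature pipeline this value would typically be combined with model-based signals such as perplexity and KL divergence, which require access to the teacher and student models.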

Tamunotonye Harry,Ivoline Ngong,Chima Nweke,Yuanyuan Feng,Joseph Near

Main category: cs.CL

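TL;DR: This paper introduces Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, for studying how state and trait affect user interactions with language models. It finds that most variation in user behavior comes from state rather than trait, that current LLMs are insensitive to state, and that reward models perceive state but react inconsistently. The dataset is publicly released.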

Details

Motivation: Existing persona datasets (such as PersonaChat and PANDORA) capture only static traits and ignore the dynamic state induced by the interaction context, which leaves personalization modeling incomplete.

Method: The authors build the Chameleon dataset of multi-context psychological profiles, decompose variance following Latent State-Trait theory, and empirically evaluate how LLMs and reward models respond to user state.

Result: 74% of the variance is within-person (state) and only 26% is between-person (trait). LLM responses are largely state-blind. Different reward models give conflicting preference signals for the same user state.

Conclusion: User state is a key factor in human-AI interaction, and current models are seriously deficient in state sensitivity, which calls for new data, new evaluation paradigms, and new alignment methods.

Abstract: User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (such as PersonaChat and PANDORA) capture only trait and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person (state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
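The within-person versus between-person split can be illustrated with a plain sum-of-squares decomposition. This is a simplified stand-in, not the Latent State-Trait model the authors draw on; the function name and the toy scores are hypothetical.

```python
from statistics import mean

def variance_shares(scores_by_user):
    """Return (within_share, between_share) from a sum-of-squares decomposition.

    scores_by_user maps a user id to that user's scores across contexts.
    """
    all_scores = [s for scores in scores_by_user.values() for s in scores]
    grand_mean = mean(all_scores)
    # Between-person: how far each user's own mean sits from the grand mean.
    between_ss = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in scores_by_user.values()
    )
    # Within-person: how far each observation sits from that user's own mean.
    within_ss = sum(
        (s - mean(scores)) ** 2
        for scores in scores_by_user.values()
        for s in scores
    )
    total_ss = between_ss + within_ss
    if total_ss == 0:
        return 0.0, 0.0
    return within_ss / total_ss, between_ss / total_ss

# Toy data: two users, each measured in two contexts.
toy = {"user_a": [1.0, 3.0], "user_b": [2.0, 4.0]}
within_share, between_share = variance_shares(toy)
print(within_share, between_share)
```

In this toy example most of the variation is within-person, which mirrors the direction of the reported 74% versus 26% split without reproducing it.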

[13] Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs

Sydney Anuyah,Mehedi Mahmud Kaushik,Hao Dai,Rakesh Shiradkar,Arjan Durresi,Sunandan Chakraborty

Main category: cs.CL

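TL;DR: This paper studies how knowledge graphs (KGs) can improve retrieval-augmented generation (RAG) in healthcare. A KG whose scope matches the question precisely (for example, the Alzheimer's disease KG alone) works better than indiscriminately merging several graphs, with the largest gains for small and mid-sized language models. Large models have strong parametric priors and sometimes reach similar performance without RAG.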

Details

Motivation: Large language models (LLMs) generate fluent text but still struggle with trustworthy, domain-specific reasoning. The paper investigates whether structured domain knowledge, medical knowledge graphs in particular, can improve the accuracy and reliability of RAG for medical question answering.

Method: The authors construct three PubMed-derived, disease-specific knowledge graphs (T2DM, Alzheimer's disease, and the two combined), design two probe tasks (Probe 1 and Probe 2), and evaluate seven instruction-tuned LLMs across different KG combinations (including a No-RAG baseline) and three decoding temperatures, analyzing the effects of scope matching, model size, and temperature.

Result: Scope-matched knowledge graphs (notably G2 alone) give the most consistent gains. Indiscriminate graph unions (such as G1+G2) often introduce distractors and reduce accuracy. Large models often match or exceed KG-RAG on Probe 1 without retrieval, while small and mid-sized models benefit more from well-scoped RAG. Temperature has a minor effect, and higher values rarely help.

Conclusion: A precision-first, scope-matched KG strategy should be preferred over breadth-first graph unions. The paper also gives practical guidelines for graph selection, model sizing, and retrieval and reranking.

Abstract: Large Language Models (LLMs) generate fluent answers but can struggle with trustworthy, domain-specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval-Augmented Generation (RAG) for healthcare by constructing three PubMed-derived graphs: $\mathbb{G}_1$ (T2DM), $\mathbb{G}_2$ (Alzheimer's disease), and $\mathbb{G}_3$ (AD+T2DM). We design two probes: Probe 1 targets merged AD-T2DM knowledge, while Probe 2 targets the intersection of $\mathbb{G}_1$ and $\mathbb{G}_2$. Seven instruction-tuned LLMs are tested across retrieval sources {No-RAG, $\mathbb{G}_1$, $\mathbb{G}_2$, $\mathbb{G}_1$+$\mathbb{G}_2$, $\mathbb{G}_3$, $\mathbb{G}_1$+$\mathbb{G}_2$+$\mathbb{G}_3$} and three decoding temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope-matched retrieval (notably $\mathbb{G}_2$) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG-RAG with a No-RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid-sized models benefit more from well-scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision-first, scope-matched KG-RAG is preferable to breadth-first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking. Code and data are available at https://github.com/sydneyanuyah/RAGComparison

[14] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering

Anuj Maharjan,Umesh Yadav

Main category: cs.CL

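TL;DR: This paper evaluates how well retrieval-augmented generation (RAG) architectures improve the faithfulness of LLM answers to CDC public health policy questions. Advanced RAG with cross-encoder re-ranking clearly outperforms Basic RAG and a plain LLM baseline, but the document chunking strategy remains a bottleneck for multi-step reasoning.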

Details

Motivation: In high-stakes public health policy settings, LLMs are prone to factual errors (hallucinations) that threaten information reliability, so the accuracy and trustworthiness of their answers must be improved.

Method: Using Mistral-7B-Instruct-v0.2 and the all-MiniLM-L6-v2 embedding model on a corpus of CDC policy documents, the study compares three architectures: Vanilla LLM, Basic RAG, and Advanced RAG with cross-encoder re-ranking. It also evaluates the effect of two chunking strategies (recursive character-based and token-based semantic splitting) on faithfulness and relevance.

Result: Advanced RAG achieves the highest average faithfulness (0.797), above Basic RAG (0.621) and the Vanilla LLM (0.347). Structural constraints in document segmentation still limit multi-step reasoning.

Conclusion: A two-stage retrieval mechanism is essential for precise policy question answering, and better document segmentation is the key direction for further gains.

Abstract: The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
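The two-stage retrieval idea (a cheap first-stage retriever followed by a more careful pairwise re-ranker) can be sketched in a dependency-free way. The scorers below are toy stand-ins chosen for illustration: word overlap replaces the embedding retriever, and bigram overlap replaces the cross-encoder. All function names and example chunks are hypothetical.

```python
def tokenize(text):
    return text.lower().split()

def first_stage(query, chunks, top_n):
    """Cheap recall step: rank chunks by word overlap with the query."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(c))), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties keep original order
    return [chunks[i] for _, i in scored[:top_n]]

def toy_pair_score(query, chunk):
    """Stand-in for a cross-encoder: share of query bigrams found in the chunk."""
    q, c = tokenize(query), tokenize(chunk)
    q_bigrams = set(zip(q, q[1:]))
    c_bigrams = set(zip(c, c[1:]))
    return len(q_bigrams & c_bigrams) / max(len(q_bigrams), 1)

def retrieve_and_rerank(query, chunks, top_n=2, top_k=1, pair_score=toy_pair_score):
    """Stage 1 narrows the pool; stage 2 re-orders it with the pairwise scorer."""
    candidates = first_stage(query, chunks, top_n)
    return sorted(candidates, key=lambda c: pair_score(query, c), reverse=True)[:top_k]

chunks = [
    "storage of vaccine policy records guidance",
    "policy guidance on vaccine storage",
    "travel notice for regions",
]
query = "vaccine storage guidance"
print(first_stage(query, chunks, 2))       # word overlap alone cannot separate the top two
print(retrieve_and_rerank(query, chunks))  # the pairwise scorer promotes the better match
```

In a real pipeline the first stage would use dense embeddings and the `pair_score` argument would wrap an actual cross-encoder model; only the control flow is meant to carry over.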

[15] Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

Sydney Anuyah,Sneha Shajee-Mohan,Ankit-Singh Chauhan,Sunandan Chakraborty

Main category: cs.CL

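TL;DR: This paper evaluates 13 open-source LLMs on pairwise causal discovery (PCD) from text, covering two subtasks: causal detection and causal extraction. Current models perform poorly, especially on implicit, cross-sentence, or multi-pair causal relations. The authors build a unified evaluation framework with high inter-annotator agreement and release all data, code, and prompts.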

Details

Motivation: Deploying LLMs safely in high-stakes fields such as biomedicine requires causal reasoning ability, but how well they identify and extract causal relations from text is unclear and needs systematic evaluation.

Method: The authors build a PCD benchmark from 12 diverse datasets, define two core tasks (causal detection and causal extraction), and evaluate 13 open-source LLMs with several prompting methods, including zero-shot, Chain-of-Thought (CoT), and Few-shot In-Context Learning (FICL). Data quality is validated with high inter-annotator agreement (κ ≥ 0.758).

Result: The best models score below 50% on both causal detection (49.57%) and causal extraction (47.12%), and performance drops sharply in realistic, harder cases such as implicit, cross-sentence, and multi-pair relations.

Conclusion: Current open-source LLMs fall well short on causal discovery from text and need targeted modeling and evaluation. This work provides a reproducible, high-quality unified evaluation framework and open resources to advance research on LLM causal reasoning.

Abstract: The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying if a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. Code available here: https://github.com/sydneyanuyah/CausalDiscovery
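The inter-annotator agreement statistic can be illustrated with a small sketch. It assumes Cohen's kappa for two annotators, which the abstract does not specify; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (assumed statistic)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy causal/non-causal labels (1 = causal link present) from two annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_b = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

The toy labels agree on 6 of 8 items, which yields a much lower kappa than the 0.758 floor reported for the benchmark, since kappa discounts agreement expected by chance.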

[16] Multi-Persona Thinking for Bias Mitigation in Large Language Models

Yuxing Chen,Guoqing Luo,Zijun Wu,Lili Mou

Main category: cs.CL

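TL;DR: This paper proposes Multi-Persona Thinking (MPT), an inference-time framework that uses dialectical reasoning from multiple perspectives to reduce social bias in large language models.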

Details

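Motivation: Large language models exhibit significant social biases that can reinforce stereotypes and unfair outcomes, so an effective inference-time debiasing method that needs no fine-tuning is required.

Method: Multi-Persona Thinking (MPT) has the model adopt several contrasting social identities (for example, male and female) together with a neutral viewpoint at inference time, and uses iterative dialectical reasoning to expose and correct biases.

Result: Across open-source and closed-source models of varying scales, on two widely used bias benchmarks, MPT substantially outperforms existing prompting-based methods, achieving the lowest bias without harming core reasoning ability.

Conclusion: MPT turns persona assignment from a potential weakness into a strength for debiasing, showing that inference-time multi-perspective dialectical reasoning is an effective, general, and training-free approach to bias mitigation.

Abstract: Large Language Models (LLMs) exhibit significant social biases that can perpetuate harmful stereotypes and unfair outcomes. In this paper, we propose Multi-Persona Thinking (MPT), a novel inference-time framework that leverages dialectical reasoning from multiple perspectives to reduce bias. MPT guides models to adopt contrasting social identities (e.g., male and female) along with a neutral viewpoint, and then engages these personas iteratively to expose and correct biases. Through a dialectical reasoning process, the framework transforms the potential weakness of persona assignment into a strength for bias mitigation. We evaluate MPT on two widely used bias benchmarks across both open-source and closed-source models of varying scales. Our results demonstrate substantial improvements over existing prompting-based strategies: MPT achieves the lowest bias while maintaining core reasoning ability.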

[17] ViT Registers and Fractal ViT

Jason Chuan-Chih Chou,Abhinav Kumar,Shivank Garg

Main category: cs.CL

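TL;DR: This paper proposes fractal ViT, a vision transformer variant that breaks permutation invariance among tokens by applying an attention mask between regular tokens and "summary tokens", and tests it in combination with different positional encodings. The method does not improve on ViT with registers, suggesting that the underlying findings may be scale-, domain-, or task-specific.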

Details

Motivation: Inspired by recent findings that transformers without positional encoding (NoPE) still perform reasonably well in language modeling, and that extra "register" tokens can improve large vision transformers (ViTs), the authors explore how to break token permutation invariance effectively in ViT.

Method: Fractal ViT introduces "summary tokens" similar to registers and applies a specific attention mask that constrains their interaction with regular tokens. It is tested alone and in combination with various positional encodings.

Result: Fractal ViT does not outperform the existing ViT with registers in any experiment, so the structural change yields no performance gain.

Conclusion: New mechanisms for breaking token permutation invariance, such as fractal ViT, are not necessarily effective in general. Their effect may depend heavily on model scale, task domain, or application.

Abstract: Drawing inspiration from recent findings, including the surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and how registers (additional throwaway tokens not tied to input) may improve the performance of large vision transformers (ViTs), we invent and test a variant of ViT called fractal ViT that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and "summary tokens" similar to registers, in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, highlighting the fact that these findings may be scale-, domain-, or application-specific.

[18] Computational Representations of Character Significance in Novels

Haaris Mian,Melanie Subbiah,Sharon Marcus,Nora Shaalan,Kathleen McKeown

Main category: cs.CL

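TL;DR: The paper adopts a six-component structural model of character drawn from a new literary theory, which accounts for the narrator-character distinction and for discussion by other characters, going beyond modeling centered on how often a character appears in scenes. It compares general-purpose LLMs with task-specific Transformers on 19th-century British realist novels and produces component-level and graph representations of character discussion, which support large-scale study of literary questions such as character centrality and gendered discussion.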

Details

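Motivation: Conventional character modeling for novels relies heavily on how often characters appear in scenes and overlooks key dimensions such as the narrator-character distinction and discussion among characters; inspired by a new literary theory, the paper aims for a more comprehensive, theory-driven computational model of character. Method: Using a six-component structural model of character (including new dimensions such as discussion by other characters), the authors compare general-purpose LLMs with task-specific Transformers for extracting character-discussion information from 19th-century British realist novels, producing component-level annotations and character-discussion graphs. Result: The methods yield scalable component-level and graph representations of character discussion, which are used to explore Woloch's "the one vs the many" theory of character centrality and the gendered dynamics of character discussion. Conclusion: This theory-driven structured modeling extends the reach of computational literary studies and suggests that turning fine-grained literary theory into computable representations is feasible and insightful. Abstract: Characters in novels have typically been modeled based on their presence in scenes in narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods, discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens. Specifically, we explore Woloch's classic "the one vs the many" theory of character centrality and the gendered dynamics of character discussion.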

[19] AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains

Adam Szelestey,Sofie van Engelen,Tianhao Huang,Justin Snelders,Qintao Zeng,Songgaojun Deng

Main category: cs.CL

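TL;DR: The paper introduces AdversaRiskQA, a benchmark for systematically evaluating LLM robustness to adversarial factuality attacks in health, finance, and law, along with automated evaluation methods. It finds that performance scales non-linearly with model size and differs markedly across domains, and that long-form factuality shows no significant correlation with injected misinformation.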

Details

Motivation: Existing work lacks high-quality, domain-specific resources for assessing LLM robustness to adversarial factuality attacks (deliberately misleading prompts stated with confidence), and the effect of injected misinformation on long-form factuality has not been examined. Method: The authors build AdversaRiskQA, the first verified benchmark for adversarial factuality with two difficulty levels across Health, Finance, and Law; propose two automated methods that evaluate attack success and long-form factuality, respectively; measure misinformation detection on six open- and closed-source LLMs; and run a comparative long-form factuality evaluation on Qwen3 (30B). Result: After excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 is the most consistent; performance grows non-linearly with parameter count, varies by domain, and the gap between difficulty levels narrows as models grow; the long-form evaluation finds no significant correlation between injected misinformation and the factuality of model output. Conclusion: AdversaRiskQA offers a reliable tool for identifying factuality weaknesses of LLMs in high-risk settings and supports the development and deployment of more trustworthy models. Abstract: Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating the adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domains, and gaps between difficulty levels narrow as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model's factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.

[20] Common to Whom? Regional Cultural Commonsense and LLM Bias in India

Sangmitra Madhusudan,Trush Shashank More,Steph Buongiorno,Renata Dividino,Jad Kabbara,Ali Emami

Main category: cs.CL

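TL;DR: The paper introduces Indica, the first benchmark that evaluates how well large language models (LLMs) understand cultural commonsense at the sub-national level in India. It finds that cultural commonsense is highly regional rather than nationally uniform, and that current models have low accuracy and a marked geographic bias.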

Details

Motivation: Existing cultural commonsense benchmarks treat a nation as a single unit and ignore sub-national cultural variation; the paper asks whether cultural commonsense varies by region within a nation and how well LLMs model such variation. Method: The authors build Indica, the first cultural commonsense benchmark focused on the sub-national level, covering five Indian regions (North, South, East, West, and Central) with 515 everyday questions across 8 domains and 1,630 human-annotated region-specific question-answer pairs; questions are designed from an anthropological taxonomy, data is collected per region, and the models' geographic selection bias is quantified. Result: Only 39.4% of questions reach agreement across all five regions; eight state-of-the-art LLMs reach only 13.4%-20.9% accuracy on region-specific questions and clearly favor Central and North India (over-selected by 30-40%) while under-representing East and West. Conclusion: Cultural commonsense is strongly regional, and current LLMs largely lack sub-national cultural modeling and show systematic geographic bias; the methodology can be extended to cultural commonsense evaluation in other culturally diverse nations. Abstract: Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test LLMs' ability to address this question, focusing on India - a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps: models achieve only 13.4%-20.9% accuracy on region-specific questions, and they exhibit geographic bias, over-selecting Central and North India as the "default" (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, from question design grounded in anthropological taxonomy, to regional data collection, to bias measurement.
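
The "selected 30-40% more often than expected" figure can be read as a relative over-selection rate. The sketch below shows one way such a rate could be computed; it assumes, since the abstract does not say, that the expected rate is a uniform share across the five regions. The function name and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter

REGIONS = ("North", "South", "East", "West", "Central")

def over_selection(choices, regions=REGIONS):
    """Relative over-selection per region: observed / expected - 1.

    Assumption (not from the paper): the expected count is uniform,
    i.e. every region should be picked equally often.
    """
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    expected = len(choices) / len(regions)
    return {region: counts.get(region, 0) / expected - 1.0 for region in regions}

# Made-up model choices over ten questions, for illustration only.
sample = ["Central"] * 3 + ["North"] * 3 + ["South"] * 2 + ["East"] + ["West"]
bias = over_selection(sample)
```

A positive value means the region is chosen more often than a uniform baseline would predict; a negative value means it is under-represented.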

[21] From Generation to Collaboration: Using LLMs to Edit for Empathy in Healthcare

Man Luo,Bahareh Harandizadeh,Amara Tariq,Halim Abbas,Umar Ghaffar,Christopher J Warren,Segun O. Kolade,Haidar M. Abdul-Muhsin

Main category: cs.CL

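TL;DR: The study examines whether large language models (LLMs) can act as "empathy editors" that help physicians express more empathy in patient communication while preserving medical accuracy. It introduces two new quantitative metrics, the Empathy Ranking Score and the MedFactChecking Score, and finds that LLM editing outperforms fully automatic generation.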

Details

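Motivation: Clinical empathy is essential to patient care, but physicians, constrained by cognitive and emotional load, often struggle to balance emotional warmth with precise medical information; an AI aid is needed that strengthens empathy without sacrificing factual accuracy. Method: The LLM is used as an "empathy editor" that refines the physician's original text to improve its empathetic tone; two new quantitative metrics are designed, the Empathy Ranking Score (based on agreement between human ratings and model rankings) and the MedFactChecking Score (combining rule matching with LLM-based fact verification); comparative experiments are run on real clinical responses. Result: LLM-edited physician responses score significantly higher on perceived empathy, with factual accuracy not significantly different from the original physician responses and clearly better than fully LLM-generated responses; both new metrics show good discriminative power and reliability. Conclusion: Positioning the LLM as an editing assistant rather than an autonomous generator can improve the empathetic quality of clinical communication while protecting medical credibility, a safer and more practical pattern for AI in healthcare. Abstract: Clinical empathy is essential for patient care, but physicians must continually balance emotional warmth with factual precision under the cognitive and emotional constraints of clinical practice. This study investigates how large language models (LLMs) can function as empathy editors, refining physicians' written responses to enhance empathetic tone while preserving underlying medical information. More importantly, we introduce novel quantitative metrics, an Empathy Ranking Score and a MedFactChecking Score, to systematically assess both emotional and factual quality of the responses. Experimental results show that LLM-edited responses significantly increase perceived empathy while preserving factual accuracy compared with fully LLM-generated outputs. These findings suggest that using LLMs as editorial assistants, rather than autonomous generators, offers a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare communication.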

[22] YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models

Junyu Lin,Meizhen Liu,Xiufeng Huang,Jinfeng Li,Haiwen Hong,Xiaohan Yuan,Yuefeng Chen,Longtao Huang,Hui Xue,Ranjie Duan,Zhikai Chen,Yuchuan Fu,Defeng Li,Lingyao Gao,Yitong Yang

Main category: cs.CL

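TL;DR: YuFeng-XGuard is a reasoning-centric family of guardrail models that provides fine-grained, interpretable, and configurable risk assessment of LLM interactions through structured risk predictions, natural language explanations, and a tiered inference paradigm.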

Details

Motivation: Existing guardrail solutions mostly rely on coarse-grained filtering, rapid classification, or post-hoc rules, which leads to limited transparency, rigid policies, or high inference cost, and falls short of the fine-grained, interpretable, and adaptable risk assessment that real deployments need. Method: The YuFeng-XGuard model family (1) generates structured risk predictions with risk categories, confidence scores, and natural language explanations; (2) uses a tiered inference paradigm (a fast decision from the first token plus in-depth explanation on demand); and (3) introduces a dynamic policy mechanism that decouples risk perception from policy enforcement. Result: The models reach state-of-the-art performance on multiple public safety benchmarks while balancing efficiency and efficacy; both a full-capacity and a lightweight version are released openly. Conclusion: YuFeng-XGuard offers a more fine-grained, transparent, flexible, and practical approach to LLM guardrails, moving safety mechanisms from "black-box filtering" toward understandable and adjustable reasoning-based protection. Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.

[23] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

Yangyang Zhong,Yanmei Gu,Zhengqing Zang,Xiaomeng Li,Yuqi Ding,Xibei Jia,Yuting Shen,Zhenzhong Lan,Liwang Zhu,Weiping Liu,Junlin Zhou,Haisheng Liu,Zhong Xin Yu,Pengxin Luo,Donglian Qi,Yunfeng Yan,Junbo Zhao

Main category: cs.CL

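TL;DR: The paper systematically evaluates the parallel generation and arbitrary-order decoding of masked diffusion language models (MDLMs). They still lag behind autoregressive models, mainly because parallel modeling weakens inter-token dependencies, but they show task-adaptive decoding behavior; the paper proposes a Generate-then-Edit paradigm to balance efficiency with dependency modeling.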

Details

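Motivation: To establish how far current masked diffusion language models (MDLMs) actually realize parallel generation and arbitrary-order decoding, and to clarify their performance bottlenecks and behavioral mechanisms. Method: The authors use two measures, Average Finalization Parallelism (AFP) and Kendall's tau, to quantify parallelism strength and generation order; evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming; and combine empirical analysis with theoretical motivation to propose the Generate-then-Edit paradigm. Result: MDLMs remain weaker overall than comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies; their parallelism and generation order vary with task domain, reasoning stage, and output correctness; on tasks that need "backward information" (e.g., Sudoku), they show an advantage by filling easier blanks first. Conclusion: MDLMs have not yet fully delivered their promised parallel and arbitrary-order capabilities but are clearly task-adaptive; the Generate-then-Edit paradigm can mitigate dependency loss and is a viable way to balance efficiency with modeling capacity. Abstract: Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions -- parallelism strength and generation order -- using Average Finalization Parallelism (AFP) and Kendall's tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require "backward information" (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.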
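
To make the generation-order measure concrete, the sketch below computes Kendall's tau between token position and the decoding step at which each position was finalized. How the paper pairs the two rankings is not stated in the abstract, so this pairing is an assumption, as are the function names; ties (several tokens finalized in the same parallel step) are counted as neither concordant nor discordant, which corresponds to the tau-a variant.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 is a tie and contributes to neither count
    return (concordant - discordant) / (n * (n - 1) / 2)

def generation_order_tau(finalize_step):
    """Compare finalization steps with left-to-right positions.

    finalize_step[k] is the (hypothetical) decoding step at which
    position k received its final token.
    """
    positions = list(range(len(finalize_step)))
    return kendall_tau(positions, finalize_step)

left_to_right = generation_order_tau([0, 1, 2, 3])   # strictly sequential
right_to_left = generation_order_tau([3, 2, 1, 0])   # fully reversed
two_per_step = generation_order_tau([0, 0, 1, 1])    # two tokens per step
```

Under this reading, a value near 1 indicates near-autoregressive left-to-right decoding, a value near -1 indicates right-to-left decoding, and intermediate values reflect parallel or mixed orders.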

[24] ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms

Baktash Ansari,Shiza Ali,Elias Martin,Maryna Sivachenko,Afra Mashhadi

Main category: cs.CL

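TL;DR: The paper presents ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers to detect toxic behavior on Twitch more accurately, with a clear gain when emotes are included.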

Details

Motivation: Chat on live-streaming platforms such as Twitch is fast-paced, high-volume, and context-rich, so human annotation and keyword filtering are hard to scale, and human moderators themselves face harassment; signals such as emotes are also essential for understanding toxic behavior. Method: ToxiTwitch is a hybrid model that uses LLMs such as DeepSeek-R1-Distill and Llama-3-8B-Instruct to extract joint embeddings of text and emotes, which are then fed to traditional classifiers such as Random Forest and SVM; channel-specific training and comparative experiments are carried out. Result: Under channel-specific training, ToxiTwitch reaches up to 80% accuracy (13% higher than BERT) with an F1-score of 76%; the experiments show that incorporating emotes improves toxicity detection. Conclusion: Emote-aware toxicity detection on Twitch is promising but comes with challenges and limits; the work provides an exploratory baseline and open questions for follow-up research. Abstract: The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.

[25] Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation

Zhiyao Ren,Yibing Zhan,Siyuan Liang,Guozheng Ma,Baosheng Yu,Dacheng Tao

Main category: cs.CL

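TL;DR: The paper proposes the first benchmark for evaluating the confidence of large language models in realistic multi-turn medical consultations and uses it to expose the limits of existing confidence estimation methods in medical settings. It then presents MedConf, a framework that uses evidence-grounded linguistic self-assessment to make diagnostic confidence more reliable and interpretable.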

Details

Motivation: Existing studies mostly evaluate LLM confidence in single-turn, static settings and overlook how confidence and correctness are coupled as clinical evidence accumulates, which limits their support for reliable clinical decisions. Method: The authors build the first confidence evaluation benchmark for multi-turn medical consultation, unifying three types of medical data for open-ended diagnosis and introducing an information sufficiency gradient; they propose MedConf, which builds symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates these by weighted integration into an interpretable confidence estimate. Result: Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on AUROC and Pearson correlation coefficient and stays stable under insufficient information and multimorbidity. Conclusion: Information sufficiency is a key factor in building credible medical confidence models, and MedConf offers a new path toward more reliable and interpretable large medical models. Abstract: Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.

[26] What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking

Raymond Xiong,Furong Jia,Lionel Wong,Monica Agrawal

Main category: cs.CL

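TL;DR: The paper builds a medical question-answering dataset from questions that real patients ask and shows that current large language models are seriously deficient at spotting false assumptions in everyday medical questions.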

Details

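Motivation: Existing evaluations of medical LLMs are mostly based on medical exam questions, which differ markedly in style and content from the questions patients actually ask, so a benchmark built on real patient questions is missing. Method: The authors query Google's "People Also Ask" feature for the top 200 prescribed medications in the United States to build a dataset of medical questions reflecting real patient concerns, and analyze how false assumptions and dangerous intentions arise in it. Result: A considerable share of patient questions contain false assumptions and dangerous intentions, and these corrupted questions do not appear at random: their appearance depends heavily on how incorrect the preceding questions were; LLMs that perform strongly on other benchmarks struggle to identify false assumptions in everyday questions. Conclusion: Current LLMs have a key capability gap in medical question answering for real patients; evaluation benchmarks closer to clinical reality are needed, along with better detection and correction of false premises. Abstract: Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life. To bridge this gap, we sourced data from Google's People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in everyday questions.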

[27] Persona Switch: Mixing Distinct Perspectives in Decoding Time

Junseok Kim,Nakyeong Yang,Kyomin Jung

Main category: cs.CL

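TL;DR: The paper proposes Persona Switch, a decoding method that dynamically combines the strengths of zero-shot prompting and role-play prompting by choosing the more confident output at each step, improving the reasoning performance of language models.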

Details

Motivation: Role-play prompting can improve the zero-shot reasoning of language models, but its effect is inconsistent across tasks and instances, which suggests that the two prompting strategies are complementary rather than one being strictly better. Method: At each decoding step, Persona Switch chooses between the zero-shot and role-play outputs according to output confidence, measured by the logit gap. Result: Experiments on several widely used LLMs show that Persona Switch consistently outperforms strong baselines, with up to a 5.13% accuracy improvement, and that output confidence is a reliable selection signal. Conclusion: Zero-shot and role-play prompting are complementary; combining them dynamically clearly improves reasoning performance, and output confidence is an effective basis for the combination. Abstract: Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely-used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
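
The selection rule in the abstract can be sketched in a few lines. This is an illustration under two assumptions the abstract does not confirm: a "step" is taken to be a single token position, and the logit gap is taken to be the difference between the largest and second-largest logit. The function names and the toy logits are hypothetical.

```python
def logit_gap(logits):
    """Confidence proxy: largest logit minus second-largest logit (assumed definition)."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2

def select_step(zero_shot_logits, role_play_logits):
    """Pick the prompt variant with the larger logit gap and return its argmax token.

    Ties go to the zero-shot variant (an arbitrary choice for this sketch).
    """
    if logit_gap(role_play_logits) > logit_gap(zero_shot_logits):
        source, logits = "role_play", role_play_logits
    else:
        source, logits = "zero_shot", zero_shot_logits
    token_id = max(range(len(logits)), key=logits.__getitem__)
    return source, token_id

# Toy three-token vocabulary: the role-play logits are far more peaked.
choice = select_step([2.0, 1.5, 0.1], [0.2, 3.0, 0.5])
tie = select_step([1.0, 0.0], [0.0, 1.0])
```

In a real decoder the two logit vectors would come from the same model run on the two prompt variants, with the chosen token appended to both contexts before the next step; that loop is omitted here.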

[28] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Zhitao He,Zongwei Lyu,Yi R Fung

Main category: cs.CL

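TL;DR: The paper proposes RebuttalAgent, the first framework to bring Theory of Mind (ToM) to academic rebuttal. Its TSR pipeline models the reviewer's mental state, formulates a persuasion strategy, and generates a strategy-grounded response. The authors build the RebuttalBench dataset and the Rebuttal-RM evaluator and train with supervised fine-tuning followed by self-reward reinforcement learning; the agent clearly outperforms the base model and advanced proprietary models in both automated and human evaluation.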

Details

Motivation: Academic rebuttal is a key challenge in the research workflow that AI has not yet handled well, because it is strategic communication under information asymmetry rather than a simple technical debate; existing approaches only imitate surface-level language and lack the necessary perspective-taking. Method: The authors propose the three-stage ToM-Strategy-Response (TSR) framework; build RebuttalBench, a large-scale synthetic dataset created with a critique-and-refine approach; train in two stages, first supervised fine-tuning to acquire ToM-based analysis and strategic planning, then self-reward reinforcement learning for scalable self-improvement; and develop Rebuttal-RM, a dedicated evaluator trained on over 100K multi-source rebuttal samples. Result: RebuttalAgent outperforms the base model by an average of 18.3% on automated metrics and surpasses advanced proprietary models in both automated and human evaluation; Rebuttal-RM's scoring consistency with human preferences exceeds that of GPT-4.1. Conclusion: Systematically integrating Theory of Mind into academic rebuttal is feasible and effective, and RebuttalAgent offers a new pattern for AI-assisted research communication; the generated content is for reference only and does not replace the authors' own critical thinking and response. Abstract: Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing the powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author's own critical analysis and response.

[29] Hallucination Mitigating for Medical Report Generation

Ruoqing Zhao,Runze Xia,Piji Li

Main category: cs.CL

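TL;DR: The paper proposes KERM, a framework that uses knowledge enhancement and fine-grained reinforced rewards to reduce hallucinations of large vision language models in medical report generation and to improve report quality.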

Details

Motivation: Large vision language models (LVLMs) are prone to hallucination in medical report generation, producing content that sounds plausible but is inaccurate, which is especially dangerous in medicine. Method: The approach uses MedCLIP for knowledge retrieval, adds a purification module that keeps the retrieved knowledge relevant to the patient's clinical context, and applies fine-grained reinforced rewards, thereby refining the LVLM input and guiding the model toward more accurate, clinically relevant reports. Result: Experiments on the IU-Xray and MIMIC-CXR datasets show that the method mitigates hallucination and improves report quality. Conclusion: Through knowledge enhancement and reinforcement learning, KERM improves the reliability and accuracy of LVLMs on medical report generation. Abstract: In the realm of medical report generation (MRG), the integration of natural language processing has emerged as a vital tool to alleviate the workload of radiologists. Despite the impressive capabilities demonstrated by large vision language models (LVLMs) in understanding natural language, their susceptibility to generating plausible yet inaccurate claims, known as "hallucinations", raises concerns, especially in the nuanced and critical field of medicine. In this work, we introduce a framework, Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM), to tackle the issue. Our approach refines the input to the LVLM by first utilizing MedCLIP for knowledge retrieval, incorporating relevant lesion fact sentences from a curated knowledge corpus. We then introduce a novel purification module to ensure the retrieved knowledge is contextually relevant to the patient's clinical context. Subsequently, we employ fine-grained rewards to guide these models in generating highly supportive and clinically relevant descriptions, ensuring the alignment of model's outputs with desired behaviors. Experimental results on IU-Xray and MIMIC-CXR datasets validate the effectiveness of our approach in mitigating hallucinations and enhancing report quality.

[30] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs

Tristan Williams,Franziska Weeber,Sebastian Padó,Alan Akbik

Main category: cs.CL

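TL;DR: The paper proposes a new framework for evaluating how representative value-aligned large language models are, arguing that a model should match not only marginal response distributions but also multivariate correlation patterns. It finds that existing methods such as persona prompting and demographic fine-tuning still fall short at the structural level.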

Details

Motivation: Existing work focuses mainly on the marginal response distributions of aligned models and overlooks the latent structure of real populations and the deeper associations that cultural values theories rely on. Method: The authors propose an evaluation framework that combines multivariate correlation patterns with marginal distributions, and compare two steering techniques, persona prompting and demographic fine-tuning, against World Values Survey (WVS) data. Result: Demographic fine-tuning fits the marginal distributions better than persona prompting, but neither technique fully recovers the gold-standard correlation patterns in human responses. Conclusion: Representativeness is a distinct and important aspect of value alignment; evaluation based on marginals alone can hide structural failures and lead to overly optimistic conclusions about model capabilities. Abstract: Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We show the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model better approximates marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.
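
A toy example helps show why matching marginals is not enough. In the sketch below, a hypothetical "model" reproduces the human marginal distribution of both survey items exactly, yet flips the correlation between them. The data and function names are made up for illustration and are unrelated to the World Values Survey responses used in the paper.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Four respondents answering two survey items on a 1-2 scale (made-up data).
human_item_a = [1, 1, 2, 2]
human_item_b = [1, 1, 2, 2]   # humans: the two items move together
model_item_a = [1, 1, 2, 2]
model_item_b = [2, 2, 1, 1]   # model: same answer counts, opposite pairing

same_marginals = (
    Counter(human_item_a) == Counter(model_item_a)
    and Counter(human_item_b) == Counter(model_item_b)
)
human_corr = pearson(human_item_a, human_item_b)
model_corr = pearson(model_item_a, model_item_b)
```

An evaluation that only compares per-item answer counts would rate this model as a perfect match, while a comparison of correlation patterns exposes the structural mismatch.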

[31] HumanLLM: Towards Personalized Understanding and Simulation of Human Nature

Yuxuan Lei,Tianfu Wang,Jianxun Lian,Zhengyu Hu,Defu Lian,Xing Xie

Main category: cs.CL

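TL;DR: The paper presents HumanLLM, a foundation model designed for personalized understanding and simulation of individual behavior. By building the Cognitive Genome dataset of 5.5 million user logs and applying supervised fine-tuning, it substantially improves the prediction and simulation of user behavior, thinking, and writing style.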

Details

Motivation: Current large language models are limited in simulating human behavior because their pretraining data lacks the continuous, situated context of an individual's decisions and thinking, which makes them hard to use for social science research and personalized applications. Method: The authors build the Cognitive Genome dataset, a large-scale corpus of user behavior from platforms such as Reddit and Twitter (5.5M logs) produced by multi-stage filtering and synthesis; they design diverse learning tasks and apply supervised fine-tuning so the model can predict individualized behaviors, thoughts, and experiences. Result: HumanLLM outperforms base models at predicting user actions and inner thoughts, mimicking writing styles and preferences, and generating user profiles, and generalizes better on out-of-domain social intelligence benchmarks. Conclusion: HumanLLM shows that real, situated, time-ordered individual data improves the social simulation ability of large models and offers a new pattern for social science research and personalized AI applications. Abstract: Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior--a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.

[32] SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics

Silvia Casola,Ryan Soh-Eun Shim,Felicia Körner,Yuchen Mao,Barbara Plank

Main category: cs.CL

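TL;DR: The paper addresses the weak correlation between multilingual neural evaluation metrics and human judgments in many languages. It proposes steering the metric model's activations at test time toward English as an internal pivot language, and finds the approach effective across diverse languages and across encoder- and decoder-based metrics.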

Details

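Motivation: Multilingual language models often use English as an internal pivot language, and misalignment with this pivot can degrade performance; meanwhile, the shortage of accurate and robust evaluation metrics for multilingual generation holds back research. Method: The authors intervene at test time on the activations of multilingual neural metrics (both encoder- and decoder-based), steering them toward the English pivot to improve correlation with human judgments. Result: Test-time intervention improves the correlation of the metrics with human judgments across diverse languages, and the effect holds across the board. Conclusion: Aligning the internal representations of multilingual neural metrics with an English pivot strengthens their cross-lingual evaluation ability and suggests a new route to more reliable multilingual NLG evaluation. Abstract: An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.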

[33] ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection

Guoxuan Ding,Yuqing Li,Ziyan Zhou,Zheng Lin,Daren Zha,Jiangnan Li

Main category: cs.CL

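TL;DR: The paper proposes ExDR, a framework that improves multimodal fake news detection through explanation-driven dynamic retrieval-augmented generation and outperforms previous methods.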

Details

Motivation: Existing dynamic retrieval-augmented generation methods suffer from redundant retrieval, coarse similarity, and irrelevant evidence when applied to multimodal fake news. Method: ExDR uses model-generated explanations to guide both retrieval triggering and evidence retrieval: it assesses triggering confidence along three dimensions, builds entity-aware indices that fuse deceptive entities, and retrieves contrastive evidence based on deception-specific features. Result: On the two benchmark datasets AMG and MR2, ExDR outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance. Conclusion: The explanation-driven mechanism improves the accuracy and generalization of multimodal fake news detection. Abstract: The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.

[34] Can professional translators identify machine-generated text?

Michael Farrell

Main category: cs.CL

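TL;DR: This study examines whether professional translators without specialized training can reliably identify Italian short stories generated by AI. In the experiment, about 16.2% of translators identified the AI texts accurately, relying mainly on low burstiness and narrative contradiction; a similar proportion misclassified them, often because they preferred the AI texts or relied on misleading cues such as grammatical accuracy.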

Details

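Motivation: To examine how well professional translators can identify AI-generated text without specialized training, and to assess the practical feasibility of AI-text detection and the cognitive biases involved. Method: Sixty-nine professional translators took part in an in-person experiment in which they rated the likelihood of AI authorship for three anonymized short stories (two generated by ChatGPT-4o, one written by a human author) and justified their ratings; statistical analysis was combined with qualitative coding to identify the basis for their judgments. Result: 16.2% of translators correctly identified the AI texts at a rate significantly above chance, relying mainly on low burstiness and narrative contradiction; a similar proportion misclassified the texts; grammatical accuracy and emotional tone frequently led to misclassification; calques, semantic loans, and syntactic transfer from English into Italian were also reported as cues. Conclusion: Professional translators differ markedly in their ability to identify AI text: some can analyze linguistic features, but overall they are easily swayed by subjective preference and misleading surface features; the results call for rethinking the role and scope of synthetic-text editing in professional contexts. Abstract: This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.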

[35] Determinants of Training Corpus Size for Clinical Text Classification

Jaya Chaturvedi,Saniya Deshpande,Chenkai Ma,Robert Cobb,Angus Roberts,Robert Stewart,Daniel Stahl,Diana Shamsutdinova

Main category: cs.CL

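TL;DR: This study examines how training data size and vocabulary properties affect model performance in clinical text classification, finding that 600 documents suffice for near-optimal performance and quantifying how the numbers of strong and noisy predictive words relate to learning-curve steepness and accuracy.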

Details

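Motivation: Clinical text classification typically relies on 200-500 annotated documents, but this sample size lacks justification and has not been related to the vocabulary properties of the text; the sample size requirements and their relationship to vocabulary properties need to be clarified. Method: Using the public MIMIC-III dataset, pre-trained BERT embeddings followed by Random Forest classifiers were applied to 10 ICD-9 diagnosis tasks while the training set size was varied systematically (100 to 10,000 documents); Lasso logistic regression on bag-of-words embeddings was used to identify strong and noisy predictive words and so characterize vocabulary properties. Result: Learning curves differed significantly across the 10 tasks; 600 documents reached 95% of the performance obtained with 10,000; vocabulary analysis showed that more strong predictors and fewer noisy predictors were associated with steeper learning curves; every 100 additional noisy words decreased accuracy by about 0.02, and every 100 additional strong predictors increased maximum accuracy by about 0.04. Conclusion: Training sample size requirements depend heavily on vocabulary properties, and the counts of strong and noisy predictive words can serve as key indicators for choosing a sample size, giving a data-driven basis for allocating annotation resources in clinical NLP. Abstract: Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. For that, 200-500 documents are typically annotated. The number is constrained by time and costs and lacks justification of the sample size requirements and their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient to achieve 95% of the performance attainable with 10,000 documents for all tasks. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves, where every 100 additional noisy words decreased accuracy by approximately 0.02 while 100 additional strong predictors increased maximum accuracy by approximately 0.04.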
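The "600 documents reach 95% of the 10,000-document performance" criterion can be read off a learning curve mechanically. The sketch below is one way to do that, assuming the curve is stored as a mapping from training size to score; the function name `smallest_sufficient_size` and the example values are hypothetical and are not taken from the paper.

```python
def smallest_sufficient_size(curve, fraction=0.95):
    """Return the smallest training size whose score reaches `fraction`
    of the score obtained at the largest size in `curve`.

    `curve` maps training-set size -> evaluation score (higher is better).
    """
    largest = max(curve)
    target = fraction * curve[largest]
    for size in sorted(curve):
        if curve[size] >= target:
            return size
    return largest


# Made-up learning curve for illustration only (not values from the paper).
example_curve = {100: 0.70, 300: 0.78, 600: 0.84, 1000: 0.86, 10000: 0.88}
print(smallest_sufficient_size(example_curve))
```

With these made-up scores the target is 0.95 × 0.88 = 0.836, so the first size that qualifies is 600.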

[36] Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers

Francisco Portillo López

Main category: cs.CL

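TL;DR: This study tests the bio-fidelity of AV-HuBERT's audiovisual perception using the McGurk effect, finding that its auditory dominance rate closely matches that of humans but that it is overly deterministic in its tendency toward phonetic fusion, lacking the stochasticity and diversity of human perception.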

Details

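Motivation: To evaluate how well AV-HuBERT simulates human perceptual behavior (specifically the McGurk effect) in audiovisual integration, and whether it exhibits perceptual fidelity at the biological level. Method: With 44 human participants as the baseline, AV-HuBERT's responses to incongruent audiovisual stimuli (McGurk stimuli) were tested, and auditory dominance rates, phonetic fusion rates, and error patterns were compared quantitatively. Result: AV-HuBERT nearly matched humans in auditory dominance rate (32.0% vs. 31.8%) but showed a significantly higher phonetic fusion rate (68.0% vs. 47.7%) and lacked the perceptual stochasticity and diverse error distribution that humans displayed. Conclusion: Current self-supervised audiovisual models can reproduce multisensory integration outcomes but do not yet model the neural variability and probabilistic character inherent to human speech perception. Abstract: This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.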

[37] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

Chenghao Fan,Wen Heng,Bo Li,Sichen Liu,Yuxuan Song,Jing Su,Xiaoye Qu,Kai Shen,Wei Wei

Main category: cs.CL

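TL;DR: This paper proposes Stable-DiffCoder, a block diffusion code language model that outperforms its AR baseline under the same architecture and data, and improves code modeling and structured editing performance through an improved continual pretraining strategy.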

Details

Motivation: Existing diffusion-based code language models (DLLMs) still lag behind strong autoregressive (AR) baselines under comparable budgets, so the setting needs to be revisited and their performance improved. Method: Stable-DiffCoder reuses the Seed-Coder architecture, data, and training pipeline, and adds a block diffusion continual pretraining (CPT) stage with a tailored warmup and a block-wise clipped noise schedule for efficient knowledge learning and stable training. Result: It outperforms the identically configured AR model overall on a broad suite of code benchmarks; using only CPT and supervised fine-tuning, it surpasses a wide range of ~8B AR models and DLLMs; it also performs better on structured code editing, reasoning, and low-resource programming languages. Conclusion: Diffusion-based training can substantially improve code modeling quality, with particular advantages in any-order modeling, structured tasks, and data augmentation. Abstract: Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. In addition, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.

[38] Transfer Learning from ImageNet for MEG-Based Decoding of Imagined Speech

Soufiane Jhilal,Stéphanie Martin,Anne-Lise Giraud

Main category: cs.CL

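TL;DR: This paper proposes an image-based approach that converts magnetoencephalography (MEG) signals into time-frequency image representations and decodes imagined speech with ImageNet-pretrained vision models, significantly improving classification accuracy and revealing neural representations shared across subjects.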

Details

Motivation: Non-invasive decoding of imagined speech is challenged by weak, distributed signals and limited labeled data. Method: MEG signals from 21 participants are projected into three spatial scalogram mixtures by a learnable sensor-space convolution, producing image-like inputs that are fed to ImageNet-pretrained vision models for decoding. Result: Balanced accuracy reached 90.4% for imagery vs. silence, 81.0% for imagery vs. silent reading, and 60.6% for vowel decoding; cross-subject evaluation confirmed that pretrained models capture shared neural representations; temporal analyses localized the discriminative information to imagery-locked intervals. Conclusion: Pretrained vision models applied to image-based MEG representations can capture the structure of imagined speech in non-invasive neural signals. Abstract: Non-invasive decoding of imagined speech remains challenging due to weak, distributed signals and limited labeled data. Our paper introduces an image-based approach that transforms magnetoencephalography (MEG) signals into time-frequency representations compatible with pretrained vision models. MEG data from 21 participants performing imagined speech tasks were projected into three spatial scalogram mixtures via a learnable sensor-space convolution, producing compact image-like inputs for ImageNet-pretrained vision architectures. These models outperformed classical and non-pretrained models, achieving up to 90.4% balanced accuracy for imagery vs. silence, 81.0% vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation confirmed that pretrained models capture shared neural representations, and temporal analyses localized discriminative information to imagery-locked intervals. These findings show that pretrained vision models applied to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals.

[39] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

Özgür Uğur,Mahmut Göksu,Mahmut Çimen,Musa Yılmaz,Esra Şavirdi,Alp Talha Demir,Rumeysa Güllüce,İclal Çetin,Ömer Can Sağbaş

Main category: cs.CL

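TL;DR: This paper presents the Mecellem model framework, which uses domain adaptation strategies to develop specialized language models for the Turkish legal domain, comprising encoder models pre-trained from scratch and decoder models adapted through continual pre-training (CPT), improving retrieval performance and domain adaptation while reducing computational cost.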

Details

Motivation: State-of-the-art language models for the legal domain rely on multi-stage, computationally intensive training pipelines, and there is no efficient, low-cost specialized model for Turkish legal text. Method: (1) Pre-train ModernBERT-based bidirectional encoders from scratch on a Turkish-dominant corpus of 112.7B tokens, with a checkpoint selection strategy based on downstream retrieval performance; (2) apply four-phase continual pre-training (CPT) with controlled curriculum learning to the Qwen3-1.7B/4B decoders, moving gradually from general language to Turkish legal terminology and long-context reasoning. Result: The encoder models rank in the top 3 on the Turkish retrieval leaderboard, with the small 155M model performing comparably to larger 307M-567M models and reaching 92.36% production efficiency (ranked fourth); the decoder models achieve a 36.2% perplexity reduction on Turkish legal text. Conclusion: The encoder approach of single-stage pre-training followed by lightweight post-training, together with the curriculum-driven CPT decoder approach, forms a cost-effective, high-performing recipe for building Turkish legal-domain language models. Abstract: This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring less computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.

[40] Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction

Tony Cristofano

Main category: cs.CL

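TL;DR: This paper proposes a method for transferring refusal interventions across models, providing evidence that refusal behavior in large language models stems from a universal low-dimensional semantic circuit that is semantically consistent across models.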

Details

Motivation: Refusal behavior is often viewed as model-specific, but the author hypothesizes that it stems from a universal, low-dimensional semantic circuit shared across models. Method: The Trajectory Replay via Concept-Basis Reconstruction framework aligns layers via concept fingerprints, reconstructs refusal directions from shared concept atoms, and adds a weight-SVD stability guard to avoid damaging model capabilities. Result: Across 8 model pairs (including GPT-OSS-20B and GLM-4), refusal interventions transferred successfully, attenuating refusal while preserving existing capabilities. Conclusion: The experiments provide evidence for the semantic universality of safety alignment: refusal behavior is driven by a low-dimensional semantic structure shared across models. Abstract: Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
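The abstract describes the stability guard only as projecting interventions away from high-variance weight subspaces. The sketch below illustrates that general idea, not the paper's implementation: it assumes the protected subspace is already available as an orthonormal basis (for instance the top singular vectors of a weight matrix, computed elsewhere) and removes the intervention's components along it. The names `project_away`, `protected`, and `intervention` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def project_away(direction, basis):
    """Remove from `direction` its components along each vector in `basis`.

    `basis` is assumed to be orthonormal (e.g. top singular vectors of a
    weight matrix); the result is orthogonal to every basis vector.
    """
    result = list(direction)
    for b in basis:
        coeff = dot(result, b)
        result = [r - coeff * x for r, x in zip(result, b)]
    return result


# Toy example: protect the first coordinate axis.
protected = [[1.0, 0.0, 0.0]]
intervention = [0.5, 2.0, -1.0]
guarded = project_away(intervention, protected)
print(guarded)
```

After the projection the guarded vector has no component along the protected axis, so an edit applied along it leaves that subspace untouched.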

[41] Adapter Fusion for Multilingual Text2Cypher with Linear and Learned Gating

Makbule Gulcin Ozsoy

Main category: cs.CL

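TL;DR: This paper proposes a scalable approach to multilingual Text2Cypher that trains language-specific LoRA adapters and combines them with a learned fusion MLP, recovering about 75% of the accuracy gains of joint multilingual fine-tuning with only a subset of the data and no full fine-tuning, and supporting efficient incremental language expansion.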

Details

Motivation: Existing Text2SQL/SPARQL/Cypher systems focus mostly on English and lack scalable, low-overhead multilingual support; the goal is to avoid repeated full fine-tuning and manual hyper-parameter tuning while staying close to the performance of joint multilingual fine-tuning. Method: Train separate LoRA adapters for English, Spanish, and Turkish and combine them by uniform linear merging or by a learned fusion MLP with dynamic gating; a new language can be added by training one more LoRA adapter and retraining the lightweight MLP. Result: The learned fusion MLP outperforms linear merging on all three languages and recovers about 75% of the accuracy gains of joint multilingual fine-tuning while needing less data. Conclusion: Learned adapter fusion is a practical alternative that balances performance, data efficiency, and scalability for multilingual Text2Cypher and supports low-cost incremental language expansion. Abstract: Large Language Models enable users to access databases through natural language interfaces with tools such as Text2SQL, Text2SPARQL, and Text2Cypher, which translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental language expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
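The difference between uniform linear merging and gated fusion can be shown on toy vectors. The sketch below is a simplified illustration under stated assumptions: each adapter's contribution is a plain list of floats, and the gate scores are fixed numbers standing in for what a learned MLP would produce per input. The function names are hypothetical and this is not the paper's architecture.

```python
import math


def uniform_merge(adapter_outputs):
    """Average the adapter outputs with equal weight (uniform linear merging)."""
    n = len(adapter_outputs)
    dim = len(adapter_outputs[0])
    return [sum(out[i] for out in adapter_outputs) / n for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_merge(adapter_outputs, gate_scores):
    """Weight each adapter output by a softmax over `gate_scores`.

    In a learned fusion setup the scores would come from a small network
    conditioned on the input; here they are supplied by the caller.
    """
    weights = softmax(gate_scores)
    dim = len(adapter_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, adapter_outputs))
        for i in range(dim)
    ]


# Toy outputs standing in for English, Spanish, and Turkish adapters.
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(uniform_merge(outputs))
print(gated_merge(outputs, [5.0, 0.0, 0.0]))
```

With equal gate scores the gated merge reduces to the uniform one; a large score for one adapter makes the merged output follow that adapter, which is the flexibility a learned gate adds.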

[42] synthocr-gen: A synthetic ocr dataset generator for low-resource languages- breaking the data barrier

Haq Nawaz Malik,Kh Mohmad Shafi,Tanveer Ahmad Reshi

Main category: cs.CL

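TL;DR: This paper presents SynthOCR-Gen, an open-source synthetic OCR dataset generator for low-resource languages that addresses the lack of annotated training data by turning Unicode text into images with realistic degradations, and releases a 600,000-sample Kashmiri OCR dataset.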

Details

Motivation: Low-resource languages such as Kashmiri, written in a Perso-Arabic script, lack large-scale annotated OCR training data and are therefore unsupported by mainstream OCR systems; building datasets manually is expensive, slow, and error-prone. Method: The SynthOCR-Gen tool comprises text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques that simulate document degradations. Result: A 600,000-sample word-level synthetic Kashmiri OCR dataset was generated and released publicly on HuggingFace to demonstrate the approach. Conclusion: SynthOCR-Gen offers a scalable, low-cost way to generate OCR data for low-resource languages, helping bring them into vision-language AI models, and is openly available to researchers worldwide. Abstract: Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
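The first two pipeline stages (word segmentation and Unicode normalization with a script purity check) can be approximated with the Python standard library. This is a minimal sketch, not SynthOCR-Gen's code: the function names are hypothetical, NFC is an assumed choice of normalization form, and "script purity" is approximated by checking that every character's Unicode name starts with `ARABIC`, which would reject legitimate joiners such as the zero-width non-joiner.

```python
import unicodedata


def normalize_and_segment(text):
    """NFC-normalize `text` and split it into whitespace-delimited words."""
    return unicodedata.normalize("NFC", text).split()


def is_script_pure(word, script_prefix="ARABIC"):
    """Return True if every character's Unicode name starts with `script_prefix`.

    Characters without a Unicode name count as impure.
    """
    return all(
        unicodedata.name(ch, "").startswith(script_prefix) for ch in word
    )


# Two Perso-Arabic words with a Latin token between them (escaped for clarity).
sample = "\u06a9\u062a\u0627\u0628 abc \u0642\u0644\u0645"
words = [w for w in normalize_and_segment(sample) if is_script_pure(w)]
print(len(words))
```

The Latin token is dropped and the two Perso-Arabic words are kept; a rendering stage would then draw each surviving word with a chosen font.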

[43] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging

Alphaeus Dmonte,Vidhi Gupta,Daniel J Perry,Mark Arehart

Main category: cs.CL

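TL;DR: This paper presents the first efficiency-focused analysis of a merging strategy for multilingual multitask models, showing that it significantly reduces training time and maintenance cost while maintaining quality.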

Details

Motivation: Fine-tuning multilingual large language models is computationally inefficient and creates a maintenance bottleneck, and prior work has not systematically evaluated the efficiency benefits of model merging. Method: An efficiency-focused analysis of the merging strategy for multilingual multitask models across three independent tasks, validated on both public and proprietary industry datasets. Result: Merging reduces initial training time by up to 50%; updating a language and re-merging reduces maintenance training costs by more than 60%. Conclusion: Model merging is an efficient, scalable approach to maintaining and updating multilingual models that is suitable for industrial use. Abstract: Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.

[44] Automatic Classification of Arabic Literature into Historical Eras

Zainab Alhathloul,Irfan Ahmad

Main category: cs.CL

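TL;DR: This paper uses neural networks and deep learning to automatically classify Arabic texts by historical period, addressing a gap in the field. Experiments on two public corpora yield F1-scores of 0.83 and 0.79 on the binary task, with multi-class performance declining as the number of classes grows.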

Details

Motivation: Arabic has changed markedly over time, yet little research has explored automatic classification of Arabic texts by period, particularly beyond poetry; historians and linguists have partitioned Arabic literature into eras, but automated methods are lacking. Method: Neural network and deep learning models are evaluated on datasets derived from two public Arabic corpora, OpenITI and APCD, in class setups ranging from binary to 15-class classification, covering both predefined historical eras and custom periodizations. Result: F1-scores on the binary task reach 0.83 on OpenITI and 0.79 on APCD; on the fine-grained tasks they fall to 0.20 (15 classes, OpenITI) and 0.18 (12 classes, APCD). Conclusion: Deep learning methods are suited to coarse-grained dating of Arabic texts, but performance drops sharply on fine-grained classification (more than 10 classes), indicating that current models capture subtle diachronic features poorly and that feature representation and period partitioning need further work. Abstract: The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of others, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated using two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
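For readers comparing the binary and 15-class figures, the sketch below shows how a macro-averaged F1-score is computed from predictions. The abstract does not say which F1 averaging the authors used, so macro averaging is an assumption here, and the era labels in the example are made up.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up binary example: one classical text is mistaken for a modern one.
truth = ["classical", "classical", "modern", "modern"]
preds = ["classical", "modern", "modern", "modern"]
print(round(macro_f1(truth, preds), 4))
```

Because every class counts equally under macro averaging, a 15-class setup with several poorly separated eras pulls the average down quickly, which is consistent with the drop the abstract reports.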

[45] LLM-in-Sandbox Elicits General Agentic Intelligence

Daixuan Cheng,Shaohan Huang,Yuxian Gu,Huatong Song,Guoxin Chen,Li Dong,Wayne Xin Zhao,Ji-Rong Wen,Furu Wei

Main category: cs.CL

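TL;DR: This paper introduces LLM-in-Sandbox, which lets large language models explore autonomously in a code sandbox and thereby exhibit general intelligence on non-code tasks; it generalizes without additional training, can be further improved by reinforcement learning that uses only non-agentic data, has been validated on multidisciplinary and long-context tasks, and is open-sourced as a Python package.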

Details

Motivation: To improve the general intelligence of large language models in non-code domains by letting them autonomously use a sandbox environment (e.g., accessing external resources, managing long contexts, executing scripts) to complete complex tasks. Method: The LLM-in-Sandbox framework combines training-free sandbox exploration with LLM-in-Sandbox-RL, a reinforcement learning method trained only on non-agentic data. Result: Robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following; efficiency is analyzed from computational and system perspectives, and the framework is open-sourced as a Python package. Conclusion: LLM-in-Sandbox shows that a sandbox environment can be an effective vehicle for eliciting general intelligence in large models, extending their capabilities without domain-specific training and offering a new paradigm for building more autonomous AI systems. Abstract: We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.

cs.CV

[46] AI-Based Culvert-Sewer Inspection

Christina Thrainer

Main category: cs.CV

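TL;DR: This paper addresses automated defect segmentation in culverts and sewer pipes of drainage systems, proposing three ways to cope with scarce annotated data: preprocessing-based data enhancement (including dynamic label injection), a new lightweight and efficient architecture called FORTRESS (combining depthwise separable convolutions, adaptive KAN, and multi-scale attention), and few-shot semantic segmentation based on a bidirectional prototypical network, all aimed at improving segmentation performance with limited annotations.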

Details

Motivation: Defect detection in culverts and sewer pipes suffers from annotated data that is hard and costly to obtain and requires domain knowledge, making large annotated datasets infeasible; efficient segmentation methods for small-sample settings are needed. Method: (1) Evaluate preprocessing strategies including traditional data augmentation and dynamic label injection; (2) propose FORTRESS, a new lightweight architecture combining depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms; (3) build a few-shot semantic segmentation model based on a bidirectional prototypical network with attention mechanisms. Result: The preprocessing strategies significantly improve IoU and F1 score; FORTRESS achieves state-of-the-art performance on the culvert and sewer pipe defect dataset while significantly reducing parameter count and computational cost; the few-shot method achieves satisfactory results with little data. Conclusion: Through three routes (data enhancement, architecture design, and few-shot learning), the thesis mitigates data scarcity in defect segmentation and improves the practicality and deployability of the models. Abstract: Culverts and sewer pipes are critical components of drainage systems, and their failure can lead to serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge. Having a large dataset for structural defect detection is therefore not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real-world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity. This can be addressed either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms. FORTRESS achieves state-of-the-art performance on the culvert sewer pipe defect dataset, while significantly reducing the number of trainable parameters, as well as its computational cost. Finally, we investigate few-shot semantic segmentation and its applicability to defect detection. Few-shot learning aims to train models with only limited data available. By employing a bidirectional prototypical network with attention mechanisms, the model learns richer feature representations and achieves satisfactory results across evaluation metrics.

[47] Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition

Hatef Otroshi Shahreza,Anjith George,Sébastien Marcel

Main category: cs.CV

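TL;DR: This paper systematically evaluates multimodal large language models (MLLMs) on heterogeneous face recognition (HFR) across cross-modality scenarios including VIS-NIR, VIS-SWIR, and VIS-THERMAL, finding that MLLMs still fall well short of classical face recognition systems under cross-spectral conditions.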

Details

Motivation: To explore the applicability and potential of multimodal large language models (MLLMs) for heterogeneous face recognition (HFR), a challenging biometric task. Method: Multiple open-source MLLMs are benchmarked systematically on cross-modality face matching scenarios (VIS-NIR, VIS-SWIR, VIS-THERMAL) using biometric protocols and metrics such as Acquire Rate, EER, and TAR. Result: MLLMs perform far below classical face recognition systems across cross-modality HFR tasks, with the gap especially large under challenging cross-spectral conditions. Conclusion: Current MLLMs are not yet suitable for practical HFR applications; rigorous biometric evaluation is needed before considering their deployment in face recognition systems. Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images are from different sensing modalities, including visual (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal camera. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and based on different metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, in spite of recent advances in MLLMs. Our findings highlight the limitations of current MLLMs for HFR and also the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
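Of the metrics listed, the Equal Error Rate is the least self-explanatory: it is the operating point at which the false accept rate equals the false reject rate. The sketch below approximates it by scanning thresholds over the observed scores. It assumes higher scores mean a better match and that a comparison is accepted when the score is at or above the threshold; the function name and the scores are hypothetical, and this is not the evaluation code used in the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by scanning a threshold over every observed score.

    `genuine` holds scores for same-identity pairs, `impostor` for
    different-identity pairs. Returns the mean of FAR and FRR at the
    threshold where the two rates are closest.
    """
    best_gap, eer = None, None
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < thr for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up similarity scores with one overlapping pair on each side.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine_scores, impostor_scores))
```

In this toy case one genuine pair scores below one impostor pair, so at the balancing threshold a quarter of each set is misclassified and the EER is 0.25.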

[48] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

Pablo Messina,Andrés Villa,Juan León Alcázar,Karen Sánchez,Carlos Hinojosa,Denis Parra,Álvaro Soto,Bernard Ghanem

Main category: cs.CV

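TL;DR: CURE is an error-aware curriculum learning framework that needs no additional data and uses dynamically adjusted sampling to improve the visual grounding accuracy and factual consistency of medical vision-language models for radiology report generation.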

Details

Motivation: Existing medical vision-language models for radiology report generation suffer from misalignment between visual evidence and text and from factual inconsistency (e.g., hallucination), which makes their predictions unreliable. Method: Using public datasets, CURE fine-tunes a multimodal instructional model on three tasks: phrase grounding, grounded report generation, and anatomy-grounded report generation; a performance-driven dynamic sampling strategy emphasizes harder samples to strengthen spatial and textual alignment. Result: CURE improves grounding accuracy by +0.37 IoU, report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. Conclusion: CURE is a data-efficient method that improves the visual grounding accuracy and reliability of medical report generation. Abstract: Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
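Two ideas in this summary lend themselves to a small illustration: the IoU used to score grounding, and sampling that favors whatever the model currently does worst. The sketch below assumes axis-aligned bounding boxes and a simple "weight proportional to error" rule; both are illustrative assumptions rather than CURE's actual implementation (which is in the linked repository), and the function and task names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sampling_weights(task_scores):
    """Turn per-task scores in [0, 1] into sampling weights that sum to 1.

    Each weight is proportional to the task's error (1 - score), so tasks
    the model handles poorly are sampled more often.
    """
    errors = {task: 1.0 - score for task, score in task_scores.items()}
    total = sum(errors.values())
    if total == 0:
        return {task: 1.0 / len(errors) for task in errors}
    return {task: err / total for task, err in errors.items()}


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(sampling_weights({"phrase_grounding": 0.5, "report_generation": 0.75}))
```

Recomputing the weights after each evaluation round gives a curriculum that shifts toward whichever task or sample group is lagging.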

[49] DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction

Cuong Tran Van,Trong-Thang Pham,Ngoc-Son Nguyen,Duy Minh Ho Nguyen,Ngan Le

Main category: cs.CV

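TL;DR: This paper proposes DuFal, a framework whose dual-path frequency-domain and spatial-domain architecture improves the recovery of high-frequency anatomical detail in sparse-view cone-beam CT reconstruction.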

Details

Motivation: Sparse-view cone-beam CT reconstruction struggles to recover fine anatomical structures, which correspond to high-frequency components, because the X-ray projections are severely undersampled; conventional CNN methods are biased toward low-frequency information, which limits their performance. Method: DuFal (Dual-Frequency-Aware Learning) comprises a High-Local Factorized Fourier Neural Operator (with global and local high-frequency enhanced branches), Spectral-Channel Factorization to reduce the parameter count, a Cross-Attention Frequency Fusion module, and a Feature Decoder followed by an Intensity Field Decoding pipeline. Result: On the LUNA16 and ToothFairy datasets, DuFal significantly outperforms existing state-of-the-art methods under extremely sparse-view settings, particularly in preserving high-frequency anatomical features. Conclusion: Dual-frequency-aware learning mitigates the loss of high-frequency information in sparse-view CT reconstruction, and the modular design balances modeling capacity with computational efficiency. Abstract: Sparse-view Cone-Beam Computed Tomography reconstruction from limited X-ray projections remains a challenging problem in medical imaging due to the inherent undersampling of fine-grained anatomical details, which correspond to high-frequency components. Conventional CNN-based methods often struggle to recover these fine structures, as they are typically biased toward learning low-frequency information. To address this challenge, this paper presents DuFal (Dual-Frequency-Aware Learning), a novel framework that integrates frequency-domain and spatial-domain processing via a dual-path architecture. The core innovation lies in our High-Local Factorized Fourier Neural Operator, which comprises two complementary branches: a Global High-Frequency Enhanced Fourier Neural Operator that captures global frequency patterns and a Local High-Frequency Enhanced Fourier Neural Operator that processes spatially partitioned patches to preserve spatial locality that might be lost in global frequency analysis. To improve efficiency, we design a Spectral-Channel Factorization scheme that reduces the Fourier Neural Operator parameter count. We also design a Cross-Attention Frequency Fusion module to integrate spatial and frequency features effectively. The fused features are then decoded through a Feature Decoder to produce projection representations, which are subsequently processed through an Intensity Field Decoding pipeline to reconstruct a final Computed Tomography volume. Experimental results on the LUNA16 and ToothFairy datasets demonstrate that DuFal significantly outperforms existing state-of-the-art methods in preserving high-frequency anatomical features, particularly under extremely sparse-view settings.

[50] DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection

Morteza Poudineh,Marc Lalonde

Main category: cs.CV

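TL;DR: This paper proposes a deviation-guided prompt learning framework for few-normal shot anomaly detection (FNSAD) that improves localization of anomalous regions through learnable prompts and a deviation-based scoring mechanism.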

Details

Motivation: Existing few-shot anomaly detection methods discriminate weakly between normal and abnormal prompts and lack a principled patch-level anomaly scoring mechanism. Method: Fixed prompt prefixes are replaced with learnable context vectors, and anomaly-specific suffixes are added; a deviation loss with Top-K Multiple Instance Learning models patch features as Gaussian deviations from the normal distribution. Result: On the MVTecAD and VISA datasets, the method achieves better pixel-level detection performance than PromptAD and other baselines; ablation studies validate the learnable prompts, deviation-based scoring, and Top-K MIL strategy. Conclusion: The framework combines the semantic power of vision-language models with the reliability of statistical deviation modeling, improving discriminability, localization accuracy, and interpretability in the few-shot setting. Abstract: Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
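The scoring idea (treat patch scores as deviations from a normal reference distribution, then aggregate the K most deviant patches) can be sketched in a few lines. This is an illustration of deviation scoring with Top-K pooling under simplifying assumptions, not the paper's loss: patch-level anomaly scores are taken as given scalars, the reference distribution is estimated from scores of normal patches, and the function name and numbers are hypothetical.

```python
import statistics


def topk_deviation_score(patch_scores, reference_scores, k=3):
    """Image-level anomaly score as the mean of the top-k patch deviations.

    Each patch score is standardized against the mean and (population)
    standard deviation of `reference_scores`, which are assumed to come
    from normal patches.
    """
    mu = statistics.mean(reference_scores)
    sigma = statistics.pstdev(reference_scores) or 1.0  # guard against zero spread
    deviations = [(s - mu) / sigma for s in patch_scores]
    top = sorted(deviations, reverse=True)[:k]
    return sum(top) / len(top)


# Made-up scores: the reference set has mean 1.0 and standard deviation 1.0.
reference = [0.0, 2.0]
patches = [1.0, 3.0, 5.0, 0.0]
print(topk_deviation_score(patches, reference, k=2))
```

Pooling only the top K deviations lets a few strongly abnormal patches drive the image-level score, which suits defects that occupy a small region.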

[51] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events

Yunshan Qi,Lin Zhu,Nan Bao,Yifan Zhao,Jia Li

Main category: cs.CV

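TL;DR: This paper proposes a unified NeRF framework grounded in sensor physics that synthesizes sharp, high dynamic range (HDR) novel views from single-exposure blurry LDR images and the corresponding event data.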

Details

Motivation: Existing methods that use event data for novel view synthesis ignore the sensor-physics mismatch between camera output and real-world radiance, which leads to suboptimal HDR reconstruction and deblurring. Method: A sensor-physics-driven NeRF framework is proposed: NeRF directly models HDR scene radiance; a pixel-wise RGB mapping field aligns rendered values with the LDR inputs; an event mapping field links scene dynamics to the event sensor output; the two mapping fields are optimized jointly with the NeRF. Result: On self-collected and public datasets, the method achieves state-of-the-art deblurring HDR novel view synthesis from single-exposure blurry LDR images plus event data. Conclusion: By explicitly modeling the sensor's physical process, the method improves the quality and sharpness of HDR novel-view reconstruction from blurry LDR images under extreme lighting. Abstract: Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.

[52] Hybrid Vision Transformer-GAN Attribute Neutralizer for Mitigating Bias in Chest X-Ray Diagnosis

Jobeal Solomon,Ali Mohammed Mansoor Alsahag,Seyed Sahand Mohammadi Ziabari

Main category: cs.CV

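TL;DR: This paper proposes replacing the U-Net with a Vision Transformer (ViT) as the encoder of the Attribute-Neutral Framework to reduce sex- and age-related bias in chest X-ray classifiers more effectively, substantially lowering demographic attribute leakage while preserving diagnostic performance.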

Details

Motivation: Chest X-ray classifiers are often biased because they exploit sex- and age-related shortcut features, which leads to systematic underdiagnosis of minority subgroups; existing attribute neutralizers based on convolutional encoders do not fully remove attribute leakage at clinically usable edit strengths. Method: The U-Net convolutional encoder in the Attribute-Neutral Framework is replaced with a Data-efficient Image Transformer Small (DeiT-S) vision transformer trained on the ChestX-ray14 dataset; edited images are generated at 11 edit-intensity levels, attribute leakage is assessed with an independent AI judge (e.g., sex-recognition AUC), and disease prediction performance is assessed with a CNN (macro ROC AUC and subgroup AUC). Result: At a moderate edit strength (alpha = 0.5), the ViT neutralizer lowers sex-recognition AUC to about 0.80, roughly 10 percentage points below the original U-Net framework; macro ROC AUC over 15 findings drops by no more than 5 percentage points relative to the unedited images, and the worst-case subgroup AUC remains near 0.70. Conclusion: Global self-attention vision models such as ViT can further suppress demographic attribute leakage without harming clinical utility, offering a practical route toward fairer chest X-ray AI. Abstract: Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework's convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70. These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.

[53] Controllable Layered Image Generation for Real-World Editing

Jinrui Yang,Qing Liu,Yijun Li,Mengwei Ren,Letian Zhang,Zhe Lin,Cihang Xie,Yuyin Zhou

Main category: cs.CV

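TL;DR: This paper proposes LASAGNA, a framework that jointly generates an image and its layered representation (a background plus a transparent foreground with realistic visual effects), supports control through multiple conditioning inputs, and is accompanied by a new dataset, LASAGNA-48K, and the first layer-editing benchmark, LASAGNABENCH.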

Details

Motivation: Existing image generation models lack controllability and consistency when editing specific image elements; layered representations are promising, but existing methods struggle to produce layers with plausible compositing relationships and realistic visual effects such as shadows and reflections. Method: LASAGNA is a unified framework that jointly generates a background and a high-quality transparent foreground layer; the LASAGNA-48K dataset with physically grounded visual effects is constructed; LASAGNABENCH, the first layer-editing benchmark, is designed; supported conditioning inputs include text, foreground, background, and location masks. Result: LASAGNA generates highly consistent and coherent multi-layer images simultaneously, accurately preserves identity and visual effects, and markedly improves layer-editing capability; LASAGNA-48K and LASAGNABENCH will be open-sourced. Conclusion: LASAGNA offers a new paradigm for controllable, consistent layered image generation and editing, advancing layer-based image editing research and applications. Abstract: Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers--a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs--text prompts, foreground, background, and location masks--offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community. The project page is https://rayjryang.github.io/LASAGNA-Page/.

[54] DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views

William Huang,Siyou Pei,Leyi Zou,Eric J. Gonzalez,Ishan Chatterjee,Yang Zhang

Main category: cs.CV

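TL;DR: This paper proposes a dual-stream delta encoder that exploits dorsal hand skin deformation, substantially improving egocentric hand pose estimation accuracy under self-occlusion and enabling new interaction paradigms.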

Details

Motivation: The spread of XR devices makes egocentric hand pose estimation essential, but frequent finger self-occlusion makes it difficult. Method: A dense visual featurizer extracts dorsal hand skin deformation information, and a dual-stream delta encoder learns pose by contrasting features of the dynamic hand with those of a relaxed baseline position. Result: Using only cropped dorsal images, MPJAE is reduced by 18% relative to state-of-the-art methods in self-occluded scenarios (fingers >=50% occluded); the method also improves the reliability of downstream tasks such as pinch and tap estimation and supports detecting isometric force without visible movement (e.g., a surface "click"). Conclusion: The method balances accuracy, robustness, and model compactness, and extends natural interaction to occluded conditions. Abstract: The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >=50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" without visible movement while minimizing model size.

[55] VIOLA: Towards Video In-Context Learning with Minimal Annotations

Ryo Fujii,Hideo Saito,Ryo Hachiuma

Main category: cs.CV

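TL;DR: This paper proposes VIOLA, a framework that uses density-uncertainty-weighted sampling together with confidence-aware retrieval and prompting to adapt multimodal large language models to new video domains through in-context learning with very few expert annotations.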

Details

Motivation: The generalization of video multimodal large language models is limited by the scarcity of labeled data, especially in specialized settings such as industrial and surgical video where expert annotations are hard to obtain; standard in-context learning relies on large annotated pools, which is impractical. Method: VIOLA has two components: (1) density-uncertainty-weighted sampling, which selects samples that are diverse, representative, and informative under a strict annotation budget; (2) a hybrid demonstration pool with confidence-aware retrieval (combining similarity and prediction confidence) and confidence-aware prompting, which lets the model distinguish verified labels from noisy pseudo-labels. Result: Experiments on 9 video benchmarks with 4 MLLMs show that VIOLA significantly outperforms a range of baselines in low-resource settings and achieves robust domain adaptation at very low annotation cost. Conclusion: VIOLA shows that very few expert annotations suffice to improve MLLM generalization to new video domains, offering a feasible path to training-free, label-efficient model adaptation in specialized settings. Abstract: Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require the experts' annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
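The confidence-aware retrieval step mentioned in this entry ranks demonstrations by a composite of similarity and confidence. The Python sketch below shows one plausible form of such a score. It is illustrative only: the linear combination, the weight `lam`, the dictionary fields, and the prompt tags are all assumptions, since the abstract does not give the exact formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query_emb, pool, k, lam=0.5):
    """Rank demonstrations by lam * similarity + (1 - lam) * confidence (assumed form)."""
    def score(demo):
        return lam * cosine(query_emb, demo["emb"]) + (1.0 - lam) * demo["confidence"]
    return sorted(pool, key=score, reverse=True)[:k]


def format_demo(demo):
    """Tag each demonstration so the prompt exposes how reliable its label is."""
    if demo["verified"]:
        tag = "expert-verified"
    else:
        tag = "pseudo-label, confidence %.2f" % demo["confidence"]
    return "[%s] %s" % (tag, demo["label"])


# Hypothetical hybrid pool: one expert label and two pseudo-labels.
pool = [
    {"emb": [1.0, 0.0], "label": "suturing", "confidence": 1.0, "verified": True},
    {"emb": [1.0, 0.0], "label": "cutting", "confidence": 0.2, "verified": False},
    {"emb": [0.0, 1.0], "label": "idle", "confidence": 1.0, "verified": False},
]
top = retrieve([1.0, 0.0], pool, k=2)
prompt_lines = [format_demo(d) for d in top]
```

With equal weighting, a highly similar but low-confidence pseudo-label still ranks below an equally similar verified label, which is the behavior the composite score is meant to produce.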

[56] Relative Classification Accuracy: A Calibrated Metric for Identity Consistency in Fine-Grained K-pop Face Generation

Sylvey Lin,Eranki Vasistha

Main category: cs.CV

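TL;DR: This paper proposes a new evaluation metric, RCA, for measuring the semantic controllability of class-conditional DDPMs on K-pop idol face generation, and finds that the model suffers from severe semantic mode collapse despite high visual quality.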

Details

Motivation: Standard metrics such as FID and IS struggle to detect identity misalignment in fine-grained, single-domain tasks, so a more suitable measure of semantic controllability is needed. Method: For 32x32 K-pop idol face generation, a class-conditional DDPM is built, and Relative Classification Accuracy (RCA), normalized against an oracle classifier, is proposed as a new metric; confusion matrices are used to analyze failure causes. Result: The model reaches an FID of 8.93 (high visual quality) but an RCA of only 0.27 (severe semantic mode collapse), performing especially poorly on visually ambiguous identities; failures are attributed mainly to resolution constraints and intra-gender similarity. Conclusion: RCA provides a rigorous standard for verifying identity consistency in conditional generative models, shows that high visual quality does not imply high semantic controllability, and underlines the need for task-appropriate metrics in fine-grained tasks. Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in high-fidelity image generation. However, evaluating their semantic controllability, specifically for fine-grained, single-domain tasks, remains challenging. Standard metrics like FID and Inception Score (IS) often fail to detect identity misalignment in such specialized contexts. In this work, we investigate Class-Conditional DDPMs for K-pop idol face generation (32x32), a domain characterized by high inter-class similarity. We propose a calibrated metric, Relative Classification Accuracy (RCA), which normalizes generative performance against an oracle classifier's baseline. Our evaluation reveals a critical trade-off: while the model achieves high visual quality (FID 8.93), it suffers from severe semantic mode collapse (RCA 0.27), particularly for visually ambiguous identities. We analyze these failure modes through confusion matrices and attribute them to resolution constraints and intra-gender ambiguity. Our framework provides a rigorous standard for verifying identity consistency in conditional generative models.
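One way to read "normalizes generative performance against an oracle classifier's baseline" is as a ratio of two accuracies. The Python sketch below assumes exactly that form, which the abstract does not spell out: the oracle's accuracy on generated images (judged against the conditioning label) divided by the same oracle's accuracy on real held-out images. The function names and toy labels are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def relative_classification_accuracy(gen_preds, gen_labels, real_preds, real_labels):
    """Assumed RCA form: oracle accuracy on generated data / oracle accuracy on real data."""
    oracle_acc = accuracy(real_preds, real_labels)
    if oracle_acc == 0:
        raise ValueError("oracle baseline accuracy is zero; RCA is undefined")
    return accuracy(gen_preds, gen_labels) / oracle_acc


# Toy example: the oracle is right on 2 of 4 real images and on 1 of 4 generated ones.
rca = relative_classification_accuracy(
    gen_preds=["a", "b", "b", "b"], gen_labels=["a", "a", "a", "a"],
    real_preds=["a", "a", "b", "b"], real_labels=["a", "a", "a", "a"],
)
```

Dividing by the oracle baseline separates failures of the generator from the inherent difficulty of telling similar identities apart at low resolution.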

[57] Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization for Cross-Subject EEG Emotion Recognition

Weiwei Wu,Yueyang Li,Yuhu Shi,Weiming Zeng,Lang Qin,Yang Yang,Ke Zhou,Zhiguo Zhang,Wai Ting Siok,Nizhuan Wang

Main category: cs.CV

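TL;DR: This paper proposes the RSM-CoDG framework, which combines brain-region priors, multi-scale temporal modeling, and a collaborative domain generalization strategy to improve the robustness and generalization of cross-subject EEG emotion recognition.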

Details

Motivation: Cross-subject EEG emotion recognition suffers from severe distribution shift caused by large inter-subject variability and complex neural representations, and existing methods struggle to combine cross-subject alignment, multi-scale dynamic modeling, and bias suppression in a single framework. Method: Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization (RSM-CoDG) is proposed: (1) region-level spatial representations are built from a functional brain-region partition; (2) multi-scale temporal modeling captures the dynamic evolution of emotion-related neural activity; (3) a collaborative domain generalization strategy with multidimensional constraints suppresses subject-specific bias. Result: The method significantly outperforms existing methods on the SEED series datasets, confirming strong generalization and robustness on unseen subjects. Conclusion: By combining neuroscience priors with collaborative domain generalization, RSM-CoDG mitigates cross-subject distribution shift within a unified framework and offers a generalizable new paradigm for EEG emotion recognition. Abstract: Cross-subject EEG-based emotion recognition (EER) remains challenging due to strong inter-subject variability, which induces substantial distribution shifts in EEG signals, as well as the high complexity of emotion-related neural representations in both spatial organization and temporal evolution. Existing approaches typically improve spatial modeling, temporal modeling, or generalization strategies in isolation, which limits their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework. To address these gaps, we propose a Region-aware Spatiotemporal Modeling framework with Collaborative Domain Generalization (RSM-CoDG) for cross-subject EEG emotion recognition. RSM-CoDG incorporates neuroscience priors derived from functional brain region partitioning to construct region-level spatial representations, thereby improving cross-subject comparability. It also employs multi-scale temporal modeling to characterize the dynamic evolution of emotion-evoked neural activity. In addition, the framework employs a collaborative domain generalization strategy, incorporating multidimensional constraints to reduce subject-specific bias in a fully unseen target subject setting, which enhances the generalization to unknown individuals. Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness. The source code is available at https://github.com/RyanLi-X/RSM-CoDG.

[58] Explainable Deepfake Detection with RL Enhanced Self-Blended Images

Ning Jiang,Dingheng Zeng,Yanhong Liu,Haiyang Yi,Shijie Yu,Minghe Weng,Haifeng Shen,Ying Li

Main category: cs.CV

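TL;DR: This paper proposes an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images and a reinforcement learning (RL) enhanced deepfake detection framework to address the scarcity of high-quality annotated data for explainable deepfake detection with multimodal large language models (MLLMs), achieving performance competitive with SOTA on multiple cross-dataset benchmarks.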

Details

Motivation: Existing deepfake detection methods lack explainability, and multimodal large language models (MLLMs), although promising, are held back by the cost and difficulty of fine-grained textual forgery-attribution annotation; reinforcement learning has also shown advantages in visual tasks, especially for cross-domain generalization, which motivates exploring it in this setting. Method: An automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images is proposed, together with an RL-enhanced detection framework that includes a tailored reward mechanism and a feedback-driven synthetic data generation strategy. Result: Extensive experiments validate the CoT data construction pipeline, the reward mechanism, and the synthetic data generation method; performance is competitive with state-of-the-art (SOTA) methods on multiple cross-dataset benchmarks. Conclusion: The work provides a feasible way to reduce annotation cost when applying MLLMs to deepfake detection and confirms the potential of reinforcement learning for improving explainability and generalization. Abstract: Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging - particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at https://github.com/deon1219/rlsbi.

[59] Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception

Bo Yuan,Danpei Zhao,Wentao Li,Tian Li,Zhiguo Jiang

Main category: cs.CV

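TL;DR: This paper proposes the continual panoptic perception (CPP) framework, which combines multimodal and multi-task continual learning; a collaborative cross-modal encoder, a malleable knowledge inheritance module, a cross-modal consistency constraint, and an asymmetric pseudo-labeling strategy mitigate catastrophic forgetting and semantic confusion and improve joint pixel-, instance-, and image-level perception.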

Details

Motivation: Existing continual learning focuses mainly on single-task scenarios and copes poorly with the semantic confusion and catastrophic forgetting that arise in multi-task, multimodal settings, which limits the practical use of intelligent perception systems. Method: A continual panoptic perception (CPP) model is proposed, comprising a collaborative cross-modal encoder (CCE), a malleable knowledge inheritance module based on contrastive feature distillation and instance distillation, a cross-modal consistency constraint (CPP+), and an asymmetric pseudo-labeling mechanism that requires no exemplar replay. Result: Experiments on multimodal datasets and diverse continual learning tasks show that the model significantly outperforms existing methods, particularly on fine-grained CL tasks. Conclusion: CPP provides a unified modeling paradigm for multimodal, multi-task continual learning and improves generalization, robustness, and semantic consistency during incremental training. Abstract: Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. However, pioneering research has predominantly focused on single-task CL, which restricts the potential in multi-task and multimodal scenarios. Beyond the well-known issue of catastrophic forgetting, multi-task CL also brings semantic obfuscation across multimodal alignment, leading to severe model degradation during incremental training steps. In this paper, we extend CL to continual panoptic perception (CPP), integrating multimodal and multi-task CL to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. We formalize the CL task in multimodal scenarios and propose an end-to-end continual panoptic perception model. Concretely, the CPP model features a collaborative cross-modal encoder (CCE) for multimodal embedding. We also propose a malleable knowledge inheritance module via contrastive feature distillation and instance distillation, addressing catastrophic forgetting in a task-interactive boosting manner. Furthermore, we propose a cross-modal consistency constraint and develop CPP+, ensuring multimodal semantic alignment for model updating under multi-task incremental scenarios. Additionally, our proposed model incorporates an asymmetric pseudo-labeling scheme, enabling the model to evolve without exemplar replay. Extensive experiments on multimodal datasets and diverse CL tasks demonstrate the superiority of the proposed model, particularly in fine-grained CL tasks.

[60] SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based Occupancy Prediction

Zichen Yu,Quanli Liu,Wei Wang,Liyong Zhang,Xiaoguang Zhao

Main category: cs.CV

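TL;DR: This paper proposes SuperOcc, a new superquadric-based framework for 3D occupancy prediction that uses cohesive temporal modeling, multi-superquadric decoding, and efficient voxel splatting to balance geometric expressiveness, query sparsity, and computational efficiency, reaching SOTA on SurroundOcc and Occ3D.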

Details

Motivation: Most existing 3D occupancy prediction methods use dense scene representations and overlook the sparsity of real driving scenes; sparse representations such as superquadrics have appeared, but they still suffer from insufficient temporal modeling, a difficult trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. Method: The SuperOcc framework has three core designs: (1) a cohesive temporal modeling mechanism that jointly exploits view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy that increases geometric expressiveness while keeping queries sparse; (3) an efficient superquadric-to-voxel splatting scheme. Result: SOTA performance on the SurroundOcc and Occ3D benchmarks with higher computational efficiency. Conclusion: SuperOcc demonstrates that efficient, accurate 3D occupancy prediction is feasible with sparse superquadric representations, offering a new direction for autonomous driving perception. Abstract: 3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at https://github.com/Yzichen/SuperOcc.
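For readers unfamiliar with superquadrics, the Python sketch below evaluates the standard superquadric inside-outside function and marks the voxels whose centers fall inside one primitive. It is a simplified illustration: the primitive is axis-aligned with no rotation, occupancy is a hard threshold rather than the paper's splatting scheme, and the grid layout and function names are assumptions.

```python
def superquadric_f(x, y, z, scale, eps):
    """Inside-outside function: < 1 inside, == 1 on the surface, > 1 outside."""
    a1, a2, a3 = scale
    e1, e2 = eps
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)


def voxelize(center, scale, eps, grid_size, voxel_size):
    """Return the set of (i, j, k) voxels whose centers lie inside the superquadric."""
    occupied = set()
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                # Voxel center expressed relative to the primitive's center.
                px = (i + 0.5) * voxel_size - center[0]
                py = (j + 0.5) * voxel_size - center[1]
                pz = (k + 0.5) * voxel_size - center[2]
                if superquadric_f(px, py, pz, scale, eps) <= 1.0:
                    occupied.add((i, j, k))
    return occupied


# With eps = (1, 1) the superquadric reduces to an ellipsoid; here a unit sphere.
occupied = voxelize(center=(2.0, 2.0, 2.0), scale=(1.0, 1.0, 1.0),
                    eps=(1.0, 1.0), grid_size=4, voxel_size=1.0)
```

Shrinking the two exponents toward zero makes the shape more box-like, which is why a handful of superquadrics can cover varied scene geometry with few parameters.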

[61] Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams

Zhenghui Guo,Yuanbin Man,Junyuan Sheng,Bowen Lin,Ahmed Ahmed,Bo Jiang,Boyuan Zhang,Miao Yin,Sian Jin,Omprakash Gnawal,Chengming Zhang

Main category: cs.CV

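TL;DR: Event-VStream is an event-aware framework for real-time long-video understanding that triggers language generation at semantically coherent event boundaries and stores event embeddings in a persistent memory bank, enabling long-horizon reasoning at low latency.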

Details

Motivation: Existing VLMs process redundant frames and quickly forget past context when handling long video streams, and fixed-interval decoding or cache pruning cannot balance efficiency with information completeness. Method: The Event-VStream framework detects semantic event boundaries by fusing motion, semantic, and predictive cues and triggers language generation only at those boundaries; each event embedding is consolidated into a persistent memory bank to support long-range reasoning. Result: On OVOBench-Realtime the method improves over VideoLLM-Online-8B by +10.4 points, maintains a GPT-5 win rate of about 70% on long Ego4D streams, and comes close to Flash-VStream-7B while using only a LLaMA-3-8B text backbone. Conclusion: An event-driven memory mechanism mitigates redundancy and forgetting in long video stream understanding and improves long-horizon reasoning while keeping latency low. Abstract: Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.
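The event-triggered loop described in this entry can be pictured with the following Python sketch. Everything concrete in it is an assumption made for illustration: the cue values are taken as precomputed numbers in [0, 1], they are fused by a weighted average, a fixed threshold marks a boundary, and an event is summarized as the mean of its frame embeddings. The paper's actual detectors and consolidation rule are not specified in the abstract.

```python
def boundary_score(cues, weights=(1.0, 1.0, 1.0)):
    """Weighted average of (motion, semantic shift, prediction error) cues."""
    total = sum(w * c for w, c in zip(weights, cues))
    return total / sum(weights)


def mean_embedding(embeddings):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def process_stream(frames, threshold=0.5):
    """Close the current event at each detected boundary.

    Returns the memory bank (one embedding per closed event) and the frame
    indices at which language generation would be triggered.
    """
    memory_bank = []
    trigger_steps = []
    current_event = []
    for t, frame in enumerate(frames):
        if current_event and boundary_score(frame["cues"]) >= threshold:
            memory_bank.append(mean_embedding(current_event))
            trigger_steps.append(t)  # generation fires only here
            current_event = []
        current_event.append(frame["emb"])
    return memory_bank, trigger_steps


# Hypothetical stream: three quiet frames, one clear transition, one quiet frame.
frames = (
    [{"emb": [1.0, 0.0], "cues": (0.1, 0.1, 0.1)} for _ in range(3)]
    + [{"emb": [0.0, 1.0], "cues": (0.9, 0.9, 0.9)}]
    + [{"emb": [0.0, 1.0], "cues": (0.1, 0.1, 0.1)}]
)
memory_bank, trigger_steps = process_stream(frames)
```

In this toy stream the decoder would run once in five frames instead of at every fixed interval, and the memory bank holds one vector per finished event rather than one per frame.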

[62] Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling

Hongyang Wei,Hongbo Liu,Zidong Wang,Yi Peng,Baixin Xu,Size Wu,Xuying Zhang,Xianglong He,Zexiang Liu,Peiyu Wang,Xuchen Song,Yangguang Li,Yang Liu,Yahui Zhou

Main category: cs.CV

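TL;DR: This paper presents Skywork UniPic 3.0, a unified multimodal framework for high-quality multi-image composition that accepts 1 to 6 input images of arbitrary resolution and produces outputs of arbitrary resolution within a total pixel budget of 1024x1024, with a focus on human-object interaction (HOI) tasks; through a new data pipeline, a training paradigm that casts multi-image composition as sequence generation, and an accelerated inference strategy combining trajectory mapping and distribution matching, it reaches SOTA performance with only 700K high-quality samples and achieves high-fidelity generation in 8 steps with a 12.5x speedup.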

Details

Motivation: Community demand for multi-image composition, especially HOI-style composition, has surged, but existing models do not disclose concrete methods for high-quality fusion, and the task is challenging in both consistency and quality. Method: The Skywork UniPic 3.0 unified framework is proposed; a dedicated data collection, filtering, and synthesis pipeline is designed for HOI-oriented multi-image composition; multi-image composition is modeled as a conditional sequence generation task; trajectory mapping and distribution matching are introduced in post-training to accelerate inference. Result: The model reaches SOTA on a single-image editing benchmark and surpasses Nano-Banana and Seedream 4.0 on a multi-image composition benchmark; it needs only 700K samples and 8 sampling steps to produce high-fidelity results, with a 12.5x inference speedup. Conclusion: The unified framework, data strategy, and sequence-based training paradigm address the consistency and efficiency problems of multi-image composition and markedly improve composition quality and speed in complex scenarios such as HOI. Abstract: The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary (1~6) number and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models and dataset are publicly available.

[63] Consistency-Regularized GAN for Few-Shot SAR Target Recognition

Yikui Zhai,Shikuang Liu,Wenlve Zhou,Hongsheng Zhang,Zhiheng Zhou,Xiaolin Tian,C. L. Philip Chen

Main category: cs.CV

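TL;DR: This paper proposes Cr-GAN, a GAN framework that generates high-quality samples even from very few SAR images; a dual-branch discriminator, channel-wise feature interpolation, and a dual-domain cycle consistency mechanism address the instability of GAN training under data scarcity and significantly improve few-shot SAR target recognition.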

Details

Motivation: Few-shot recognition in SAR imagery is limited by extreme data scarcity, while conventional GANs need large amounts of training data, which contradicts the few-shot premise; a generative model that trains stably and produces high-quality data from very few samples is needed. Method: The consistency-regularized generative adversarial network (Cr-GAN) is proposed: a dual-branch discriminator decouples adversarial training from representation learning; channel-wise feature interpolation creates new latent features; a dual-domain (image and feature) cycle consistency mechanism preserves semantic consistency; the framework supports multiple GAN architectures and works with multiple self-supervised pre-training methods. Result: In the 8-shot setting, accuracy reaches 71.21% on MSTAR and 51.64% on SRSDD, significantly surpassing existing baselines; the parameter count is only about 1/5 of that of advanced diffusion models. Conclusion: Cr-GAN eases the fundamental conflict between the data requirements of generative models and the data scarcity of few-shot SAR recognition, offering a scalable and efficient paradigm for low-resource remote sensing image learning. Abstract: Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5 of the parameters of state-of-the-art diffusion models. Code is available at: https://github.com/yikuizhai/Cr-GAN.
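A channel-wise feature interpolation of the kind named in this entry could look like the Python sketch below, in which two feature maps are mixed with an independent coefficient per channel. This is a guess at the general idea, not the paper's procedure: the uniform sampling of coefficients, the list-of-channels layout, and the function name are hypothetical.

```python
import random


def channelwise_interpolate(feat_a, feat_b, rng):
    """Mix two feature maps using one random coefficient per channel.

    Each feature map is a list of channels; each channel is a list of floats.
    """
    mixed = []
    for chan_a, chan_b in zip(feat_a, feat_b):
        lam = rng.random()  # coefficient in [0, 1) for this channel only
        mixed.append([lam * a + (1.0 - lam) * b for a, b in zip(chan_a, chan_b)])
    return mixed


# Toy features with three channels of two values each.
rng = random.Random(0)
feat_a = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
feat_b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
new_feat = channelwise_interpolate(feat_a, feat_b, rng)
```

Unlike a single global mixing coefficient, per-channel coefficients let the new latent feature borrow different attributes from each source sample, which is one way to get diversity out of a handful of real images.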

[64] Performance-guided Reinforced Active Learning for Object Detection

Zhixuan Liang,Xingyu Zeng,Rui Zhao,Ping Luo

Main category: cs.CV

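TL;DR: This paper proposes MGRAL, a performance-guided reinforced active learning method for object detection that uses mAP improvement as the reward signal, lets a reinforcement learning agent select the most informative samples, and reduces computation with unsupervised fast look-up tables.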

Details

Motivation: Existing active learning methods estimate sample informativeness without linking it directly to downstream task performance (such as mAP in object detection), so annotation effort and model improvement are decoupled. Method: The MGRAL framework is proposed: expected model output change serves as the informativeness measure; a policy-gradient reinforcement learning sampling agent addresses the combinatorial explosion of batch selection and the non-differentiability of mAP; unsupervised fast look-up tables approximate mAP to reduce computational cost. Result: On object detection with the PASCAL VOC and COCO datasets, MGRAL achieves the best active learning curve and provides convincing visualizations. Conclusion: MGRAL establishes a reinforcement-learning-driven paradigm for active learning in object detection that better couples annotation efficiency with downstream performance gains. Abstract: Active learning (AL) strategies aim to train high-performance models with minimal labeling efforts, only selecting the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data's distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e. mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness. To address the combinatorial explosion challenge of batch sample selection and the non-differentiable correlation between model performance and selected batches, MGRAL skillfully employs a reinforcement learning-based sampling agent that optimizes selection using policy gradient with mAP improvement as reward. Moreover, to reduce the computational overhead of mAP estimation with unlabeled samples, MGRAL utilizes an unsupervised way with fast look-up tables, ensuring feasible deployment. We evaluate MGRAL's active learning performance on detection tasks over PASCAL VOC and COCO benchmarks. Our approach demonstrates the highest AL curve with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.

[65] Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Mingyu Yu,Lana Liu,Zhehao Zhao,Wei Wang,Sujuan Qin

Main category: cs.CV

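TL;DR: This paper proposes Beyond Visual Safety (BVS), an image-text pair jailbreaking framework for probing the visual safety boundaries of multimodal large language models (MLLMs); using a "reconstruction-then-generation" strategy it reports a jailbreak success rate of up to 98.21%, exposing key weaknesses in the visual safety alignment of current MLLMs.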

Details

Motivation: Prior work has explored the security vulnerabilities of MLLMs, but their visual safety boundaries remain insufficiently studied, and a systematic method for evaluating visual content safety protections is needed. Method: The BVS framework uses a "reconstruction-then-generation" strategy that combines neutralized visual splicing with inductive recomposition to decouple malicious intent from the raw inputs, inducing MLLMs to generate harmful images. Result: BVS achieves a jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Conclusion: Current MLLMs have serious deficiencies in visual safety alignment; BVS exposes a key weakness in how they handle joint image-text inputs, and stronger multimodal safety mechanisms are needed. Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby leading MLLMs to be induced into generating harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.

Ali Caglayan,Nevrez Imamoglu,Toru Kouyama

Main category: cs.CV

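TL;DR: This paper proposes an LULC semantic segmentation method for Japan-wide ALOS-2 SAR data; three lightweight refinements mitigate blurred boundaries, missed thin structures, and degraded long-tail class performance in SAR dense prediction, improving overall and rare-class segmentation accuracy without increasing pipeline complexity.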

Details

Motivation: SAR dense prediction commonly suffers from boundary over-smoothing, missed thin or slender structures, and degraded rare-class performance under long-tailed labels. Method: Building on the SAR-W-MixMAE self-supervised pre-training framework, three lightweight refinements are introduced: (i) high-resolution features are injected into multi-scale decoding; (ii) a progressive refine-up decoder head alternates convolutional refinement with stepwise upsampling; (iii) an α-scale factor tempers class reweighting within the focal+dice loss. Result: The method yields consistent gains on the Japan-wide ALOS-2 LULC benchmark, especially for under-represented classes, and improves water detection accuracy under standard evaluation metrics. Conclusion: The method mitigates key challenges in SAR semantic segmentation without adding pipeline complexity and shows good practicality and generalization. Abstract: This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes, boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels, without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an $\alpha$-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
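One common way to temper class reweighting is to raise inverse class frequency to a power between 0 and 1. The Python sketch below assumes the α-scale factor in this entry plays that role, which the abstract does not confirm: the formula, the normalization to mean 1, and the function name are hypothetical.

```python
def tempered_class_weights(pixel_counts, alpha):
    """Per-class weights proportional to (total / count) ** alpha, normalized to mean 1.

    alpha = 0 gives uniform weights; alpha = 1 gives full inverse-frequency weights.
    """
    total = sum(pixel_counts)
    raw = [(total / count) ** alpha for count in pixel_counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


# Toy two-class label histogram: a frequent class and a rare one.
counts = [900, 100]
uniform = tempered_class_weights(counts, alpha=0.0)
tempered = tempered_class_weights(counts, alpha=0.5)
full = tempered_class_weights(counts, alpha=1.0)
```

With a 9:1 imbalance, full inverse-frequency weighting favors the rare class by a factor of 9, while α = 0.5 reduces that to a factor of 3, which limits the instability that very large rare-class weights can cause in a focal+dice objective.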

[67] Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework

Shubham Shukla,Kunal Sonalkar

Main category: cs.CV

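TL;DR: This paper proposes a three-tier evaluation framework for systematically assessing vision-language models (VLMs) on fine-grained fashion attribute prediction, separating attribute applicability detection (for example, "outer fabric" does not apply when no outer garment is present) from fine-grained classification. On DeepFashion-MultiModal, zero-shot VLMs reach 64.0% macro-F1 overall, well above the conventional baseline, but applicability detection is a clear bottleneck (NA-F1 of only 34.1%), while efficient models such as GPT-5 Mini reach over 90% of flagship performance.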

Details

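Motivation: VLMs lack systematic evaluation on multi-attribute fashion prediction. Fashion attributes are conditional (an attribute applies only when the relevant garment is present), so a model must first decide whether an attribute applies (applicability detection) before classifying it, yet this step has not been adequately modeled or evaluated. Method: The paper proposes a three-tier evaluation framework: (1) overall task performance, including the NA class; (2) attribute applicability detection (whether the label is NA); and (3) fine-grained classification for determinable attributes. On DeepFashion-MultiModal, which annotates NA explicitly, nine VLMs spanning flagship, efficient, and ultra-efficient tiers are compared against classifiers trained on Fashion-CLIP embeddings. Result: (1) Zero-shot VLMs reach 64.0% macro-F1, three times that of Fashion-CLIP with logistic regression; (2) fine-grained classification (Tier 3) reaches 70.8% F1, but applicability detection (Tier 2) reaches only 34.1% NA-F1 and is the main bottleneck; (3) efficient models such as GPT-5 Mini achieve over 90% of flagship performance at lower cost. Conclusion: The three-tier framework pinpoints whether errors come from applicability misjudgment or from classification, giving a diagnostic basis for deploying VLMs in fashion retail. Improving applicability detection is the key direction for improvement, while efficient VLMs are already practical to deploy. Abstract: Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including the NA class, which indicates that the attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning the attribute doesn't exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes.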
Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
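The three tiers can be illustrated with a small scoring sketch for a single attribute. This is an assumed reading of the tiers, not the paper's evaluation code: Tier 1 is taken as macro-F1 over all labels including NA, Tier 2 as the F1 of the NA class, and Tier 3 as macro-F1 restricted to samples whose ground truth is not NA. The label values and function names are hypothetical.

```python
NA = "NA"


def f1_for_class(y_true, y_pred, label):
    """F1 score of one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


def three_tier_scores(y_true, y_pred):
    labels = sorted(set(y_true))
    # Tier 1: overall performance across all labels, NA included.
    tier1 = macro_f1(y_true, y_pred, labels)
    # Tier 2: applicability detection, scored as F1 of the NA label.
    tier2 = f1_for_class(y_true, y_pred, NA)
    # Tier 3 (assumption): only samples whose ground truth is determinable.
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != NA]
    t3_true = [t for t, _ in kept]
    t3_pred = [p for _, p in kept]
    t3_labels = [c for c in labels if c != NA]
    tier3 = macro_f1(t3_true, t3_pred, t3_labels)
    return tier1, tier2, tier3


# Hypothetical ground truth and predictions for an "outer fabric" attribute.
truth = ["cotton", "NA", "denim", "NA", "cotton", "denim"]
preds = ["cotton", "cotton", "denim", "NA", "denim", "denim"]
tier1, tier2, tier3 = three_tier_scores(truth, preds)
```

Separating the scores this way makes it visible when a model classifies visible garments well (Tier 3) yet frequently assigns a value to an attribute that should have been NA (Tier 2), which is the failure pattern the abstract reports.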

[68] VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

Chenglin Li,Qianglong Chen,Feng Han,Yikun Wang,Xingxi Yin,Yan Gong,Ruilin Li,Yin Zhang,Jiaqi Wang

Main category: cs.CV

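TL;DR: This paper proposes VideoThinker, an agentic video large language model trained on synthetic tool-interaction trajectories. Multi-step tool-use sequences are generated in caption space and mapped back to video frames, producing a large-scale video and tool reasoning dataset that can be built without any long-video understanding capability, and substantially improving long-video understanding.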

Details

Motivation: Existing video LLMs reason statically over uniformly sampled frames, which weakens temporal localization in long videos and loses substantial information. Agentic tools such as temporal retrieval and spatial or temporal zoom can mitigate this, but constructing data for them requires models that already have strong long-video understanding, creating a circular dependency. Method: VideoThinker first converts videos into rich captions, uses a strong agentic language model to generate multi-step tool-use sequences in caption space, then replaces the captions with the corresponding video frames to build a large-scale interleaved video and tool reasoning dataset, on which the video LLM is trained end to end. Result: VideoThinker significantly outperforms caption-only language model agents and strong video model baselines on long-video benchmarks, showing dynamic reasoning, adaptive temporal exploration, and multi-step tool use. Conclusion: Tool-augmented synthetic data combined with adaptive retrieval and zoom reasoning is an effective route to better long-video understanding; high-quality agentic video training is possible through caption-space modeling without real long-video annotations. Abstract: Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use.
Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.

Linyong Zou,Liang Zhang,Xiongfei Wang,Jia-Hong Gao,Yi Sun,Shurong Sheng,Kuntao Xiao,Wanli Yang,Pengfei Teng,Guoming Luan,Zhao Lv,Zikang Xu

Main category: cs.CV

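TL;DR: This paper proposes FAIR-ESI, a framework that adaptively refines feature importance across multiple views (spectral, temporal, and patch-wise) to improve the accuracy of electrophysiological source imaging, and validates it on simulated and clinical data.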

Details

Motivation: Accurately selecting and refining features is a central challenge for precise electrophysiological source imaging (ESI). Method: FAIR-ESI comprises three modules (FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement) that adaptively refine feature importance across views. Result: Experiments on two simulation datasets and two real-world clinical datasets show that the framework significantly improves ESI accuracy and has potential for brain disorder diagnosis and brain function research. Conclusion: FAIR-ESI offers an interpretable, adaptive multi-view feature refinement paradigm for ESI that may advance precise diagnosis of brain disorders and the study of neural mechanisms. Abstract: An essential technique for diagnosing brain disorders is electrophysiological source imaging (ESI). While model-based optimization and deep learning methods have achieved promising results in this field, the accurate selection and refinement of features remains a central challenge for precise ESI. This paper proposes FAIR-ESI, a novel framework that adaptively refines feature importance across different views, including FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement. Extensive experiments on two simulation datasets with diverse configurations and two real-world clinical datasets validate our framework's efficacy, highlighting its potential to advance brain disorder diagnosis and offer new insights into brain function.

Shadi Alijani,Fereshteh Aghaee Meibodi,Homayoun Najjaran

Main category: cs.CV

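TL;DR: This paper proposes a framework for adapting foundation models to multi-modal medical imaging, combining sub-region-aware modality attention with adaptive prompt engineering, and significantly improves brain tumor segmentation accuracy, especially in the necrotic core.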

Details

Motivation: Existing foundation models struggle to fuse information from multiple sources and to adapt to the heterogeneity of pathological tissue in multi-modal medical imaging. Method: The paper proposes sub-region-aware modality attention and adaptive prompt engineering, which learn the optimal modality combination for each tumor sub-region and draw on the capabilities of the foundation model in a targeted way. Result: On the BraTS 2020 dataset, the method significantly outperforms baseline methods in challenging sub-regions such as the necrotic core. Conclusion: The framework offers a principled and effective approach to multi-modal fusion and prompting, supporting more accurate and robust use of foundation models in medical imaging. Abstract: The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.

[71] Breaking the Resolution Barrier: Arbitrary-resolution Deep Image Steganography Framework

Xinjue Hu,Chi Wang,Boyu Wang,Xiang Zhang,Zhenshan Tan,Zhangjie Fu

Main category: cs.CV

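TL;DR: This paper proposes ARDIS, the first arbitrary-resolution deep image steganography framework. A frequency decoupling architecture and a latent-guided implicit reconstructor address the detail loss and blind recovery problems that resolution mismatch causes in conventional methods, significantly improving invisibility and cross-resolution recovery fidelity.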

Details

Motivation: Existing deep image steganography methods require the secret image and the cover image to share the same resolution. Mismatched resolutions therefore force resampling, which loses detail, and the secret cannot be recovered accurately when its original resolution is unknown. Method: ARDIS has three parts: (1) in the hiding stage, a frequency decoupling architecture splits the secret image into a resolution-aligned global basis and a resolution-agnostic high-frequency latent; (2) in the recovery stage, a latent-guided implicit reconstructor renders high-frequency residuals through a continuous implicit function; (3) an implicit resolution coding strategy maps discrete resolution values to dense feature maps and hides them in the redundant space of the feature domain to enable blind recovery. Result: ARDIS significantly outperforms state-of-the-art methods in invisibility and cross-resolution recovery fidelity. Conclusion: ARDIS shifts deep image steganography from discrete mapping to reference-guided continuous signal reconstruction, resolving the key challenges of resolution mismatch and enabling high-fidelity blind recovery at arbitrary resolutions. Abstract: Deep image steganography (DIS) has achieved significant results in capacity and invisibility. However, current paradigms force the secret image to maintain the same resolution as the cover image during hiding and revealing. This leads to two challenges: secret images with inconsistent resolutions must undergo resampling beforehand, which results in detail loss during recovery, and the secret image cannot be recovered to its original resolution when the resolution value is unknown. To address these, we propose ARDIS, the first Arbitrary Resolution DIS framework, which shifts the paradigm from discrete mapping to reference-guided continuous signal reconstruction. Specifically, to minimize the detail loss caused by resolution mismatch, we first design a Frequency Decoupling Architecture in the hiding stage. It disentangles the secret into a resolution-aligned global basis and a resolution-agnostic high-frequency latent to hide in a fixed-resolution cover. Second, for recovery, we propose a Latent-Guided Implicit Reconstructor to perform deterministic restoration. The recovered detail latent code modulates a continuous implicit function to accurately query and render high-frequency residuals onto the recovered global basis, ensuring faithful restoration of original details. Furthermore, to achieve blind recovery, we introduce an Implicit Resolution Coding strategy. By transforming discrete resolution values into dense feature maps and hiding them in the redundant space of the feature domain, the reconstructor can correctly decode the secret's resolution directly from the steganographic representation.
Experimental results demonstrate that ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity.

[72] White-Box mHC: Electromagnetic Spectrum-Aware and Interpretable Stream Interactions for Hyperspectral Image Classification

Yimin Zhu,Lincoln Linlin Xu,Zhengsen Xu,Zack Dewis,Mabel Heffring,Saeid Taleghanidoozdoozan,Motasem Alkayid,Quinn Ledingham,Megan Greenwood

Main category: cs.CV

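TL;DR: This paper proposes ES-mHC, a physical-spectrum-aware white-box hyper-connection framework for hyperspectral image classification. Structured, directional matrices explicitly model interactions among electromagnetic spectrum groupings, improving interpretability and understanding of the model's internal mechanisms.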

Details

Motivation: Deep learning models for hyperspectral image classification rely on opaque spectral-spatial feature mixing, which limits interpretability and makes their decision mechanisms hard to understand. Method: ES-mHC separates feature representation from interaction structure, uses structured directional matrices over the residual streams to explicitly model interactions among electromagnetic spectrum groupings, and supports visualization and spatial analysis. Result: The learned hyper-connection matrices show coherent spatial patterns and asymmetric interaction behaviors, and a higher expansion rate accelerates the emergence of structured interaction patterns. Conclusion: ES-mHC turns hyperspectral image classification from a purely black-box prediction task into a structurally transparent, partially white-box learning process. Abstract: In hyperspectral image classification (HSIC), most deep learning models rely on opaque spectral-spatial feature mixing, limiting their interpretability and hindering understanding of internal decision mechanisms. We present physical spectrum-aware white-box mHC, named ES-mHC, a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (the residual streams in mHC) using structured, directional matrices. By separating feature representation from interaction structure, ES-mHC promotes electromagnetic spectrum grouping specialization, reduces redundancy, and exposes internal information flow that can be directly visualized and spatially analyzed. Using hyperspectral image classification as a representative testbed, we demonstrate that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model's internal dynamics. Furthermore, we find that increasing the expansion rate accelerates the emergence of structured interaction patterns. These results suggest that ES-mHC transforms HSIC from a purely black-box prediction task into a structurally transparent, partially white-box learning process.

[73] Atlas-Assisted Segment Anything Model for Fetal Brain MRI (FeTal-SAM)

Qi Zeng,Weide Liu,Bo Li,Ryne Didier,P. Ellen Grant,Davood Karimi

Main category: cs.CV

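TL;DR: FeTal-SAM is a SAM-based method that uses multi-atlas registration to generate spatially aligned dense prompts, combined with bounding boxes, to segment arbitrary structures in fetal brain MRI flexibly and without retraining. On well-contrasted structures it reaches performance comparable to state-of-the-art baselines while remaining adaptable to clinical needs.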

Details

Motivation: Conventional deep learning methods for fetal brain MRI segmentation depend on large annotated datasets, adapt poorly to changing label definitions, and make it hard to tell whether a segmentation is driven by genuine image contrast or by spatial priors. Method: Spatially aligned label templates produced by multi-atlas registration serve as dense prompts and are fed, together with a bounding-box prompt, into SAM's segmentation decoder to produce a binary segmentation per structure; these are then fused into a full 3D segmentation volume. Result: On the dHCP dataset and an in-house dataset, the method achieves Dice scores comparable to purpose-trained models on well-contrasted structures such as the cortical plate and cerebellum, supports segmentation of user-specified anatomy, and is slightly less accurate on low-contrast structures such as the hippocampus and amygdala. Conclusion: FeTal-SAM is a general-purpose fetal brain MRI segmentation framework that needs no retraining per label or dataset, improving clinical applicability and flexibility and marking a step toward fetal brain analysis tools that adapt to clinical needs. Abstract: This paper presents FeTal-SAM, a novel adaptation of the Segment Anything Model (SAM) tailored for fetal brain MRI segmentation. Traditional deep learning methods often require large annotated datasets for a fixed set of labels, making them inflexible when clinical or research needs change. By integrating atlas-based prompts and foundation-model principles, FeTal-SAM addresses two key limitations in fetal brain MRI segmentation: (1) the need to retrain models for varying label definitions, and (2) the lack of insight into whether segmentations are driven by genuine image contrast or by learned spatial priors. We leverage multi-atlas registration to generate spatially aligned label templates that serve as dense prompts, alongside a bounding-box prompt, for SAM's segmentation decoder. This strategy enables binary segmentation on a per-structure basis, which is subsequently fused to reconstruct the full 3D segmentation volumes. Evaluations on two datasets, the dHCP dataset and an in-house dataset, demonstrate FeTal-SAM's robust performance across gestational ages. Notably, for well-contrasted structures like cortical plate and cerebellum, it achieves Dice scores comparable to state-of-the-art baselines that were trained for each dataset and label definition, while maintaining the flexibility to segment any user-specified anatomy. Although slightly lower accuracy is observed for subtle, low-contrast structures (e.g., hippocampus, amygdala), our results highlight FeTal-SAM's potential to serve as a general-purpose segmentation model without exhaustive retraining.
This method thus constitutes a promising step toward clinically adaptable fetal brain MRI analysis tools.
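The abstract says per-structure binary segmentations are fused into a full label volume but does not give the fusion rule. The sketch below assumes a simple one: each voxel takes the structure with the highest foreground probability, or background if none exceeds a threshold. The function name, the threshold, and the flattened-volume representation are illustrative assumptions only.

```python
def fuse_binary_predictions(prob_maps, threshold=0.5):
    """Fuse per-structure foreground probabilities into one label per voxel.

    prob_maps maps a structure name to a flat list of probabilities, one per
    voxel, all of the same length. Assumed rule: highest probability wins,
    and a voxel stays "background" unless some structure exceeds threshold.
    """
    names = list(prob_maps)
    n_voxels = len(prob_maps[names[0]])
    fused = []
    for i in range(n_voxels):
        best_name, best_prob = "background", threshold
        for name in names:
            p = prob_maps[name][i]
            if p > best_prob:
                best_name, best_prob = name, p
        fused.append(best_name)
    return fused


# Hypothetical probabilities for four voxels and two structures.
example = {
    "cortical_plate": [0.9, 0.2, 0.6, 0.1],
    "cerebellum": [0.1, 0.8, 0.7, 0.3],
}
labels = fuse_binary_predictions(example)
```

A rule of this kind resolves voxels claimed by more than one binary mask, which is the main issue any per-structure scheme has to handle before producing a single 3D segmentation.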

[74] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps

Yuhan Chen,Ying Fang,Guofa Li,Wenxuan Yu,Yicui Shi,Jingrui Zhang,Kefei Qian,Wenbo Chu,Keqiang Li

Main category: cs.CV

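TL;DR: This paper proposes LL-GaussianMap, the first unsupervised framework to bring 2D Gaussian Splatting (2DGS) into low-light image enhancement. Explicit structural modeling guides gain map generation, preserving edges and suppressing artifacts while delivering strong enhancement with a low storage footprint.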

Details

Motivation: Most low-light image enhancement methods operate in the pixel domain or in implicit feature spaces and neglect the intrinsic geometric structural priors of images, while 2D Gaussian Splatting, despite its strong structural fitting and rendering efficiency, has not yet been applied to low-level vision tasks. Method: A two-stage unsupervised framework first performs high-fidelity structural reconstruction with 2DGS, then renders data-driven enhancement dictionary coefficients through the Gaussian rasterization mechanism in a unified enhancement module to guide gain map generation. Result: The method achieves superior enhancement on multiple benchmarks with a very low storage footprint, confirming the effectiveness of explicit Gaussian representations for image enhancement. Conclusion: LL-GaussianMap is the first to bring 2DGS into low-light enhancement, showing that explicit geometric structure modeling improves enhancement quality and robustness and supports unsupervised learning. Abstract: Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning. Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.

[75] LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting

Yuhan Chen,Wenxuan Yu,Guofa Li,Yijun Xu,Ying Fang,Yicui Shi,Long Cao,Wenbo Chu,Keqiang Li

Main category: cs.CV

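TL;DR: This paper proposes LL-GaussianImage, the first framework to perform zero-shot unsupervised low-light enhancement directly in the 2D Gaussian Splatting (2DGS) compressed representation domain, avoiding the conventional decompression-enhancement-recompression pipeline while combining efficiency with high-quality reconstruction.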

Details

Motivation: Existing low-light enhancement methods operate mainly in the pixel domain, so processing 2DGS-compressed images requires a cumbersome decompression-enhancement-recompression pipeline that is inefficient and introduces secondary degradation. Method: The paper proposes a semantic-guided Mixture-of-Experts enhancement framework, a multi-objective collaborative loss function system, and a two-stage optimization process, achieving compression-as-enhancement and reconstruction-as-enhancement in the sparse attribute space of 2DGS. Result: High-quality low-light enhancement is achieved while high compression ratios are maintained, and experiments confirm the feasibility and advantages of processing directly in the compressed representation domain. Conclusion: LL-GaussianImage establishes a new paradigm of enhancing directly within the compressed domain of an explicit scene representation, combining efficiency, fidelity, and robustness. Abstract: 2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image compression due to high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly within the pixel domain. Processing 2DGS-compressed images necessitates a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework designed for low-light enhancement directly within the 2DGS compressed representation domain. Three primary advantages are offered by this framework. First, a semantic-guided Mixture-of-Experts enhancement framework is designed. Dynamic adaptive transformations are applied to the sparse attribute space of 2DGS using rendered images as guidance to enable compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system is established to strictly constrain smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process is utilized to achieve reconstruction-as-enhancement. The accuracy of the base representation is ensured through single-scale reconstruction and network robustness is enhanced. High-quality enhancement of low-light images is achieved while high compression ratios are maintained. The feasibility and superiority of the paradigm for direct processing within the compressed representation domain are validated through experimental results.

[76] Diffusion Model-Based Data Augmentation for Enhanced Neuron Segmentation

Liuyun Jiang,Yanchao Zhang,Jinyue Guo,Yizhuo Lu,Ruining Zhou,Hua Han

Main category: cs.CV

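TL;DR: This paper proposes a diffusion-based data augmentation framework for neuron segmentation in electron microscopy. A resolution-aware conditional diffusion model and a biology-guided mask remodeling module generate structurally diverse and plausible image-label pairs, significantly improving segmentation under low-annotation regimes.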

Details

Motivation: Deep learning-based neuron segmentation methods depend on large amounts of manually annotated data, and conventional data augmentation produces samples with too little structural diversity. Method: The paper proposes a diffusion-driven data augmentation framework comprising a resolution-aware multi-scale conditional diffusion model, which synthesizes voxel-level images from 3D masks, and a biology-guided mask remodeling module, which improves the structural realism of the masks. Result: Under low-annotation settings on the AC3 and AC4 datasets, the ARAND metric improves by 32.1% and 30.7%, respectively, when combined with two post-processing methods. Conclusion: The framework alleviates annotation scarcity and improves neuron segmentation accuracy; the code is open source. Abstract: Neuron segmentation in electron microscopy (EM) aims to reconstruct the complete neuronal connectome; however, current deep learning-based methods are limited by their reliance on large-scale training data and extensive, time-consuming manual annotations. Traditional methods augment the training set through geometric and photometric transformations; however, the generated samples remain highly correlated with the original images and lack structural diversity. To address this limitation, we propose a diffusion-based data augmentation framework capable of generating diverse and structurally plausible image-label pairs for neuron segmentation. Specifically, the framework employs a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors to enable voxel-level image synthesis from 3D masks. It further incorporates a biology-guided mask remodeling module that produces augmented masks with enhanced structural realism. Together, these components effectively enrich the training set and improve segmentation performance. On the AC3 and AC4 datasets under low-annotation regimes, our method improves the ARAND metric by 32.1% and 30.7%, respectively, when combined with two different post-processing methods. Our code is available at https://github.com/HeadLiuYun/NeuroDiff.

[77] Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video

Pascal Benschop,Justin Dauwels,Jan van Gemert

Main category: cs.CV

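TL;DR: This paper introduces a synthetic video benchmark for evaluating the spatial reasoning of vision language models (VLMs), focused on situational and spatial awareness. Existing models perform only slightly above chance, and the paper examines the potential role of lightweight priors such as color cues.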

Details

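Motivation: Vision language models are fragile on spatial reasoning tasks that depend on subtle temporal or geometric cues, and there is no systematic diagnostic tool for identifying their specific weaknesses. Method: The authors build a synthetic video benchmark of minimal contrastive video pairs that tests three spatial and situational reasoning tasks: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. Mainstream VLMs are evaluated in a training-free setting, and stable color cues are introduced as an auxiliary intervention. Result: Performance is only slightly above chance on all tasks; stable color cues partly reduce assailant role confusion but do not resolve the underlying spatial reasoning weakness. Conclusion: Current VLMs are seriously deficient in spatial reasoning, and lightweight spatial priors should be combined with large-scale pretraining; the benchmark gives follow-up research a reproducible diagnostic tool and a direction for improvement. Abstract: Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.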

[78] A Mobile Application for Flower Recognition System Based on Convolutional Neural Networks

Mustafa Yurdakul,Enes Ayan,Fahrettin Horasan,Sakir Tasdemir

Main category: cs.CV

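TL;DR: This paper presents a CNN-based mobile application that lets non-specialists quickly identify flower types. Comparing three models (MobileNet, DenseNet121, and Xception) across seven optimization algorithms, it finds that DenseNet121 with SGD performs best, reaching 95.84% accuracy.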

Details

Motivation: Identifying flowers usually requires expert knowledge, but experts are not always available, so a convenient mobile tool is needed to help non-specialists identify flowers. Method: Three CNN models (MobileNet, DenseNet121, and Xception) are each trained and evaluated with seven optimization algorithms. Result: DenseNet121 with the SGD optimizer performs best, reaching 95.84% accuracy and 96.00% precision, recall, and F1-score. Conclusion: CNN models, DenseNet121 in particular, are suitable for flower classification on mobile devices, offering both practicality and high accuracy. Abstract: A convolutional neural network (CNN) is a deep learning algorithm that has been specifically designed for computer vision applications. CNNs have proved successful in handling the increasing amount of data in many computer vision problems where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decorating to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge. However, accessing experts at any time and in any location may not always be feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers and to provide non-specialists with quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet121, and Xception, to determine the most suitable model for the mobile application. The classification performances of the models were evaluated by training them with seven different optimization algorithms. The DenseNet121 architecture with the stochastic gradient descent (SGD) optimization algorithm was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.

[79] Beyond Off-the-Shelf Models: A Lightweight and Accessible Machine Learning Pipeline for Ecologists Working with Image Data

Clare Chemery,Hendrik Edelhoff,Ludwig Bothmann

Main category: cs.CV

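TL;DR: This paper introduces a lightweight machine learning experimentation pipeline that lowers the barrier for ecologists to use image classification, letting them build task-specific classifiers on local data independently. Its effectiveness is demonstrated on age and sex classification of red deer.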

Details

Motivation: To lower the barrier for ecologists applying machine learning to image classification, so that they no longer depend on off-the-shelf models and can build models tailored to local data and specific research questions. Method: The authors develop a lightweight ML experimentation pipeline that combines a command-line interface (for preprocessing, training, and evaluation) with a graphical interface (for annotation, error analysis, and model comparison), and test multiple backbone networks, parameter configurations, and data augmentation strategies on a red deer image dataset. Result: The best models reach 90.77% accuracy for age classification and 96.15% for sex classification of red deer, showing that narrow, well-defined ecological problems can be addressed with limited data. Conclusion: The framework gives ecologists an accessible, iterative ML modeling tool and supports broader use of machine learning in wildlife monitoring and demographic analysis. Abstract: We introduce a lightweight experimentation pipeline designed to lower the barrier for applying machine learning (ML) methods for classifying images in ecological research. We enable ecologists to experiment with ML models independently, so that they can move beyond off-the-shelf models and generate insights tailored to local datasets, specific classification tasks, and target variables. Our tool combines a simple command-line interface for preprocessing, training, and evaluation with a graphical interface for annotation, error analysis, and model comparison. This design enables ecologists to build and iterate on compact, task-specific classifiers without requiring advanced ML expertise. As a proof of concept, we apply the pipeline to classify red deer (Cervus elaphus) by age and sex from 3392 camera trap images collected in the Veldenstein Forest, Germany. Using 4352 cropped images containing individual deer labeled by experts, we trained and evaluated multiple backbone architectures with a wide variety of parameters and data augmentation strategies. Our best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification. These results demonstrate that reliable demographic classification is feasible even with limited data when the ecological problem is narrow and well defined. More broadly, the framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, paving the way for broader adoption of ML in wildlife monitoring and demographic analysis.

[80] Towards Realistic Remote Sensing Dataset Distillation with Discriminative Prototype-guided Diffusion

Yonghao Xu,Pedram Ghamisi,Qihao Weng

Main category: cs.CV

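TL;DR: This paper is the first to bring dataset distillation to remote sensing image interpretation. It uses a text-to-image diffusion model to condense large-scale remote sensing datasets, and improves the discriminability and diversity of the synthesized samples through classifier guidance and latent space clustering.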

Details

Motivation: To address the high storage and computational costs and the risk of sensitive data leakage that come with deep learning's reliance on large-scale annotated data in remote sensing image interpretation. Method: The paper proposes a dataset distillation method based on a text-to-image diffusion model. A classification consistency loss provides classifier-driven guidance, latent space clustering selects representative prototypes as visual style guidance, and a visual language model generates aggregated text descriptions. Result: On three high-resolution remote sensing scene classification benchmarks, the generated samples are realistic and diverse and effectively support downstream model training. Conclusion: The method offers an efficient, secure, and lightweight way to supply data for remote sensing image interpretation. Abstract: Recent years have witnessed the remarkable success of deep learning in remote sensing image interpretation, driven by the availability of large-scale benchmark datasets. However, this reliance on massive training data also brings two major challenges: (1) high storage and computational costs, and (2) the risk of data leakage, especially when sensitive categories are involved. To address these challenges, this study introduces the concept of dataset distillation into the field of remote sensing image interpretation for the first time. Specifically, we train a text-to-image diffusion model to condense a large-scale remote sensing dataset into a compact and representative distilled dataset. To improve the discriminative quality of the synthesized samples, we propose classifier-driven guidance that injects a classification consistency loss from a pre-trained model into the diffusion training process. In addition, considering the rich semantic complexity of remote sensing imagery, we further perform latent space clustering on training samples to select representative and diverse prototypes as visual style guidance, while using a visual language model to provide aggregated text descriptions. Experiments on three high-resolution remote sensing scene classification benchmarks show that the proposed method can distill realistic and diverse samples for downstream model training. Code and pre-trained models are available online (https://github.com/YonghaoXu/DPD).

[81] An IoT-Based Smart Plant Monitoring and Irrigation System with Real-Time Environmental Sensing, Automated Alerts, and Cloud Analytics

Abdul Hasib,A. S. M. Ahsanul Sarkar Akib

Main category: cs.CV

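TL;DR: This paper presents a low-cost IoT-based smart plant monitoring system that integrates multiple sensors, automated irrigation, and cloud analytics, enabling precision agriculture monitoring and reducing water use by about 40%.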

Details

Motivation: Traditional farming relies on manual observation and periodic watering, which wastes water, produces uneven plant growth, and responds slowly to environmental change, so an intelligent and sustainable monitoring solution is needed. Method: An ESP32 microcontroller collects data from DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, with feedback through an OLED display and a buzzer alarm. The data is uploaded wirelessly to the ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alerts, and an accompanying web dashboard provides visualization. Result: The system maintains soil moisture with 92% accuracy, provides real-time environmental monitoring, and saves about 40% of water, at a total cost of only $45.20, making it suitable for both home gardening and commercial agriculture. Conclusion: The system is an economical, scalable, and practical smart agriculture solution that improves resource efficiency and crop health management. Abstract: The increasing global demand for sustainable agriculture necessitates intelligent monitoring systems that optimize resource utilization and plant health management. Traditional farming methods rely on manual observation and periodic watering, often leading to water wastage, inconsistent plant growth, and delayed response to environmental changes. This paper presents a comprehensive IoT-based smart plant monitoring system that integrates multiple environmental sensors with automated irrigation and cloud analytics. The proposed system utilizes an ESP32 microcontroller to collect real-time data from DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, with visual feedback through an OLED display and auditory alerts via a buzzer. All sensor data is wirelessly transmitted to the ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alert generation. Experimental results demonstrate the system's effectiveness in maintaining optimal soil moisture levels (with 92% accuracy), providing real-time environmental monitoring, and reducing water consumption by approximately 40% compared to conventional irrigation methods. The integrated web dashboard offers comprehensive visualization of plant health parameters, making it suitable for both small-scale gardening and commercial agriculture applications. With a total implementation cost of $45.20, this system provides an affordable, scalable solution for precision agriculture and smart farming.
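The abstract does not describe the irrigation control logic. A common choice for this kind of system is threshold control with hysteresis, sketched below under that assumption: the pump switches on below a low moisture threshold, off above a high one, and otherwise keeps its state so it does not chatter around a single set point. The threshold values and function names are hypothetical, not taken from the paper.

```python
def pump_command(soil_moisture_pct, pump_is_on, low=35.0, high=55.0):
    """Decide the pump state from a soil moisture reading (percent).

    Hysteresis (assumed control rule): turn on below `low`, turn off above
    `high`, and keep the current state for readings in between.
    """
    if soil_moisture_pct < low:
        return True
    if soil_moisture_pct > high:
        return False
    return pump_is_on


def simulate(readings):
    """Run the controller over a sequence of readings, pump initially off."""
    state = False
    history = []
    for reading in readings:
        state = pump_command(reading, state)
        history.append(state)
    return history


# Hypothetical sequence of soil moisture readings.
pump_states = simulate([50, 34, 40, 54, 56, 45])
```

The gap between the two thresholds trades responsiveness against pump wear; on real hardware the same decision would be made from calibrated sensor readings rather than raw percentages.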

[82] TinySense: Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing

Toan Gian,Dung T. Tran,Viet Quoc Pham,Francesco Restuccia,Van-Dinh Nguyen

Main category: cs.CV

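TL;DR: TinySense is an efficient VQGAN-based compression framework for Wi-Fi CSI data that improves the scalability of device-free, privacy-preserving human pose estimation (HPE), significantly reducing latency and network overhead while maintaining high HPE accuracy.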

Details

Motivation: Existing Wi-Fi sensing methods process large volumes of CSI data directly, which strains network resources and makes it hard to meet the needs of device-free, privacy-preserving human pose estimation. Method: TinySense (1) compresses CSI data with a compact codebook learned by a VQGAN; (2) uses K-means to cluster the pre-trained codebook dynamically so the compression bitrate can be adapted; (3) adds a Transformer model to mitigate bitrate loss and improve robustness to unreliable networks; and (4) is prototyped on Jetson Nano and Raspberry Pi. Result: Compared with state-of-the-art compression schemes at the same compression rate, TinySense achieves up to 1.5x higher HPE accuracy (PCK20), up to 5x lower latency, and up to 2.5x lower network overhead. Conclusion: By jointly optimizing compression efficiency and sensing accuracy, TinySense provides an efficient, robust, and practical solution for Wi-Fi human pose estimation in resource-constrained edge environments. Abstract: With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to cluster a large-scale pre-trained codebook into smaller subsets, dynamically adjusting compression bitrates. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead by up to 5x and 2.5x, respectively. The code repository is available online.
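The core of codebook-based compression can be sketched generically: each feature vector is replaced by the index of its nearest codeword, so only an integer index needs to be transmitted. The snippet below is a plain vector quantization illustration, not TinySense's implementation; the toy codebook, the vectors, and the helper names are assumptions.

```python
import math


def nearest_codeword(vector, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def quantize(vectors, codebook):
    """Map every vector to the index of its nearest codeword."""
    return [nearest_codeword(v, codebook) for v in vectors]


def bits_per_vector(codebook_size):
    """Bits needed to transmit one codeword index with fixed-length coding."""
    return math.ceil(math.log2(codebook_size))


# Toy 2-D codebook and feature vectors (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
features = [[0.1, -0.2], [0.9, 1.2], [4.8, 5.1]]
indices = quantize(features, codebook)
```

With fixed-length coding the cost per vector depends only on the codebook size, so a smaller codebook subset, such as one obtained by clustering a larger codebook, lowers the bitrate at the price of coarser reconstruction.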

[83] A Lightweight Brain-Inspired Machine Learning Framework for Coronary Angiography: Hybrid Neural Representation and Robust Learning Strategies

Jingsong Xia,Siqi Wang

Main category: cs.CV

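TL;DR: This paper proposes a lightweight, brain-inspired deep learning framework for binary classification of coronary angiography (CAG) images. Using neural-plasticity training, an attention-modulated loss, and class-imbalance-aware sampling, it achieves strong robustness and generalization under limited resources.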

Details

Motivation: In real clinical practice, coronary angiography images exhibit complex lesion morphology, severe class imbalance, annotation uncertainty, and limited computational resources, which leave conventional deep learning methods lacking in robustness and generalization. Method: Build a lightweight hybrid neural representation on a pretrained CNN; introduce a selective neural plasticity training strategy; design a brain-inspired attention-modulated loss that combines Focal Loss with label smoothing; and adopt class-imbalance-aware sampling with cosine annealing with warm restarts. Result: On the binary classification task, the model achieves competitive accuracy, recall, F1-score, and AUC while maintaining high computational efficiency and stable performance. Conclusion: Brain-inspired learning mechanisms can effectively improve lightweight models for medical image analysis, providing a biologically plausible and deployable solution for intelligent clinical decision support in resource-constrained settings. Abstract: Background: Coronary angiography (CAG) is a cornerstone imaging modality for assessing coronary artery disease and guiding interventional treatment decisions. However, in real-world clinical settings, angiographic images are often characterized by complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, posing substantial challenges to conventional deep learning approaches in terms of robustness and generalization. Methods: The proposed framework is built upon a pretrained convolutional neural network to construct a lightweight hybrid neural representation. A selective neural plasticity training strategy is introduced to enable efficient parameter adaptation. Furthermore, a brain-inspired attention-modulated loss function, combining Focal Loss with label smoothing, is employed to enhance sensitivity to hard samples and uncertain annotations. Class-imbalance-aware sampling and cosine annealing with warm restarts are adopted to mimic rhythmic regulation and attention allocation mechanisms observed in biological neural systems. Results: Experimental results demonstrate that the proposed lightweight brain-inspired model achieves strong and stable performance in binary coronary angiography classification, yielding competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency. Conclusion: This study validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible and deployable solution for intelligent clinical decision support under limited computational resources.
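
Two ingredients named in this abstract, a focal loss combined with label smoothing and cosine annealing with warm restarts, can be sketched in a few lines of Python. The abstract does not say how the paper combines the two loss terms, so the combination below (a focal loss evaluated against a smoothed binary target) is an assumption, as are the function names and all default hyperparameters. The schedule uses a fixed restart period for simplicity.

```python
import math

def smoothed_focal_loss(p, y, gamma=2.0, eps=0.1):
    """Binary focal loss against a label-smoothed target (illustrative form).
    p: predicted probability of the positive class; y: hard label in {0, 1}."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)           # avoid log(0)
    y_soft = y * (1.0 - eps) + 0.5 * eps        # label smoothing
    pos = -((1.0 - p) ** gamma) * math.log(p)   # focal term for class 1
    neg = -(p ** gamma) * math.log(1.0 - p)     # focal term for class 0
    return y_soft * pos + (1.0 - y_soft) * neg

def cosine_warm_restart_lr(step, lr_max=1e-3, lr_min=1e-5, period=10):
    """Cosine annealing that jumps back to lr_max every `period` steps."""
    t = step % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))

hard = smoothed_focal_loss(0.2, 1)   # positive sample predicted with low confidence
easy = smoothed_focal_loss(0.9, 1)   # positive sample predicted with high confidence
schedule = [cosine_warm_restart_lr(s) for s in range(20)]
```

The focal factor shrinks the loss of well-classified samples, so the hard sample dominates the gradient, while the schedule decays within each period and then restarts at the maximum learning rate.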

[84] Out-of-Distribution Detection Based on Total Variation Estimation

Dabiao Ma,Zhiba Su,Jian Yang,Haojun Fei

Main category: cs.CV

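TL;DR: This paper proposes TV-OOD, a new out-of-distribution detection method that uses the Total Variation Network Estimator to compute each input's contribution to the total variation, thereby separating in-distribution from out-of-distribution data; on image classification tasks it matches or outperforms existing state-of-the-art methods.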

Details

Motivation: Address the safety risks that distribution shift poses to machine learning models in real deployments by improving out-of-distribution detection. Method: TV-OOD uses the Total Variation Network Estimator to compute each input's contribution to the total variation, defines this contribution as the total variation score, and uses the score to separate in-distribution from out-of-distribution data. Result: On image classification tasks across a range of models and datasets, TV-OOD matches or exceeds leading out-of-distribution detection methods on all evaluation metrics. Conclusion: TV-OOD is an efficient and robust out-of-distribution detection method that can improve the reliability and safety of machine learning models in real-world settings. Abstract: This paper introduces the Total Variation Out-of-Distribution (TV-OOD) detection method, a novel approach to securing machine learning model deployments against potential distribution shifts in practical applications. Existing methods have produced satisfactory results, but TV-OOD improves upon these by leveraging the Total Variation Network Estimator to calculate each input's contribution to the overall total variation. By defining this as the total variation score, TV-OOD discriminates between in- and out-of-distribution data. The method's efficacy was tested across a range of models and datasets, consistently yielding results in image classification tasks that were either comparable or superior to those achieved by leading-edge out-of-distribution detection techniques across all evaluation metrics.

Yifan Chen,Fei Yin,Hao Chen,Jia Wu,Chao Li

Main category: cs.CV

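TL;DR: This paper introduces the first public, fully paired, pan-cancer medical imaging dataset, spanning 11 human organs with paired MRI dynamic contrast-enhanced sequences (DCE1-DCE3) and paired non-contrast and contrast-enhanced CT (CT/CTC), and builds on it a comprehensive benchmark for contrast-enhanced image synthesis to advance safe and efficient AI-assisted imaging diagnosis.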

Details

Motivation: Research on AI-based synthesis of contrast-enhanced images is constrained by data scarcity: public datasets are mostly limited to paired brain MRI; other collections suffer from incomplete pairing, missing modalities or timestamps, spatial misalignment, and a lack of explicit enhancement-phase labels; and a large amount of high-quality data remains private. Method: Build the first public, fully paired, pan-cancer medical imaging dataset across 11 organs (with complete three-phase DCE-MRI sequences and CT/CTC pairs), curated for anatomical consistency; design a comprehensive evaluation benchmark supporting 1-to-1, N-to-1, and N-to-N translation tasks; and systematically evaluate mainstream image-to-image translation models on it. Result: The work establishes a comprehensive benchmark for contrast-enhanced image synthesis and reports the performance of several baseline models in multi-organ, multi-modality, multi-phase settings; the dataset and code are open source, easing the data bottleneck in this field. Conclusion: By providing a high-quality, multi-organ paired dataset and a standardized benchmark, this work supplies key infrastructure for contrast-agent-free image synthesis and may improve the safety, accessibility, and clinical utility of tumor imaging diagnosis. Abstract: Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient's health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation. We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at https://github.com/YifanChen02/PMPBench.

[86] Understanding the Transfer Limits of Vision Foundation Models

Shiqi Huang,Yipei Wang,Natasha Thorley,Alexander Ng,Shaheer Saeed,Mark Emberton,Shonit Punwani,Veeru Kasivisvanathan,Dean Barratt,Daniel Alexander,Yipeng Hu

Main category: cs.CV

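TL;DR: This paper examines why vision foundation models (VFMs) improve unevenly across downstream tasks, attributes this to a mismatch between pretraining objectives and downstream task demands, and uses prostate multiparametric MRI tasks to show that the degree of alignment between pretraining and downstream tasks (e.g., measured by MMD) affects transfer performance.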

Details

Motivation: Vision foundation models perform unevenly across downstream tasks, possibly because pretraining objectives are mismatched with downstream task demands. Method: Evaluate two vision foundation models (the MAE-based ProFound and the contrastive-learning-based ProViCNet) on five clinical tasks in prostate multiparametric MRI, and analyze the alignment between pretraining and downstream tasks with simple divergence metrics such as maximum mean discrepancy (MMD). Result: The better the alignment between pretraining and downstream tasks (the smaller the MMD), the larger the transfer gain and the faster fine-tuning converges. Conclusion: Pretraining objectives should be designed with downstream applicability in mind, and alignment analysis (e.g., MMD) can serve as a useful indicator for choosing a pretraining strategy. Abstract: Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
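
As a rough illustration of the divergence metric mentioned above, the following sketch computes a biased estimate of squared MMD with a Gaussian (RBF) kernel between two small feature sets. The kernel choice, the bandwidth `sigma`, and the toy feature vectors are assumptions for illustration; the abstract does not specify which kernel or estimator the authors use.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian kernel between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / (len(xs) * len(xs))
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / (len(ys) * len(ys))
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Hypothetical features of the same inputs before and after fine-tuning.
before = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
after_small_shift = [[0.1, 0.0], [0.3, 0.1], [0.2, 0.3]]
after_large_shift = [[2.0, 2.0], [2.2, 2.1], [2.1, 2.3]]

small_gap = mmd2(before, after_small_shift)
large_gap = mmd2(before, after_large_shift)
```

Under the reading in the abstract, a small value such as `small_gap` would indicate features that barely move during fine-tuning, that is, a pretraining objective already well aligned with the downstream task.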

[87] RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

Anas Anwarul Haq Khan,Mariam Husain,Kshitij Jadhav

Main category: cs.CV

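TL;DR: RadJEPA is a self-supervised framework for medical visual representation learning that needs no language supervision. It is pre-trained only on unlabeled chest X-ray images by predicting latent representations of masked image regions, and it outperforms existing methods such as Rad-DINO.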

Details

Motivation: Existing medical vision-language models depend on scarce paired image-text data, which limits their scalability; this work asks whether a robust radiology encoder can be learned from unlabeled medical images alone, without language supervision. Method: RadJEPA builds on the Joint Embedding Predictive Architecture (JEPA) and is pre-trained in a self-supervised manner on unlabeled chest X-ray images, with the objective of predicting latent-space representations of masked image regions, in contrast to image-text alignment or DINO-style self-distillation. Result: On disease classification, semantic segmentation, and report generation, RadJEPA exceeds current state-of-the-art methods, including Rad-DINO. Conclusion: Image-only self-supervised learning can produce a high-performing radiology encoder without language supervision, offering a new paradigm for low-resource medical AI. Abstract: Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.

Zhaoqi Su,Shihai Chen,Xinyan Lin,Liqin Huang,Zhipeng Su,Xiaoqiang Lu

Main category: cs.CV

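TL;DR: This paper proposes ThermoSplat, a framework that uses cross-modal FiLM modulation and modality-adaptive geometric decoupling to achieve deep spectral-aware 3D Gaussian Splatting reconstruction from RGB and thermal infrared data, reaching state-of-the-art rendering quality in both the visible and thermal spectrums on the RGBT-Scenes dataset.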

Details

Motivation: Existing 3D Gaussian Splatting (3DGS) methods struggle to fuse RGB and thermal infrared data effectively; they either neglect cross-modal correlations or rely on shared representations that cannot adapt to the structural differences and physical discrepancies between spectrums. Method: ThermoSplat comprises (1) a Cross-Modal FiLM Modulation mechanism that uses thermal structural priors to dynamically condition shared latent features and guide visible texture synthesis; (2) a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets for the thermal branch and runs an independent rasterization pass; and (3) a hybrid rendering pipeline that combines explicit Spherical Harmonics with implicit neural decoding. Result: On the RGBT-Scenes dataset, ThermoSplat achieves state-of-the-art rendering quality in both the visible and thermal spectrums. Conclusion: Through spectral-aware feature modulation and geometric decoupling, ThermoSplat improves the robustness and fidelity of multi-spectral scene reconstruction and offers a new paradigm for multi-modal 3D perception in complex environments. Abstract: Multi-modal scene reconstruction integrating RGB and thermal infrared data is essential for robust environmental perception across diverse lighting and weather conditions. However, extending 3D Gaussian Splatting (3DGS) to multi-spectral scenarios remains challenging. Current approaches often struggle to fully leverage the complementary information of multi-modal data, typically relying on mechanisms that either tend to neglect cross-modal correlations or leverage shared representations that fail to adaptively handle the complex structural correlations and physical discrepancies between spectrums. To address these limitations, we propose ThermoSplat, a novel framework that enables deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling. First, we introduce a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors, effectively guiding visible texture synthesis with reliable cross-modal geometric cues. Second, to accommodate modality-specific geometric inconsistencies, we propose a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and executes an independent rasterization pass for the thermal branch. Additionally, a hybrid rendering pipeline is employed to integrate explicit Spherical Harmonics with implicit neural decoding, ensuring both semantic consistency and high-frequency detail preservation. Extensive experiments on the RGBT-Scenes dataset demonstrate that ThermoSplat achieves state-of-the-art rendering quality across both visible and thermal spectrums.
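
FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters predicted from a conditioning signal. The sketch below shows only that generic operation; it is not ThermoSplat's architecture. The linear maps, weight values, the "1 + ..." parameterization of gamma, and the single-element thermal prior are hypothetical.

```python
def film_params(condition, w_gamma, w_beta):
    """Toy linear maps from a conditioning vector to per-channel (gamma, beta).
    gamma is parameterized as 1 + w_gamma @ condition so that a zero
    condition leaves the features unchanged (an illustrative choice)."""
    gamma = [1.0 + sum(w * c for w, c in zip(row, condition)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, condition)) for row in w_beta]
    return gamma, beta

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

shared_features = [1.0, 2.0]   # hypothetical shared latent features (2 channels)
thermal_prior = [0.5]          # hypothetical thermal conditioning vector
gamma, beta = film_params(thermal_prior, w_gamma=[[1.0], [0.0]], w_beta=[[0.0], [2.0]])
modulated = film(shared_features, gamma, beta)
```

Here the first channel is rescaled and the second is shifted, which is the sense in which a conditioning signal can steer shared features without replacing them.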

[89] Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models

Zhen Zhang,Runhao Zeng,Sicheng Zhao,Xiping Hu

Main category: cs.CV

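TL;DR: Through a systematic mechanistic study, this paper finds that affective modeling in multimodal foundation models relies mainly on the gating projection (gate_proj) in the feed-forward network rather than on the attention module; tuning only this module achieves affective-task performance close to that of the AffectGPT baseline while tuning far fewer parameters.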

Details

Motivation: Understanding the internal mechanisms of affective representation in large-scale foundation models, especially in multimodal affective settings, remains an open problem. Existing affective models perform well, but the architectural mechanisms that support affective understanding and generation are unclear. Method: Conduct a systematic mechanistic analysis across multiple architectures, training strategies, and affective tasks, examining how emotion-oriented supervision reshapes internal model parameters; verify the role of gate_proj through controlled module transfer, targeted single-module adaptation, and destructive ablation. Result: Affective adaptation localizes mainly to the feed-forward gating projection (gate_proj) rather than the attention module; tuning only about 24.5% of the parameters tuned by AffectGPT achieves 96.6% of its average performance across eight affective tasks; gate_proj is shown to be sufficient, efficient, and necessary for affective understanding and generation. Conclusion: Affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms, and gate_proj is a central architectural locus of affective modeling. Abstract: Understanding where and how emotions are represented in large-scale foundation models remains an open problem, particularly in multimodal affective settings. Despite the strong empirical performance of recent affective models, the internal architectural mechanisms that support affective understanding and generation are still poorly understood. In this work, we present a systematic mechanistic study of affective modeling in multimodal foundation models. Across multiple architectures, training strategies, and affective tasks, we analyze how emotion-oriented supervision reshapes internal model parameters. Our results consistently reveal a clear and robust pattern: affective adaptation does not primarily focus on the attention module, but instead localizes to the feed-forward gating projection (gate_proj). Through controlled module transfer, targeted single-module adaptation, and destructive ablation, we further demonstrate that gate_proj is sufficient, efficient, and necessary for affective understanding and generation. Notably, by tuning only approximately 24.5% of the parameters tuned by AffectGPT, our approach achieves 96.6% of its average performance across eight affective tasks, highlighting substantial parameter efficiency. Together, these findings provide empirical evidence that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identify gate_proj as a central architectural locus of affective modeling.

[90] The Latency Wall: Benchmarking Off-the-Shelf Emotion Recognition for Real-Time Virtual Avatars

Yarin Benyamin

Main category: cs.CV

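TL;DR: This paper examines the feasibility of real-time emotion recognition in VR to support individuals with Autism Spectrum Disorder (ASD). It finds that general-purpose deep learning models struggle to meet both the low-latency (<140 ms) and accuracy requirements, with a "latency wall" in the classification stage in particular. YOLOv11n performs best for face detection, whereas general-purpose Vision Transformers such as CLIP fall short in both accuracy and speed, so lightweight, domain-specific architectures are needed.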

Details

Motivation: Accessible VR therapy systems for individuals with ASD must balance strict real-time constraints (MTP < 140 ms) against emotion recognition accuracy, yet existing SOTA models are not optimized for recognizing expressions on virtual characters in VR. Method: Benchmark several zero-shot facial expression recognition (FER) models on the UIBVFED dataset: Medium and Nano variants of YOLOv8/v11/v12 for face detection, and general-purpose Vision Transformers (CLIP, SigLIP, ViT-FER) for expression classification; all experiments use CPU-only inference to approximate consumer-grade hardware. Result: Face detection on stylized avatars reaches 100% accuracy, and YOLOv11n detects in about 54 ms; however, a "latency wall" appears in the classification stage, where CLIP and SigLIP show accuracy below 23% and latency above 150 ms, which cannot support a real-time loop. Conclusion: General-purpose Transformer models are unsuitable for real-time zero-shot FER in VR therapy; lightweight architectures designed specifically for virtual-character expression recognition are required to enable accessible, real-time AI-assisted intervention. Abstract: In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and ViT-FER. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.

[91] A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery

Valery Fischer,Alan Magdaleno,Anna-Katharina Calek,Nicola Cavalcanti,Nathan Hoffman,Christoph Germann,Joschua Wüthrich,Max Krähenmann,Mazda Farshad,Philipp Fürnstahl,Lilian Calvet

Main category: cs.CV

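TL;DR: This paper proposes a multi-view 3D hand pose estimation method that requires no domain-specific fine-tuning and relies solely on pretrained models, and introduces a large-scale annotated hand dataset for surgical scenes; the method substantially improves 2D and 3D pose estimation accuracy.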

Details

Motivation: Surgical environments pose challenges such as intense localized lighting, frequent occlusions, uniform hand appearance due to gloves, and scarce annotated data, so robust 3D hand pose estimation methods are needed. Method: A multi-view pipeline integrates person detection, whole-body pose estimation, and SOTA 2D keypoint prediction on cropped hand regions, followed by a constrained 3D optimization; in addition, a surgical benchmark dataset is built with over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth. Result: Compared with baselines, the 2D mean joint error drops by 31% and the 3D mean per-joint position error drops by 76%. Conclusion: This work provides a practical training-free pipeline and a high-quality annotated dataset for 3D hand pose estimation in surgical scenes, laying a foundation for research in this direction. Abstract: Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
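
The 3D metric quoted above, mean per-joint position error (MPJPE), is the average Euclidean distance between predicted and ground-truth joints. The sketch below shows that computation and how a relative reduction such as "76%" is derived. The joint coordinates and the two error values are made-up numbers for illustration, not results from the paper.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def relative_reduction(baseline_err, new_err):
    """Fraction by which new_err is lower than baseline_err."""
    return (baseline_err - new_err) / baseline_err

# Two hypothetical joints, in arbitrary length units.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
err = mpjpe(pred, gt)

# Hypothetical baseline and method errors chosen to give a 76% reduction.
reduction = relative_reduction(10.0, 2.4)
```

In this toy case the per-joint distances are 3 and 4, so `err` is their mean, and an error falling from 10.0 to 2.4 units corresponds to a reduction of 0.76.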

[92] Class Confidence Aware Reweighting for Long Tailed Learning

Brainard Philemon Jagati,Jitendra Tembhurne,Harsh Goud,Rudra Pratap Singh,Chandrashekhar Meshram

Main category: cs.CV

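TL;DR: This paper proposes a loss-level, class- and confidence-aware re-weighting scheme to counter the performance degradation of deep neural networks under long-tailed data distributions; the scheme is complementary to existing logit-adjustment methods and is validated on several long-tailed datasets.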

Details

Motivation: Deep neural networks degrade significantly under long-tailed data distributions; existing work focuses mainly on adjustments in the decision space (e.g., at the logit level) to compensate for class-prior bias, and overlooks how differences in sample confidence affect the optimization process. Method: Design a purely loss-level, class- and confidence-aware re-weighting scheme that introduces a function Ω(p_t, f_c) to modulate each sample's contribution to training according to prediction confidence and relative class frequency. Result: On CIFAR-100-LT, ImageNet-LT, and iNaturalist2018, the scheme yields significant improvements under various imbalance factors, and the experiments support the theoretical analysis. Conclusion: The proposed re-weighting scheme mitigates class imbalance in long-tailed learning and is complementary to mainstream logit-correction methods, offering a new direction for long-tailed learning. Abstract: Deep neural network models degrade significantly under long-tailed data distributions, where the training data are dominated by a small set of head classes and the tail classes receive fewer training examples. To address class imbalance, the related literature has focused mainly on adjustments in the decision space, such as logit-level corrections that compensate for class-prior bias, and has paid little attention to how differences in confidence among samples affect the optimization process. In this study, we present a class- and confidence-aware re-weighting scheme for long-tailed learning. The scheme operates purely at the loss level and is complementary to existing logit-adjustment methods. In practice, it uses a function Ω(p_t, f_c) that modulates each sample's contribution to training based on the confidence of the prediction and the relative frequency of the corresponding class. Experiments on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors yield significant results that corroborate the theoretical discussion.
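
The abstract names a weighting function Ω(p_t, f_c) but does not give its form. Purely to make the idea concrete, the sketch below assumes one simple possibility: a focal-style confidence factor multiplied by an inverse-frequency factor. This specific formula, the exponents `gamma` and `beta`, and the function names are hypothetical and should not be read as the paper's definition.

```python
import math

def omega(p_t, f_c, gamma=1.0, beta=0.5):
    """Hypothetical weight, NOT the paper's formula: larger for
    low-confidence samples (small p_t) and for rare classes (small f_c).
    p_t: predicted probability of the true class, in (0, 1].
    f_c: relative frequency of the true class, in (0, 1]."""
    return ((1.0 - p_t) ** gamma) * (f_c ** -beta)

def reweighted_loss(p_t, f_c):
    """Cross-entropy on the true class, scaled by the assumed weight."""
    return omega(p_t, f_c) * -math.log(p_t)

head_confident = reweighted_loss(0.9, 0.5)    # frequent class, confident prediction
tail_uncertain = reweighted_loss(0.3, 0.01)   # rare class, uncertain prediction
```

Because the weight multiplies the loss rather than shifting the logits, a scheme of this kind can be stacked on top of a logit-adjustment method, which matches the complementarity the abstract describes.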

[93] NeuroMamba: Multi-Perspective Feature Interaction with Visual Mamba for Neuron Segmentation

Liuyun Jiang,Yizhuo Lu,Yanchao Zhang,Jiazheng Liu,Hua Han

Main category: cs.CV

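TL;DR: This paper proposes NeuroMamba, a multi-perspective neuron segmentation framework built on the Mamba architecture that combines patch-free global modeling with local detail preservation, improving the accuracy and robustness of neuron boundary segmentation in electron microscopy images.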

Details

Motivation: Existing CNN methods lack long-range context and therefore produce ambiguous boundaries, while Transformer methods lose voxel-level detail during patch partitioning and therefore produce imprecise boundaries; a method is needed that models global dependencies while preserving fine structure. Method: The NeuroMamba framework comprises (1) a channel-gated Boundary Discriminative Feature Extractor (BDFE) that enhances local morphological cues; (2) a Spatial Continuous Feature Extractor (SCFE) that embeds a resolution-aware scanning mechanism into Visual Mamba to adaptively model global dependencies across resolutions; and (3) a cross-modulation mechanism that fuses the multi-perspective features. Overall, Mamba provides patch-free global modeling with linear complexity. Result: The method achieves SOTA performance on four public EM datasets and adapts well to both anisotropic and isotropic resolutions. Conclusion: NeuroMamba resolves the tension between long-range modeling and detail preservation in neuron segmentation, offering a new paradigm for high-precision connectome reconstruction. Abstract: Neuron segmentation is the cornerstone of reconstructing comprehensive neuronal connectomes, which is essential for deciphering the functional organization of the brain. The irregular morphology and densely intertwined structures of neurons make this task particularly challenging. Prevailing CNN-based methods often fail to resolve ambiguous boundaries due to the lack of long-range context, whereas Transformer-based methods suffer from boundary imprecision caused by the loss of voxel-level details during patch partitioning. To address these limitations, we propose NeuroMamba, a multi-perspective framework that exploits the linear complexity of Mamba to enable patch-free global modeling and synergizes this with complementary local feature modeling, thereby efficiently capturing long-range dependencies while meticulously preserving fine-grained voxel details. Specifically, we design a channel-gated Boundary Discriminative Feature Extractor (BDFE) to enhance local morphological cues. Complementing this, we introduce the Spatial Continuous Feature Extractor (SCFE), which integrates a resolution-aware scanning mechanism into the Visual Mamba architecture to adaptively model global dependencies across varying data resolutions. Finally, a cross-modulation mechanism synergistically fuses these multi-perspective features. Our method demonstrates state-of-the-art performance across four public EM datasets, validating its exceptional adaptability to both anisotropic and isotropic resolutions. The source code will be made publicly available.

[94] EVolSplat4D: Efficient Volume-based Gaussian Splatting for 4D Urban Scene Synthesis

Sheng Miao,Sijin Li,Pan Wang,Dongfeng Bai,Bingbing Liu,Yue Wang,Andreas Geiger,Yiyi Liao

Main category: cs.CV

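TL;DR: EvolSplat4D is a feed-forward 4D Gaussian Splatting framework whose three-branch design handles close-range static regions, dynamic actors, and far-field scenery in a unified way, achieving high-quality, consistent, and fast novel view synthesis of dynamic and static scenes on urban driving datasets.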

Details

Motivation: Existing methods for novel view synthesis of static and dynamic urban scenes struggle to balance reconstruction speed and quality: neural radiance fields and 3D Gaussian Splatting need time-consuming per-scene optimization, while feed-forward methods mostly use per-pixel Gaussian representations, which cause 3D inconsistencies in complex dynamic environments. Method: The feed-forward EvolSplat4D framework has three specialized branches: (1) for close-range static regions, it predicts multi-frame-consistent 3D Gaussian geometry from a 3D feature volume and predicts appearance with a semantically-enhanced image-based rendering module; (2) for dynamic actors, it uses object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features; and (3) far-field scenery is covered by an efficient per-pixel Gaussian branch. Result: On the KITTI-360, KITTI, Waymo, and PandaSet datasets, EvolSplat4D outperforms both per-scene optimization methods and state-of-the-art feed-forward baselines in reconstruction accuracy and consistency. Conclusion: By unifying volume-based and pixel-based Gaussian prediction, EvolSplat4D addresses novel view synthesis for dynamic urban scenes with multiple scales and motion characteristics, providing an efficient, high-quality 4D reconstruction approach for autonomous driving simulation. Abstract: Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.

[95] HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models

Xin Xie,Jiaxian Guo,Dong Gong

Main category: cs.CV

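TL;DR: This paper proposes HyperAlign, a framework that trains a hypernetwork to generate low-rank adaptation weights at test time and thereby modulate the denoising process of a diffusion model, achieving efficient and effective reward alignment while avoiding both the diversity loss of fine-tuning and the computational overhead of test-time scaling.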

Details

Motivation: Although diffusion models achieve state-of-the-art performance, their outputs often fail to match human preferences and intentions, showing poor aesthetic quality and semantic inconsistencies; existing alignment methods struggle to balance diversity preservation and computational efficiency. Method: HyperAlign uses a hypernetwork to dynamically generate low-rank adaptation weights that modulate the diffusion model's generation operators; the denoising trajectory is adjusted adaptively based on input latents, timesteps, and prompts; several variants differ in how often the hypernetwork is applied; and the hypernetwork is optimized with a reward score objective regularized with preference data. Result: Evaluated on models including Stable Diffusion and FLUX, HyperAlign significantly outperforms existing fine-tuning and test-time scaling baselines in semantic consistency and visual appeal. Conclusion: HyperAlign offers an efficient, flexible, and robust paradigm for test-time alignment that eases the trade-off between reward over-optimization and computational cost. Abstract: Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model's generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
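
The phrase "dynamically generates low-rank adaptation weights" can be made concrete with a small sketch of the generic low-rank update W' = W + scale * (up @ down). The `toy_hypernetwork` below is a hypothetical stand-in that maps a timestep to rank-1 factors; HyperAlign's actual hypernetwork, its inputs, and its training objective are not reproduced here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_low_rank(weight, down, up, scale=1.0):
    """Return weight + scale * (up @ down).
    up has shape (out x r) and down has shape (r x in)."""
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def toy_hypernetwork(timestep):
    """Hypothetical stand-in: maps a timestep to rank-1 factors."""
    up = [[timestep], [0.0]]   # shape 2 x 1
    down = [[1.0, 2.0]]        # shape 1 x 2
    return up, down

weight = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy 2 x 2 identity)
up, down = toy_hypernetwork(0.5)
adapted = apply_low_rank(weight, down, up)
```

The base weight stays frozen and only the small factors change per call, which is why conditioning them on the timestep or prompt is cheap compared with fine-tuning the full matrix.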

[96] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Tingyu Song,Yanzhao Zhang,Mingxin Li,Zhuoning Guo,Dingkun Long,Pengjun Xie,Siyue Zhang,Yilun Zhao,Shu Wu

Main category: cs.CV

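TL;DR: This paper proposes an image-editing-based synthesis pipeline, uses it to build EDIR, a new fine-grained benchmark for Composed Image Retrieval (CIR), and evaluates 13 models to reveal both the capability gaps of existing models across diverse query categories and the limitations of current benchmarks.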

Details

Motivation: Existing CIR benchmarks cover a limited set of query categories and do not reflect the diverse requirements of real-world scenarios, so a more comprehensive, controllable, and fine-grained evaluation benchmark is needed. Method: Use image editing to precisely control modification types and content, and build the EDIR benchmark with 5,000 high-quality queries across five main categories and fifteen subcategories; systematically evaluate 13 multimodal embedding models, complemented by comparative analysis and an in-domain training experiment. Result: Even state-of-the-art models (e.g., RzenEmbed and GME) perform inconsistently on EDIR, exposing a significant capability gap; existing benchmarks are found to suffer from modality biases and insufficient category coverage; in-domain training improves some subcategories, while others expose intrinsic limitations of current model architectures. Conclusion: EDIR is a more challenging and realistic CIR benchmark that reveals the true capability boundaries of models and can drive the development of more robust, better-generalizing multimodal retrieval models. Abstract: Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.

[97] PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models

Chak-Wing Mak,Guanyu Zhu,Boyi Zhang,Hongji Li,Xiaowei Chi,Kevin Zhang,Yichen Wu,Yangfan He,Chun-Kai Fan,Wentao Lu,Kuangzhi Ge,Xinyu Fang,Hongyang He,Kuan Lu,Tianxiang Xu,Li Zhang,Yongxin Ni,Youhua Li,Shanghang Zhang

Main category: cs.CV

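TL;DR: This paper proposes the PhysicsMind benchmark for evaluating how well multimodal large models and video world models understand physical laws (Center of Mass, Lever Equilibrium, Newton's First Law). It covers two tasks, visual question answering and video generation, and shows that current models still rely on appearance heuristics and violate basic mechanics.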

Details

Motivation: Existing benchmarks cannot effectively measure a model's understanding of physical laws; they mostly rely on synthetic templates or perceptual quality and lack an evaluation of consistency with physical laws. Method: Build PhysicsMind, a unified benchmark with both real and simulated environments, and design VQA and video generation tasks that test, respectively, reasoning about physical quantities and whether motion trajectories obey center-of-mass, torque, and inertial constraints. Result: Mainstream multimodal and video generation models evaluated on PhysicsMind perform poorly and often violate basic mechanics, indicating that current training strategies and scale are insufficient for robust physical understanding. Conclusion: PhysicsMind provides a focused, extensible testbed for physics-aware multimodal models and highlights the need for modeling with physical priors. Abstract: Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure this either rely on synthetic Visual Question Answering templates or focus on perceptual video quality, which is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.

[98] Keyframe-Based Feed-Forward Visual Odometry

Weichen Dai,Wenhan Su,Da Kong,Yuhang Ming,Wanzeng Kong

Main category: cs.CV

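TL;DR: This paper proposes a reinforcement-learning-based adaptive keyframe selection policy for visual odometry (VO) that improves feed-forward VO built on visual foundation models without relying on hand-crafted rules.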

Details

Motivation: Existing VO methods based on visual foundation models (such as VGGT-Long) process raw image sequences directly and lack a keyframe mechanism, which causes computational redundancy and degraded performance due to low inter-frame parallax; moreover, traditional geometric heuristics are hard to integrate into foundation models that depend on high-dimensional latent representations. Method: A reinforcement-learning-based keyframe selection policy learns, in a data-driven manner, a keyframe strategy suited to the characteristics of the foundation model; the agent is trained on the TartanAir dataset and evaluated on several real-world datasets. Result: Experiments show that the method consistently and substantially outperforms state-of-the-art feed-forward VO methods on several real-world datasets. Conclusion: Bringing reinforcement learning into keyframe selection bridges the gap between traditional geometric methods and modern visual foundation models, improving the efficiency and accuracy of VO. Abstract: The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.

[99] PAINT: Pathology-Aware Integrated Next-Scale Transformation for Virtual Immunohistochemistry

Rongze Ma,Mengkang Lu,Zhenyu Xiang,Yongsheng Pan,Yicheng Wu,Qingjie Zeng,Yong Xia

Main category: cs.CV

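TL;DR: This paper proposes PAINT, a framework that uses structure-first autoregressive modeling and a Spatial Structural Start Map (3S-Map) to synthesize high-fidelity IHC-stained images from H&E images, markedly improving structural consistency and clinical task performance.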

Details

Motivation: Traditional IHC staining is costly and consumes tissue, while existing appearance-based virtual IHC methods lack sufficient structural priors and are prone to semantic inconsistencies. Method: Pathology-Aware Integrated Next-Scale Transformation (PAINT) is a visual autoregressive framework that conditions on a global structural layout and generates molecular details scale by scale in causal order; at its core is the Spatial Structural Start Map (3S-Map), which anchors the autoregressive initialization to the observed morphology to ensure spatially aligned, deterministic synthesis. Result: On the IHC4BC and MIST datasets, PAINT outperforms state-of-the-art methods in structural fidelity and in clinical downstream tasks such as protein expression quantification and subtype classification. Conclusion: Structure-guided autoregressive modeling is an effective paradigm for improving the quality and clinical usability of virtual IHC synthesis. Abstract: Virtual immunohistochemistry (IHC) aims to computationally synthesize molecular staining patterns from routine Hematoxylin and Eosin (H&E) images, offering a cost-effective and tissue-efficient alternative to traditional physical staining. However, this task is particularly challenging: H&E morphology provides ambiguous cues about protein expression, and similar tissue structures may correspond to distinct molecular states. Most existing methods focus on direct appearance synthesis to implicitly achieve cross-modal generation, often resulting in semantic inconsistencies due to insufficient structural priors. In this paper, we propose Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that reformulates the synthesis process as a structure-first conditional generation task. Unlike direct image translation, PAINT enforces a causal order by resolving molecular details conditioned on a global structural layout. Central to this approach is the introduction of a Spatial Structural Start Map (3S-Map), which grounds the autoregressive initialization in observed morphology, ensuring deterministic, spatially aligned synthesis. Experiments on the IHC4BC and MIST datasets demonstrate that PAINT outperforms state-of-the-art methods in structural fidelity and clinical downstream tasks, validating the potential of structure-guided autoregressive modeling.

[100] ProGiDiff: Prompt-Guided Diffusion-Based Medical Image Segmentation

Yuan Lin,Murong Xu,Marc Hölle,Chinmay Prabhakar,Andreas Maier,Vasileios Belagiannis,Bjoern Menze,Suprosanna Shit

Main category: cs.CV

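TL;DR: This paper proposes ProGiDiff, a framework that uses a pre-trained diffusion model and a ControlNet-style conditioning mechanism to perform multi-class medical image segmentation driven by natural language prompts, with support for cross-modality transfer and expert interaction.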

Details

Motivation: Existing medical image segmentation methods lack support for natural language prompts, multi-proposal generation, human interaction, and cross-modality adaptation; text-to-image diffusion models are promising but need large datasets, are hard to apply directly to multi-class segmentation, and cannot respond to natural language prompts. Method: ProGiDiff pairs a ControlNet-style image conditioning mechanism with a custom encoder to steer a pre-trained diffusion model toward outputting segmentation masks; the target organ is specified through a natural language prompt to support multi-class segmentation; low-rank, few-shot adaptation enables cross-modality (CT to MR) transfer. Result: The method outperforms previous approaches on CT organ segmentation, supports multi-proposal refinement with an expert in the loop, and its learned conditioning mechanism transfers to MR image segmentation with few-shot adaptation. Conclusion: ProGiDiff effectively combines generative modeling with segmentation, balancing flexibility, interpretability, and cross-modality generalization, and offers a new paradigm for interactive, prompt-driven medical image segmentation. Abstract: Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they lack the capability to estimate multiple proposals, human interaction, and cross-modality adaptation. Recently, text-to-image diffusion models have shown potential to bridge the gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation purposes. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting the target organ. Our experiment on organ segmentation from CT images demonstrates strong performance compared to previous methods and could greatly benefit from an expert-in-the-loop setting to leverage multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.

[101] DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

Chenyang Li,Jieyuan Liu,Bin Li,Bo Gao,Yilin Yuan,Yangfan He,Yuchen Li,Jingqun Tang

Main category: cs.CV

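TL;DR: This paper proposes a plug-and-play Distracting Token Pruning (DTP) framework that raises the success rate of Vision-Language Action (VLA) models on robotic manipulation tasks by dynamically detecting and pruning distracting image tokens in task-irrelevant regions, improving the model's visual attention patterns without modifying the architecture or adding extra inputs.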

Details

Motivation: By default, VLA models may attend too strongly to tokens in task-irrelevant image regions ("distracting tokens"), which interferes with action generation and lowers task success rates. Method: The Distracting Token Pruning (DTP) framework dynamically detects and prunes image tokens in task-irrelevant regions to correct the model's visual attention patterns. Result: On the SIMPLER benchmark, DTP yields consistent relative improvements in success rate across several novel VLA models, showing good generalization to transformer-based VLAs; the analysis finds a negative correlation between task success rate and the amount of attention placed on task-irrelevant regions. Conclusion: DTP is a simple, effective, plug-and-play method that raises the performance ceiling of VLA models and exposes a common attention bias in them, pointing to a direction for future research. Abstract: Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can distract the model from generating the desired action tokens at each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the performance upper bound of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
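
The pruning idea in this abstract can be illustrated with a minimal Python sketch. The abstract does not say how DTP detects distracting tokens, so the rule used here (a token is distracting when it lies outside a given task-relevant mask and its attention weight exceeds a threshold) is an assumption, and the names `prune_distracting_tokens`, `task_relevant`, and `threshold` are hypothetical.

```python
from typing import List


def prune_distracting_tokens(
    attention: List[float],
    task_relevant: List[bool],
    threshold: float,
) -> List[int]:
    """Return the indices of image tokens to keep.

    Assumed rule (not taken from the paper): a token is distracting when it
    is outside the task-relevant region and its attention exceeds `threshold`.
    """
    keep = []
    for idx, (weight, relevant) in enumerate(zip(attention, task_relevant)):
        distracting = (not relevant) and weight > threshold
        if not distracting:
            keep.append(idx)
    return keep


# Toy example: token 1 is irrelevant yet heavily attended, so it is pruned.
attention = [0.05, 0.40, 0.10, 0.30, 0.15]
task_relevant = [True, False, True, True, False]
kept = prune_distracting_tokens(attention, task_relevant, threshold=0.2)
print(kept)
```

Token 4 is also outside the relevant region but stays because its attention is below the threshold; in this sketch only tokens that actively draw attention away from the task are removed.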

[102] DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models

Hanwen Zhang,Qiaojin Shen,Yuxi Liu,Yuesheng Zhu,Guibo Luo

Main category: cs.CV

TL;DR: DSFedMed is a dual-scale federated framework for medical image segmentation that enables mutual knowledge distillation between a centralized foundation model and lightweight client models, improving performance while drastically reducing communication and inference costs.

Details

Motivation: Foundation Models (FMs) face challenges in federated settings due to high computational demands, communication overhead, and inference costs, which is especially critical in resource-limited medical applications. Method: DSFedMed introduces mutual knowledge distillation between a central foundation model and lightweight client models; it uses synthetically generated high-quality medical images and a learnability-guided sample selection strategy to enhance distillation efficiency and effectiveness. Result: On five medical imaging segmentation datasets, DSFedMed achieves an average Dice score improvement of about 2 percent and reduces communication costs and inference time by nearly 90 percent compared to existing federated FM baselines. Conclusion: DSFedMed significantly improves efficiency and scalability of foundation models in federated medical image segmentation, enabling practical deployment under resource constraints. Abstract: Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
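
For readers unfamiliar with the metric reported here, the following is a small Python sketch of the standard Dice coefficient on binary masks, Dice = 2|P ∩ T| / (|P| + |T|). It illustrates the metric only; the function name `dice_score` and the convention for two empty masks are assumptions, and the paper's exact evaluation protocol is not described in the abstract.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Toy example: one overlapping foreground pixel out of two in each mask.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_score(pred, target))  # 0.5
```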

[103] Masked Modeling for Human Motion Recovery Under Occlusions

Zhiyin Qian,Siwei Zhang,Bharat Lal Bhatnagar,Federica Bogo,Siyu Tang

Main category: cs.CV

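TL;DR: MoRo is an end-to-end generative method based on masked modeling that robustly reconstructs human motion from monocular video, performing especially well under frequent occlusions while combining high accuracy, high realism, and real-time inference (70 FPS).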

Details

Motivation: Existing methods are either fragile under occlusion (regression-based) or slow (optimization- and diffusion-based), and paired video-motion data is scarce. Method: The MoRo framework uses video-conditioned masked modeling and a cross-modality learning scheme that combines three priors: a trajectory-aware motion prior trained on motion capture data, a pose prior trained on image-pose data, and a video-conditioned masked transformer finetuned on video-motion data. Result: On the EgoBody and RICH datasets, MoRo substantially outperforms state-of-the-art methods, with better accuracy and motion realism under occlusion and comparable performance without occlusion; it runs in real time at 70 FPS on a single H200 GPU. Conclusion: By combining masked modeling with cross-modality priors, MoRo achieves efficient, robust, end-to-end human motion reconstruction and offers a new paradigm for real-world applications. Abstract: Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.

[104] SAMTok: Representing Any Mask with Two Words

Yikang Zhou,Tao Zhang,Dengxian Gong,Yuanzheng Wu,Ye Tian,Haochen Wang,Haobo Yuan,Jiacong Wang,Lu Qi,Hao Fei,Anran Wang,Zhuochen Wang,Yujing Wang,Cheng Chen,Shunping Ji,Xiangtai Li

Main category: cs.CV

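TL;DR: This paper proposes SAMTok, a discrete mask tokenizer that converts a region mask into two special tokens, allowing base multimodal LLMs (such as QwenVL) to acquire pixel-level understanding through standard language modeling and simple reinforcement learning, with no architectural changes.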

Details

Motivation: Existing pixel-wise multimodal LLMs are hard to scale because of complex region encoders, specialized segmentation decoders, and incompatible training objectives. Method: SAMTok builds on SAM2 and uses a mask encoder and a residual vector quantizer to encode any mask into two discrete, compact, information-rich tokens; it is pretrained on 209 million masks, 5 million SAMTok-formatted data samples are constructed, and a textual answer-matching reward enables efficient reinforcement learning. Result: QwenVL-SAMTok reaches state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring expression segmentation, scene graph parsing, and multi-round interactive segmentation, with substantial gains on the GRES and GCG benchmarks. Conclusion: SAMTok offers a scalable, simple, and effective paradigm that gives multimodal LLMs strong pixel-wise capabilities without complex modifications. Abstract: Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
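
To make the "two tokens per mask" idea concrete, here is a toy Python sketch of generic two-stage residual vector quantization: the first codebook quantizes the embedding, the second quantizes what is left over, and the two indices act as the two tokens. The tiny codebooks, the 2-D embedding, and the names `rvq_encode` and `rvq_decode` are illustrative assumptions; SAMTok's actual encoder, codebook sizes, and training procedure are not given in the abstract.

```python
from typing import List

Vector = List[float]


def nearest(codebook: List[Vector], vec: Vector) -> int:
    """Index of the codebook entry closest to `vec` (squared L2 distance)."""
    distances = [sum((c - v) ** 2 for c, v in zip(code, vec)) for code in codebook]
    return distances.index(min(distances))


def rvq_encode(vec: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Quantize `vec` stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens: List[int], codebooks: List[List[Vector]]) -> Vector:
    """Reconstruct the embedding as the sum of the selected codebook entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse one and a fine one for the residual.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]

tokens = rvq_encode([1.2, 0.8], [coarse, fine])
print(tokens)                              # [1, 1]
print(rvq_decode(tokens, [coarse, fine]))  # [1.25, 0.75]
```

The second stage only has to cover the small residual left by the first, which is why two indices can describe an embedding far more precisely than a single codebook of the same size.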

[105] Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification

Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Quinn Ledingham,Lincoln Linlin Xu

Main category: cs.CV

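TL;DR: This paper proposes the CSSMamba framework, which improves hyperspectral image classification through a clustering-guided spatial-spectral Mamba architecture and an attention-driven token selection mechanism.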

Details

Motivation: Existing Mamba models for hyperspectral image classification face a key challenge in defining efficient and adaptive token sequences. Method: The CSSMamba framework comprises: 1) a clustering-guided spatial Mamba module (CSpaMamba) that shortens the sequence length and strengthens feature learning; 2) a spectral Mamba module (SpeMamba) combined with it to form a complete spatial-spectral architecture; 3) an attention-driven token selection mechanism that optimizes the token sequence; and 4) a learnable clustering module for adaptive cluster integration. Result: On the Pavia University, Indian Pines, and Liao-Ning 01 datasets, CSSMamba outperforms current mainstream CNN, Transformer, and Mamba methods in classification accuracy and boundary preservation. Conclusion: By combining clustering, attention, and spatial-spectral modeling, CSSMamba improves the efficiency and performance of Mamba on hyperspectral image classification. Abstract: Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to better address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate the clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves Mamba feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to integrate clustering into the Mamba model coherently, we design a Learnable Clustering Module that learns the cluster memberships in an adaptive manner. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.

[106] Learning to Watermark in the Latent Space of Generative Models

Sylvestre-Alvise Rebuffi,Tuan Tran,Valeriu Lacatusu,Pierre Fernandez,Tomáš Souček,Nikola Jovanović,Tom Sander,Hady Elsahar,Alexandre Mourachko

Main category: cs.CV

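TL;DR: This paper proposes DistSeal, a unified method for embedding watermarks in the latent space of generative models: a post-hoc watermarking model is trained in latent space and then distilled into the generative model or the latent decoder, yielding efficient, robust, and imperceptible watermarks.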

Details

Motivation: Existing watermarking methods for AI-generated images mostly rely on post-hoc processing in pixel space, which adds computational overhead and can introduce visual artifacts. Method: DistSeal, a latent-space watermarking method, trains post-hoc watermarking models in the latent space of diffusion and autoregressive models and distills them into the generative model or the latent decoder to achieve in-model watermarking. Result: Latent watermarks match pixel-space methods in imperceptibility, are competitive in robustness, and run up to 20x faster; distilling latent watermarkers works better than distilling pixel-space ones. Conclusion: Latent-space watermarking is a more efficient and more robust scheme for watermarking AI-generated images, and DistSeal provides a unified, practical watermarking framework across generative model architectures. Abstract: Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.

[107] ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Remy Sabathier,David Novotny,Niloy J. Mitra,Tom Monnier

Main category: cs.CV

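TL;DR: ActionMesh is a new generative model that extends 3D diffusion models with a temporal axis to quickly generate high-quality, topology-consistent, rig-free animated 3D meshes from inputs such as monocular video, text, or a 3D mesh.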

Details

Motivation: Existing methods for generating animated 3D objects are limited by complex setup, long runtime, or insufficient quality, which makes them hard to use in practice. Method: The "temporal 3D diffusion" framework: 1) a 3D diffusion model is modified to generate temporally synchronized latent representations of 3D shapes; 2) a temporal 3D autoencoder maps a sequence of independent shapes to a sequence of deformations of a reference shape, from which the animation is built. Result: The model reaches state-of-the-art geometric accuracy and temporal consistency on standard benchmarks such as Consistent4D and Objaverse; generation is fast, and the results are rig-free and topology consistent, supporting efficient texturing and motion retargeting. Conclusion: ActionMesh delivers high-quality, efficient, and easily integrated animated 3D mesh generation, bringing generative 4D content closer to practical use. Abstract: Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dub "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.

[108] HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval

Zequn Xie,Xin Liu,Boyun Zhang,Yuxiao Lin,Sihang Cai,Tao Jin

Main category: cs.CV

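TL;DR: This paper proposes HVD, a text-video retrieval model inspired by human vision, which improves retrieval through a coarse-to-fine alignment mechanism (a Frame Features Selection Module, FFSM, and a Patch Features Compression Module, PFCM) and reaches state-of-the-art results on five benchmarks.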

Details

Motivation: Existing text-video retrieval methods suffer from "blind" feature interaction: because textual queries are sparse, the model struggles to pick out key visual information from background noise. Method: The Human Vision-Driven (HVD) model has two modules: the Frame Features Selection Module (FFSM) selects key frames to remove temporal redundancy, and the Patch Features Compression Module (PFCM) uses an advanced attention mechanism to aggregate patch features into salient visual entities, enabling fine-grained entity-level matching. Result: HVD achieves state-of-the-art performance on five mainstream text-video retrieval benchmarks while exhibiting human-like visual focus. Conclusion: By imitating human macro- and micro-level visual perception, HVD mitigates the feature-interaction blind spots caused by sparse text and supports the value of cognitively inspired modeling for cross-modal retrieval. Abstract: The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
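
One plausible reading of the key-frame selection step is sketched below in Python: score each frame feature against the text feature with cosine similarity and keep the top-k frames in temporal order. The abstract does not specify FFSM's scoring rule, so the text-conditioned cosine score and the names `select_key_frames` and `k` are assumptions made purely for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def select_key_frames(frame_feats: List[Vector], text_feat: Vector, k: int) -> List[int]:
    """Indices of the k frames most similar to the text, in temporal order."""
    scores = [cosine(f, text_feat) for f in frame_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy features: frames 0 and 2 point roughly in the text direction.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
print(select_key_frames(frames, text, k=2))  # [0, 2]
```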

[109] 360Anything: Geometry-Free Lifting of Images and Videos to 360°

Ziyi Wu,Daniel Watson,Andrea Tagliasacchi,David J. Fleet,Marcus A. Brubaker,Saurabh Saxena

Main category: cs.CV

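TL;DR: This paper proposes 360Anything, a data-driven method based on diffusion transformers that generates 360° panoramas from perspective images and videos without geometric priors or camera parameters, resolves the seam problem at ERP boundaries, and shows implicit geometric understanding.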

Details

Motivation: Existing methods rely on known camera parameters for explicit geometric alignment between the perspective and spherical projections, which makes them hard to apply to in-the-wild data with unknown or noisy camera information. Method: The 360Anything framework treats both the perspective input and the ERP target as token sequences and learns the mapping in a purely data-driven way on top of a pre-trained diffusion transformer; Circular Latent Encoding is introduced to remove the ERP boundary seams caused by zero-padding in the VAE. Result: The method reaches state-of-the-art results on perspective-to-360° generation for both images and videos, outperforming prior methods that use ground-truth camera parameters, and is competitive on zero-shot FoV and orientation estimation benchmarks. Conclusion: 360Anything removes the dependence on camera calibration and models the perspective-to-panorama mapping purely by learning, combining high-quality generation with implicit geometric understanding and broadening the scope of 360° content generation and geometric reasoning. Abstract: Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, hindering application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
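
The seam issue attributed to zero-padding can be illustrated with a small Python sketch that contrasts zero padding with wrap-around (circular) padding along the width of an equirectangular image, whose left and right edges are neighbours on the sphere. This shows only the general padding idea; how Circular Latent Encoding is actually implemented is not described in the abstract, and the helper `pad_width` is hypothetical.

```python
from typing import List

Row = List[float]


def pad_width(rows: List[Row], pad: int, mode: str) -> List[Row]:
    """Pad each row on the left and right by `pad` columns.

    mode="zero" inserts zeros (what a standard convolution would see);
    mode="circular" wraps columns around, so the left border sees the
    right-most columns and vice versa.
    """
    padded = []
    for row in rows:
        if not 0 < pad <= len(row):
            raise ValueError("pad must be between 1 and the row width")
        if mode == "zero":
            left, right = [0.0] * pad, [0.0] * pad
        elif mode == "circular":
            left, right = row[-pad:], row[:pad]
        else:
            raise ValueError("mode must be 'zero' or 'circular'")
        padded.append(left + row + right)
    return padded


# One row of a toy panorama: column 4.0 is adjacent to column 1.0 on the sphere.
panorama = [[1.0, 2.0, 3.0, 4.0]]
print(pad_width(panorama, 1, "zero"))      # [[0.0, 1.0, 2.0, 3.0, 4.0, 0.0]]
print(pad_width(panorama, 1, "circular"))  # [[4.0, 1.0, 2.0, 3.0, 4.0, 1.0]]
```

With zero padding, features near the two borders are computed against artificial zeros, so the two sides of the wrap-around boundary can disagree; circular padding gives both sides the same true neighbours.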

[110] Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Shengbang Tong,Boyang Zheng,Ziteng Wang,Bingda Tang,Nanye Ma,Ellis Brown,Jihan Yang,Rob Fergus,Yann LeCun,Saining Xie

Main category: cs.CV

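TL;DR: This paper studies how Representation Autoencoders (RAEs) scale to large-scale text-to-image (T2I) generation and finds that RAEs outperform VAEs in both pretraining and finetuning, converging faster, producing higher-quality generations, and supporting unified multimodal representation and reasoning.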

Details

Motivation: To determine whether the RAE framework can scale from ImageNet to large-scale, freeform text-to-image generation, and to verify its effectiveness and robustness across data, architecture, and training scales. Method: RAE decoders are scaled on top of a frozen SigLIP-2 encoder and trained on web, synthetic, and text-rendering data; key RAE design choices (noise scheduling, diffusion head width, noise-augmented decoding) are systematically evaluated; controlled comparisons against the FLUX VAE are run from 0.5B to 9.8B parameters, covering pretraining and finetuning on high-quality data. Result: RAEs outperform VAEs in pretraining at every model scale; VAE-based models overfit catastrophically after 64 epochs of finetuning, whereas RAE models train stably through 256 epochs with better performance; RAEs converge faster and generate higher-quality images; the shared representation space supports joint reasoning over visual understanding and generation. Conclusion: RAEs are a simpler and more robust foundation than VAEs for large-scale T2I generation, with better scalability, stability, and potential for multimodal unification. Abstract: Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.

[111] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Onkar Susladkar,Tushar Prakash,Adheesh Juvekar,Kiet A. Nguyen,Dong-Hwan Jang,Inderjit S Dhillon,Ismini Lourentzou

Main category: cs.CV

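TL;DR: This paper proposes PyraTok, a language-aligned pyramidal video tokenizer that uses multi-scale discretization and joint text-guided quantization to substantially improve video reconstruction, text-to-video generation, and zero-shot video understanding.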

Details

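Motivation: Existing discrete video VAE tokenizers typically learn a visual codebook with a limited vocabulary at a single scale and under weak language supervision, which leads to poor cross-modal alignment and weak zero-shot transfer. Method: PyraTok builds on a pretrained video VAE and adds a Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several spatiotemporal resolutions with a shared large binary codebook, jointly optimizing multi-scale text-guided quantization and a hierarchical autoregressive objective. Result: PyraTok achieves state-of-the-art video reconstruction across ten benchmarks, consistently improves text-to-video generation quality, sets zero-shot state-of-the-art results on video segmentation, temporal action localization, and video understanding, and scales robustly to 4K/8K resolutions. Conclusion: Through semantically structured, multi-scale, language-aligned discrete representations, PyraTok addresses weak cross-modal alignment and generalization in video tokenization and provides a stronger building block for video generation and understanding. Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.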
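
As a rough illustration of quantizing features at several depths with one shared binary codebook, the Python sketch below binarizes each feature vector by sign and reads the bits as a token id, applying the same rule at every pyramid level. Sign-based binarization is one common way to realise a binary codebook and is an assumption here; the abstract does not describe LaPQ's actual quantizer, and `binary_quantize` and `tokenize_pyramid` are hypothetical names.

```python
from typing import List, Tuple

Vector = List[float]


def binary_quantize(features: Vector) -> Tuple[int, Vector]:
    """Map a feature vector to (token id, binary code in {-1, +1}).

    Each dimension contributes one bit (1 if positive, else 0); the bits,
    read most-significant first, form the token id. With d dimensions the
    implicit codebook has 2**d entries and needs no lookup table.
    """
    bits = [1 if x > 0 else 0 for x in features]
    token_id = 0
    for bit in bits:
        token_id = (token_id << 1) | bit
    code = [1.0 if bit else -1.0 for bit in bits]
    return token_id, code


def tokenize_pyramid(levels: List[List[Vector]]) -> List[List[int]]:
    """Apply the same (shared) binary quantizer to every pyramid level."""
    return [[binary_quantize(vec)[0] for vec in level] for level in levels]


# Toy pyramid: one coarse feature, two finer features, each 3-dimensional.
levels = [
    [[0.3, -1.2, 0.7]],
    [[-0.1, 0.2, 0.5], [0.9, 0.9, -0.4]],
]
print(tokenize_pyramid(levels))  # [[5], [3, 6]]
```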

[112] Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Geo Ahn,Inwoong Lee,Taeoh Kim,Minho Shim,Dongyoon Wee,Jinwoo Choi

Main category: cs.CV

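TL;DR: This paper studies Compositional Video Understanding (CVU) and finds that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail because of object-driven verb shortcuts; it proposes the RCORE framework, which mitigates the problem with composition-aware augmentation and a temporal order regularization loss and significantly improves recognition of unseen compositions on several benchmarks.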

Details

Motivation: Existing ZS-CAR models generalize poorly to unseen verb-object compositions, mainly because they rely on object co-occurrence statistics instead of learning the visual semantics of verbs, an overlooked failure mode termed "object-driven verb shortcuts". Method: The RCORE framework introduces (i) composition-aware data augmentation that enriches verb-object combinations while preserving motion cues, and (ii) a temporal order regularization loss that explicitly models temporal structure to suppress shortcut behavior. Result: On two benchmarks, Sth-com and the newly constructed EK100-com, RCORE significantly improves accuracy on unseen compositions, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Conclusion: Object-driven shortcuts are a key bottleneck for ZS-CAR, and robust compositional video understanding requires suppressing them explicitly. Abstract: We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.

[113] CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback

Wenhang Ge,Guibao Shen,Jiawei Feng,Luozhou Wang,Hao Lu,Xingye Tian,Xin Tao,Ying-Cong Chen

Main category: cs.CV

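TL;DR: This paper proposes CamPilot, which improves the camera controllability of video diffusion models by introducing a camera-aware 3D decoder and a reward mechanism based on geometric consistency.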

Details

Motivation: Existing camera-controlled video diffusion models remain limited in camera alignment, and conventional Reward Feedback Learning (ReFL) methods have several problems: reward models cannot assess video-camera alignment, RGB decoding is expensive, and 3D geometric information is ignored. Method: An efficient camera-aware 3D decoder jointly decodes the video latent and the camera pose into 3D Gaussians; the camera pose also serves as a projection parameter, so misalignment produces geometric distortion and blurry renderings; pixel-level consistency between rendered novel views and ground-truth images is used as an explicit reward, with visibility-weighted supervision restricted to deterministic regions identified by geometric warping. Result: Experiments on the RealEstate10K and WorldScore benchmarks confirm the method's effectiveness, with markedly better video-camera alignment and controllability. Conclusion: By integrating camera pose deeply into 3D decoding and reward modeling, CamPilot achieves more efficient, geometrically consistent camera-controlled video generation and offers a new paradigm for controllability in video diffusion models. Abstract: Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantization. Specifically, the video latent and the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature of generation, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: https://a-bigbao.github.io/CamPilot/
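
The reward described here, pixel-level consistency restricted to visible (deterministic) regions, can be sketched in a few lines of Python. The L1 error, the normalisation by the visibility sum, and the name `visibility_weighted_reward` are assumptions for illustration; the abstract does not give the exact form of the reward or of the visibility term.

```python
from typing import Sequence


def visibility_weighted_reward(
    rendered: Sequence[float],
    target: Sequence[float],
    visibility: Sequence[float],
) -> float:
    """Negative visibility-weighted mean absolute error between two views.

    `visibility` holds per-pixel weights in [0, 1]; pixels with weight 0
    (for example regions not covered by geometric warping) are ignored.
    A higher (less negative) reward means better video-camera alignment.
    """
    if not (len(rendered) == len(target) == len(visibility)):
        raise ValueError("all inputs must have the same number of pixels")
    weight = sum(visibility)
    if weight == 0:
        return 0.0  # nothing visible: no supervision signal
    error = sum(v * abs(r - t) for r, t, v in zip(rendered, target, visibility))
    return -error / weight


# Toy flattened views: the last pixel differs a lot but is marked invisible.
rendered = [0.5, 0.8, 0.1]
target = [0.5, 0.6, 0.9]
visibility = [1.0, 1.0, 0.0]
print(visibility_weighted_reward(rendered, target, visibility))  # about -0.1
```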

Table of Contents

cs.CL [Back]

[1] Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration

[2] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

[3] Embedding Retrofitting: Data Engineering for better RAG

[4] MALTopic: Multi-Agent LLM Topic Modeling Framework

[5] Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis

[6] Can We Trust LLM Detectors?

[7] ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation

[8] RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models

[9] No Reliable Evidence of Self-Reported Sentience in Small Large Language Models

[10] From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs

[11] Memorization Dynamics in Knowledge Distillation for Language Models

[12] Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind

[13] Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs

[14] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering

[15] Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

[16] Multi-Persona Thinking for Bias Mitigation in Large Language Models

[17] ViT Registers and Fractal ViT

[18] Computational Representations of Character Significance in Novels

[19] AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains

[20] Common to Whom? Regional Cultural Commonsense and LLM Bias in India

[21] From Generation to Collaboration: Using LLMs to Edit for Empathy in Healthcare

[22] YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models

[23] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

[24] ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms

[25] Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation

[26] What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking

[27] Persona Switch: Mixing Distinct Perspectives in Decoding Time

[28] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

[29] Hallucination Mitigating for Medical Report Generation

[30] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs

[31] HumanLLM: Towards Personalized Understanding and Simulation of Human Nature

[32] SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics

[33] ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection

[34] Can professional translators identify machine-generated text?

[35] Determinants of Training Corpus Size for Clinical Text Classification

[36] Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers

[37] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

[38] Transfer Learning from ImageNet for MEG-Based Decoding of Imagined Speech

[39] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

[40] Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction

[41] Adapter Fusion for Multilingual Text2Cypher with Linear and Learned Gating

[42] synthocr-gen: A synthetic ocr dataset generator for low-resource languages- breaking the data barrier

[43] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging

[44] Automatic Classification of Arabic Literature into Historical Eras

[45] LLM-in-Sandbox Elicits General Agentic Intelligence

cs.CV

[46] AI-Based Culvert-Sewer Inspection

[47] Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition

[48] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

[49] DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction

[50] DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection

[51] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events

[52] Hybrid Vision Transformer_GAN Attribute Neutralizer for Mitigating Bias in Chest X_Ray Diagnosis

[53] Controllable Layered Image Generation for Real-World Editing

[54] DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views

[55] VIOLA: Towards Video In-Context Learning with Minimal Annotations

[56] Relative Classification Accuracy: A Calibrated Metric for Identity Consistency in Fine-Grained K-pop Face Generation

[57] Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization for Cross-Subject EEG Emotion Recognition

[58] Explainable Deepfake Detection with RL Enhanced Self-Blended Images

[59] Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception

[60] SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based Occupancy Prediction

[61] Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams

[62] Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling

[63] Consistency-Regularized GAN for Few-Shot SAR Target Recognition

[64] Performance-guided Reinforced Active Learning for Object Detection

[65] Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

[66] Enhanced LULC Segmentation via Lightweight Model Refinements on ALOS-2 SAR Data

[67] Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework

[68] VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

[69] FAIR-ESI: Feature Adaptive Importance Refinement for Electrophysiological Source Imaging

[70] Sub-Region-Aware Modality Fusion and Adaptive Prompting for Multi-Modal Brain Tumor Segmentation

[71] Breaking the Resolution Barrier: Arbitrary-resolution Deep Image Steganography Framework

[72] White-Box mHC: Electromagnetic Spectrum-Aware and Interpretable Stream Interactions for Hyperspectral Image Classification

[73] Atlas-Assisted Segment Anything Model for Fetal Brain MRI (FeTal-SAM)

[74] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps

[75] LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting

[76] Diffusion Model-Based Data Augmentation for Enhanced Neuron Segmentation

[77] Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video