cs.CL

[1] A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications

Eric Jeangirard

Main category: cs.CL

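TL;DR: Presents a dataset of 833k paragraphs in four categories (acknowledgments, data mentions, software/code mentions, and clinical trial mentions) for training text classification and named entity recognition models for scientific literature mining.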

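Motivation: Automatically extracting key information from scientific literature, and advancing open science, requires a large, multilingual, well-annotated dataset for training and evaluating text-mining models.

Method: Paragraphs are extracted from CC-BY licensed scientific publications in the French Open Science Monitor corpus, processed with GROBID, tagged with language using fastText and with scientific domain using OpenAlex, and assembled into a dataset labeled with four categories.

Result: The dataset contains 833,000 paragraphs, mainly in English and French with several other European languages represented. Each paragraph is annotated with language and scientific domain, the classification accuracy is high, and the data suits text classification and named entity recognition tasks.

Conclusion: The dataset supports the extraction of acknowledgment, data, software, and clinical trial information from scientific literature, and helps advance research on transparency and reproducibility.

Abstract: We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with language identification (using fastText) and scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace https://doi.org/10.57967/hf/6679 under a CC-BY license.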

[2] Policy Optimization Prefers The Path of Least Resistance

Debdeep Sanyal,Aakash Sen Sharma,Dhruv Kumar,Saurabh Deshpande,Murari Mandal

Main category: cs.CL

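TL;DR: In an open-ended chain-of-thought structure, policy optimization follows the "path of least resistance": it drops complex reasoning and generates answers directly, and this format collapse is hard to avoid even when higher reward weights are assigned.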

Motivation: To study how policy optimization algorithms behave in open-ended reasoning structures once strict chain-of-thought format constraints are relaxed, and in particular whether they retain complex reasoning.

Method: A series of controlled experiments analyzes policy optimization across models and algorithms, uses reward decomposition to probe its optimization priorities, and tests how KL regularization affects policy change.

Result: Policy optimization always prefers the simplest reward path, so the reasoning process is discarded and the output degenerates to an answer-only format. The effect persists across multiple models and under high reward weights, and converging on the high-reward shortcut requires sufficient policy freedom.

Conclusion: Giving the policy freedom to explore is a double-edged sword: it helps discover high-reward paths but also invites reward gaming, which highlights the challenge of preventing reward hacking during alignment.

Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: policy optimization consistently follows the path of least resistance. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct answer-only format. This outcome holds true across various models and algorithms. We find that this collapse in format is persistent even when the more complex format is assigned up to 4x larger reward weights. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that successful convergence on the high-reward shortcut is not a low-effort drift but is driven by the optimization process that requires the KL-regularized policy to have sufficient freedom to make a significant shift from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical challenge for reward hacking under alignment.

[3] Language Ranker: A Lightweight Ranking framework for LLM Decoding

Chenheng Zhang,Tianqi Du,Jizhe Zhang,Mingqing Xiao,Yifei Wang,Yisen Wang,Zhouchen Lin

Main category: cs.CL

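TL;DR: Proposes Language Ranker, a framework that borrows ideas from recommender systems to optimize the decoding stage of LLM generation, matching large-scale reward models at a much lower computational cost.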

Motivation: Traditional decoding methods and reward models show problems such as redundancy in LLM generation, and existing reward-model-based methods are computationally expensive and of limited applicability, so a more efficient, lightweight way to optimize decoding is needed.

Method: The decoding process is treated as analogous to the ranking stage of a recommender system, and a lightweight module reranks candidate responses using features extracted by the base model.

Result: Experiments on a variety of tasks show that Language Ranker matches large-scale reward models with fewer than 0.5M additional parameters, while significantly reducing computational overhead in both training and inference.

Conclusion: Language Ranker is efficient and effective, helps unlock the capabilities of LLMs, and offers a new direction for optimizing the decoding process.

Abstract: Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distributions into final responses. Recent advances, such as scaling the computation of inference time with reward models, have underscored the importance of decoding, but these methods often suffer from high computational costs and limited applicability. In this paper, we revisit LLM generation through the lens of recommender systems, conceptualizing the decoding process as analogous to the ranking stage in recommendation pipelines. From this perspective, we observe that both traditional decoding methods and reward models exhibit clear limitations such as redundancy. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses using features extracted by the base model. Experiments across a wide range of tasks show that Language Ranker achieves performance comparable to large-scale reward models, while requiring fewer than 0.5M additional parameters, significantly reducing the computational overhead during both training and inference stages. This highlights the efficiency and effectiveness of our method, showcasing its potential to fully unlock the capabilities of LLMs.

[4] Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks

Avinash Patil

Main category: cs.CL

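TL;DR: Proposes RACE, a framework for evaluating how well LLM-generated explanations align with feature importance derived from logistic regression, and reveals an asymmetry in the coverage of supporting and contradicting features between correct and incorrect predictions.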

Motivation: As machine learning spreads into sensitive domains, demand for explainable AI grows, yet it is unclear whether LLM-generated rationales faithfully reflect the basis of their decisions.

Method: RACE combines token-aware, exact-string, and edit-distance matching to compare LLM-generated explanations with key lexical features extracted by logistic regression on four text classification datasets.

Result: Correct predictions more often cover supporting features, while incorrect predictions include more contradicting features. Edit-distance matching uncovers paraphrastic overlaps, raising coverage while preserving this asymmetry.

Conclusion: LLM explanations combine surface-level and flexible evidence reuse, but they may amplify misleading cues when the model errs. RACE provides a quantitative basis for evaluating reasoning completeness in neural language models.

Abstract: The growing adoption of machine learning (ML) in sensitive domains has heightened the demand for transparent and interpretable artificial intelligence. Large Language Models (LLMs) are increasingly capable of producing natural language explanations, yet it remains unclear whether these rationales faithfully capture the predictive signals that underlie decisions. This paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework to evaluate the alignment between LLM-generated explanations and interpretable feature importance scores derived from a logistic regression baseline. We analyze four widely used text classification datasets (WIKI ONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS) and compare LLM rationales against top-ranked supporting and contradicting lexical features. To capture alignment at multiple levels of granularity, RACE implements token-aware, exact string, and edit-distance matching techniques. Empirical results reveal a consistent asymmetry: correct predictions exhibit higher coverage of supporting features, while incorrect predictions are associated with elevated coverage of contradicting features. Edit-distance matching further uncovers paraphrastic overlaps, boosting coverage while preserving this asymmetry. These findings demonstrate that LLM rationales combine both surface-level and flexible evidence reuse, yet can also amplify misleading cues in error cases. RACE provides new insights into the faithfulness of LLM explanations and establishes a quantitative basis for evaluating reasoning completeness in neural language models.
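The three matching levels named in the abstract can be illustrated with a small coverage function. This is a minimal sketch, not the RACE implementation: the function names, the substring interpretation of "exact string", the edit-distance threshold, and the example sentence and feature list are all assumptions made for illustration.

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def tokens(text):
    """Lowercased word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def coverage(features, rationale, mode="exact", max_dist=1):
    """Fraction of features found in the rationale under one matching mode."""
    toks = tokens(rationale)
    hits = 0
    for feat in features:
        f = feat.lower()
        if mode == "exact":      # assumed here to mean substring match
            matched = f in rationale.lower()
        elif mode == "token":    # whole-token match
            matched = f in toks
        elif mode == "edit":     # near match within max_dist edits
            matched = any(levenshtein(f, t) <= max_dist for t in toks)
        else:
            raise ValueError(f"unknown mode: {mode}")
        hits += matched
    return hits / len(features) if features else 0.0

# Hypothetical rationale and feature list.
rationale = "The review calls the film wonderful and the acting superb."
features = ["superb", "act", "actings"]

exact_cov = coverage(features, rationale, "exact")  # 2/3: "act" matches inside "acting"
token_cov = coverage(features, rationale, "token")  # 1/3: only "superb" is a whole token
edit_cov = coverage(features, rationale, "edit")    # 2/3: "actings" is one edit from "acting"
```

Running the same function separately on supporting and contradicting feature lists, split by whether the prediction was correct, would give the kind of coverage comparison the abstract describes.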

[5] Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning

Anh Pham,Mihir Thalanki,Michael Sun,Aditya Chaloo,Ankita Gupta,Tian Xia,Aditya Mate,Ehimwenma Nosakhare,Soundararajan Srinivasan

Main category: cs.CL

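TL;DR: Proposes a behavior-aware sampling framework that mitigates catastrophic forgetting during LLM fine-tuning by selecting safety examples according to instruction-response behavior and semantic diversity, reducing harmfulness by up to 41% with only 0.5% additional data.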

Motivation: To address the loss of previously aligned safety behavior caused by catastrophic forgetting when large language models are fine-tuned, and to determine which safety examples are most effective.

Method: A behavior-aware sampling framework selects safety examples by combining instruction-response behavior (such as refusal or compliance) with semantic diversity across harm categories.

Result: The method substantially reduces harmful outputs, lowering harmfulness by up to 41% with only 0.5% additional training data, while preserving the model's helpfulness.

Conclusion: Targeted data selection can improve both safety and efficiency in large-scale fine-tuning.

Abstract: Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
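The abstract names two selection factors but not the selection rule. One plausible reading, sketched below purely as an assumption, is a round-robin draw over (behavior, harm category) buckets so that a small budget is spread across both factors instead of being sampled at random. The field names `behavior` and `category` and the example pool are hypothetical.

```python
from collections import defaultdict

def select_safety_examples(pool, budget):
    """Pick up to `budget` examples, cycling over (behavior, category) buckets."""
    buckets = defaultdict(list)
    for example in pool:
        buckets[(example["behavior"], example["category"])].append(example)

    selected = []
    keys = sorted(buckets)  # deterministic bucket order
    while len(selected) < budget and any(buckets[k] for k in keys):
        for key in keys:
            if buckets[key] and len(selected) < budget:
                selected.append(buckets[key].pop(0))
    return selected

# Hypothetical pool: one bucket is heavily over-represented.
pool = [
    {"behavior": "refusal", "category": "violence", "text": "example 1"},
    {"behavior": "refusal", "category": "violence", "text": "example 2"},
    {"behavior": "refusal", "category": "violence", "text": "example 3"},
    {"behavior": "refusal", "category": "violence", "text": "example 4"},
    {"behavior": "refusal", "category": "fraud", "text": "example 5"},
    {"behavior": "compliance", "category": "medical", "text": "example 6"},
]

chosen = select_safety_examples(pool, budget=3)
chosen_categories = {example["category"] for example in chosen}
```

With a budget of three, a uniform random draw would most likely return mostly "violence" refusals, whereas the round-robin draw returns one example per bucket.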

[6] Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation

Dhrupad Bhardwaj,Julia Kempe,Tim G. J. Rudner

Main category: cs.CL

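TL;DR: Proposes semantic isotropy as a new way to assess the trustworthiness of long-form LLM output. The angular dispersion of embeddings on the unit sphere predicts nonfactuality without labeled data, fine-tuning, or hyperparameter selection, which makes the approach cheap and practical.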

Motivation: A reliable, computationally cheap method is needed to assess the trustworthiness of long-form LLM responses to open-ended prompts, whereas existing claim-by-claim fact-checking approaches are expensive and brittle.

Method: The paper introduces semantic isotropy: several long-form responses are generated and embedded on the unit sphere, and the angular dispersion of their normalized text embeddings measures the level of semantic isotropy, which is then used to assess the factual consistency of the responses.

Result: Higher semantic isotropy (greater embedding dispersion) reliably indicates lower factual consistency, and across multiple domains the method outperforms existing approaches at predicting nonfactuality in long-form text using only a few samples.

Conclusion: The method needs no labeled data, fine-tuning, or hyperparameter selection, works with open- or closed-weight embedding models, and offers a practical, low-cost trust assessment for real-world LLM workflows.

Abstract: To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy (the degree of uniformity across normalized text embeddings on the unit sphere) and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy, that is, greater embedding dispersion, reliably signals lower factual consistency across samples. Our approach requires no labeled data, no fine-tuning, and no hyperparameter selection, and can be used with open- or closed-weight embedding models. Across multiple domains, our method consistently outperforms existing approaches in predicting nonfactuality in long-form responses using only a handful of samples, offering a practical, low-cost approach for integrating trust assessment into real-world LLM workflows.
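The abstract describes the score as the angular dispersion of normalized embeddings but does not give the estimator. The sketch below substitutes a standard directional-statistics quantity, one minus the length of the mean unit vector, as an assumed stand-in. The toy 2-D vectors replace real text embeddings, and all vectors are assumed to be non-zero.

```python
import math

def normalize(vector):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def angular_dispersion(embeddings):
    """1 - |mean unit vector|: 0 when all directions agree, near 1 when they spread out."""
    units = [normalize(v) for v in embeddings]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return 1.0 - math.sqrt(sum(m * m for m in mean))

# Toy stand-ins for embeddings of several sampled responses to one prompt.
consistent = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]                # same direction
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # evenly spread

low = angular_dispersion(consistent)   # 0.0
high = angular_dispersion(scattered)   # 1.0
```

Under the paper's finding, a prompt whose sampled responses score like `scattered` would be flagged as more likely to contain nonfactual content than one whose responses score like `consistent`.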

[7] Understanding Network Behaviors through Natural Language Question-Answering

Mingzhe Xing,Chang Tian,Jianan Zhang,Lichen Pan,Peipei Liu,Zhaoteng Yan,Yinliang Yue

Main category: cs.CL

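TL;DR: Proposes NetMind, a framework for querying networks in natural language that uses tree-based configuration chunking, a unified fact graph, and a hybrid imperative-declarative language to understand large-scale network behavior accurately and scalably.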

Motivation: Existing approaches based on domain-specific languages have a steep learning curve and limited flexibility, while a natural language interface is easier to use and interpret. Combining it with the knowledge and reasoning abilities of LLMs promises better network behavior understanding, but long contexts, device heterogeneity, and complex reasoning remain challenges.

Method: The NetMind framework (1) chunks configuration files along a tree structure in a way that preserves semantics, (2) builds a unified fact graph to normalize configurations from different vendors, (3) designs a hybrid imperative-declarative language to reduce the reasoning burden on LLMs, and (4) builds a benchmark of natural language question-answer pairs with network configurations.

Result: Experiments show that NetMind outperforms existing baselines in accuracy and scalability, and handles network behavior understanding under long contexts, heterogeneous devices, and complex protocols.

Conclusion: Through structured intermediate representations and language design, NetMind significantly improves LLMs' ability to understand behavior in complex network environments and offers a viable path toward natural-language-driven network management.

Abstract: Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, they suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files challenge LLMs' long-context understanding ability; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand advanced reasoning abilities beyond the current capabilities of LLMs. To tackle the above challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy to preserve semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark consisting of NL question-answer pairs paired with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.
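The idea behind tree-based chunking, keeping child lines together with their parent so that no chunk cuts a stanza in half, can be sketched on an indentation-structured configuration. This is a simplified two-level illustration under stated assumptions, not NetMind's algorithm: the function names, the line-count budget, and the router-style sample configuration are all hypothetical.

```python
def split_stanzas(config_text):
    """Group each top-level line with the indented lines that follow it."""
    stanzas = []
    for line in config_text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line[0].isspace() and stanzas:
            stanzas[-1].append(line)      # child line stays with its parent
        else:
            stanzas.append([line])        # new top-level stanza
    return stanzas

def chunk_config(config_text, max_lines):
    """Pack whole stanzas into chunks of at most max_lines lines.

    A stanza longer than max_lines becomes its own oversized chunk
    rather than being split.
    """
    chunks, current = [], []
    for stanza in split_stanzas(config_text):
        if current and len(current) + len(stanza) > max_lines:
            chunks.append("\n".join(current))
            current = []
        current.extend(stanza)
    if current:
        chunks.append("\n".join(current))
    return chunks

# Hypothetical router-style configuration.
config = """interface GigabitEthernet0/0
 ip address 10.0.0.1 255.255.255.0
 no shutdown
interface GigabitEthernet0/1
 ip address 10.0.1.1 255.255.255.0
router ospf 1
 network 10.0.0.0 0.0.255.255 area 0
"""

chunks = chunk_config(config, max_lines=4)  # two chunks: 3 lines, then 4 lines
```

A fixed-size splitter with the same budget would separate the second interface from its address line, which is the loss of semantic coherence that the tree-based strategy is meant to avoid.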

[8] Deep Literature Survey Automation with an Iterative Workflow

Hongbo Zhang,Han Cui,Yidong Wang,Yijian Tian,Qi Guo,Cunxiang Wang,Jian Wu,Chiyu Song,Yue Zhang

Main category: cs.CL

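TL;DR: Proposes an automatic literature survey framework based on iterative outline generation. By imitating how human researchers read, it retrieves, reads, and updates the survey structure step by step, and combines paper cards with a review-and-refine mechanism to substantially improve content coverage, structural coherence, and citation quality.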

Motivation: Existing automatic survey systems mostly use one-shot retrieval and static outline generation, which leads to noisy information, fragmented structure, and context overload, and lowers survey quality. A dynamic method closer to the iterative reading process of human researchers is needed.

Method: The framework uses recurrent outline generation, in which a planning agent incrementally retrieves papers and updates the outline. Paper cards distill each paper's contributions, methods, and findings, and a review-and-refine loop with visualization enhancement improves textual flow and integrates multimodal elements such as figures and tables.

Result: Experiments on established and emerging topics show that the framework substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, and produces more readable, better-organized surveys. The paper also introduces Survey-Arena, a pairwise benchmark for assessing machine-generated surveys against human-written ones more reliably.

Conclusion: By imitating the human iterative reading process, the framework generates high-quality, well-grounded literature surveys, and together with the Survey-Arena benchmark it provides a more effective framework and evaluation method for automatic survey generation.

Abstract: Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that our framework substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at https://github.com/HancCui/IterSurvey_Autosurveyv2.

[9] Explaining and Mitigating Crosslingual Tokenizer Inequities

Catherine Arnett,Tyler A. Chang,Stella Biderman,Benjamin K. Bergen

Main category: cs.CL

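TL;DR: Studies how token premiums differ across languages under monolingual tokenizers. Training about 7,000 monolingual tokenizers for 97 languages shows that vocabulary size and pre-tokenization significantly affect token premiums, while similarity between training and test data does not. Adjusting vocabulary size, or using superword tokenizers that merge across whitespace, significantly reduces crosslingual token premiums.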

Motivation: Languages differ in the number of tokens needed to encode them (the token premium), which affects training throughput and inference cost. Even with dataset size, vocabulary size, and content controlled, monolingual tokenizers show large differences in token premiums across languages, so the causes and possible mitigations need investigation.

Method: About 7,000 monolingual tokenizers covering 97 languages are trained while systematically varying tokenization algorithm, vocabulary size, and dataset size, and token premiums are measured for each language. The analysis covers data similarity, vocabulary size, pre-tokenization, and language features such as writing system and word length.

Result: Similarity between training and test data does not affect token premiums, but vocabulary size and pre-tokenization do. Setting an "optimal" vocabulary size for each language significantly reduces token premiums, and superword tokenizers that allow merges across whitespace reduce them further and improve compression.

Conclusion: Adjusting a tokenizer's vocabulary size or pre-tokenization strategy effectively mitigates crosslingual token premiums, and language-specific tokenization strategies are of practical value for multilingual models.

Abstract: The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. Having high token premiums leads to less throughput during training and increases costs at inference. In this paper, we show that even after controlling for dataset size, vocabulary size, and data content, monolingual tokenizers exhibit a wide range of token premiums across languages. To understand the cross-linguistic differences that cause these token premiums, we train a suite of approximately 7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. We measure token premiums and test for a relationship between factors such as data similarity (between tokenizer training and evaluation), vocabulary size, and pre-tokenization. We also investigate the role of language-specific features such as writing system and word length. We find that similarity between training and test data does not impact token premiums, but vocabulary size and pre-tokenization do. While simply increasing vocabulary size does not lead to reduced token premium effects, we can determine an "optimal" vocabulary size for each language to achieve significantly reduced token premium effects. We also train superword tokenizers which allow merges over whitespaces, and we find that they both reduce token premium effects and improve compression overall. Thus, intervening on the vocabulary size or the pre-tokenizer significantly reduces crosslingual token premium effects.
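The abstract describes token premiums as disparities in the number of tokens needed to encode parallel text. One simple way to make that concrete, assumed here because the paper's exact formula is not quoted, is the ratio of token counts against a reference language. The fixed-width chunker below is a toy stand-in for a trained tokenizer, and the sentence pair is a hypothetical example of parallel text.

```python
def chunk_tokenizer(text, width=2):
    """Toy tokenizer: split text into fixed-width character chunks."""
    return [text[i:i + width] for i in range(0, len(text), width)]

def token_premium(tokenize, text, reference_tokenize, reference_text):
    """Tokens needed for `text` divided by tokens for the parallel reference text."""
    return len(tokenize(text)) / len(reference_tokenize(reference_text))

# Hypothetical parallel sentences.
english = "the cat sleeps"       # 14 characters -> 7 chunks
german = "die Katze schläft"     # 17 characters -> 9 chunks

premium = token_premium(chunk_tokenizer, german, chunk_tokenizer, english)  # 9 / 7
```

A premium above 1 means the language needs more tokens than the reference for the same content. In the paper's setting each language would have its own trained tokenizer, which is why the function takes two tokenizer arguments.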

[10] Model-Aware Tokenizer Transfer

Mykola Haltiuk,Aleksander Smywiński-Pohl

Main category: cs.CL

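TL;DR: Proposes MATT, a model-aware tokenizer transfer method that introduces an Attention Influence Modeling (AIM) objective to distill inter-token communication patterns from a source model into a target model with a new tokenizer, enabling efficient and robust tokenizer transfer in multilingual LLMs.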

Motivation: Existing tokenizer transfer methods usually rely on semantic heuristics to initialize new embeddings and ignore higher-layer model dynamics, which limits transfer quality, especially for low-resource languages or languages with distinct scripts.

Method: Model-Aware Tokenizer Transfer (MATT) uses an Attention Influence Modeling (AIM) objective to extract inter-token attention interaction patterns from the source model and transfer them to a target model with a new tokenizer, as an efficient warm-up before language modeling.

Result: Experiments across diverse linguistic settings show that MATT recovers most of the original model's performance within a few hours, clearly outperforming heuristic baselines.

Conclusion: Bringing model-internal signals into tokenizer transfer is practical and effective, and supports more robust tokenizer transfer in multilingual LLMs.

Abstract: Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.

[11] A Stylometric Application of Large Language Models

Harrison F. Stropkay,Jiayi Chen,Mohammad J. Latifi,Daniel N. Rockmore,Jeremy R. Manning

Main category: cs.CL

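TL;DR: Shows that large language models such as GPT-2, trained from scratch on the works of a specific author, can distinguish the writing styles of different authors, and confirms R. P. Thompson's authorship of the 15th book of the Oz series.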

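Motivation: To explore whether large language models can capture and recognize the distinctive writing style of individual authors, and so be used for authorship attribution.

Method: GPT-2 models are trained from scratch on the works of a single author, and their predictive accuracy on held-out text from the same author and from other authors is compared using measures such as perplexity.

Result: A model trained on a specific author predicts that author's text significantly better than text by other authors, and the method supports R. P. Thompson as the true author of the 15th book of the Oz series.

Conclusion: Large language models can model the writing style of individual authors effectively and have potential for author identification and literary attribution studies.

Abstract: We show that large language models (LLMs) can be used to distinguish the writings of different authors. Specifically, an individual GPT-2 model, trained from scratch on the works of one author, will predict held-out text from that author more accurately than held-out text from other authors. We suggest that, in this way, a model trained on one author's works embodies the unique writing style of that author. We first demonstrate our approach on books written by eight different (known) authors. We also use this approach to confirm R. P. Thompson's authorship of the well-studied 15th book of the Oz series, originally attributed to F. L. Baum.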

[12] Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

Havva Alizadeh Noughabi,Julien Serbanescu,Fattane Zarrinkalam,Ali Dehghantanha

Main category: cs.CL

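TL;DR: Examines the use of persuasion theories from the social sciences to construct adversarial prompts that bypass the alignment constraints of large language models (LLMs). Experiments show that prompts with persuasive structures significantly break through safeguards, exposing the vulnerability of LLMs to persuasion-aware attacks and the importance of cross-disciplinary approaches to LLM safety.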

Motivation: Existing research pays little attention to the linguistic and psychological mechanisms that affect LLM susceptibility to jailbreak attacks. This paper investigates, from linguistic and psychological perspectives, how LLMs respond to persuasive strategies.

Method: Adversarial prompts with persuasive structures are designed from well-established persuasion theories in the social sciences and evaluated empirically on multiple aligned LLMs. The paper also analyzes whether LLMs show distinctive persuasive characteristics in their jailbreak responses.

Result: Prompts with persuasive structures significantly bypass LLM safeguards and induce jailbreak behavior, and different LLMs show identifiable persuasive fingerprints in their responses.

Conclusion: LLMs are sensitive to persuasive structures learned from human text and are vulnerable to attacks based on psychological mechanisms, which suggests that improving model safety requires cross-disciplinary perspectives, especially insights from linguistics and social psychology.

Abstract: Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs reveal that persuasion-aware prompts significantly bypass safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.

[13] Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

Sarah Ball,Niki Hasrati,Alexander Robey,Avi Schwarzschild,Frauke Kreuter,Zico Kolter,Andrej Risteski

Main category: cs.CL

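TL;DR: Studies how suffixes from discrete-optimization-based jailbreak attacks transfer across prompts and models, identifies three statistical properties that correlate strongly with transfer success, and verifies experimentally that these properties can be used to raise attack success rates.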

Motivation: Although the transferability of jailbreak suffixes is widely observed, there is no systematic analysis of why and when it occurs. This paper aims to fill that gap.

Method: The paper analyzes three statistical properties (how strongly a prompt without a suffix activates the model's refusal direction, how strongly the suffix pushes away from the refusal direction, and how large the shifts are in orthogonal directions), studies their relationship with transfer success, and validates the findings with interventional experiments.

Result: The three properties correlate significantly with transfer success, while prompt semantic similarity correlates only weakly. These findings can be used to raise success rates in practical attacks.

Conclusion: The transferability of jailbreak suffixes is governed mainly by the dynamics of the model's internal representations and not by surface semantics, which offers a new perspective for understanding and defending against such attacks.

Abstract: Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable, succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model's internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.

[14] Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang,Wenda Xu,Zhongtao Liu,Tetsuji Nakagawa,Markus Freitag

Main category: cs.CL

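TL;DR: Studies length bias in quality estimation (QE) metrics for machine translation and finds that existing QE metrics over-predict errors on long translations and prefer shorter ones. Two mitigation strategies, length normalization during training and incorporating reference translations during evaluation, both reduce the bias.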

Motivation: QE metrics are essential for reference-free evaluation and for tasks such as reinforcement learning, but their length bias has not been studied enough and may affect fairness and decision quality in practice.

Method: A systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 language pairs analyzes their behavior at different translation lengths, and proposes two mitigation strategies: length normalization and incorporating reference texts.

Result: QE metrics show two widespread length biases: they over-predict errors as translation length grows, and they prefer shorter translations among multiple candidates. Both proposed strategies reduce these biases.

Conclusion: Length bias is a significant flaw in current QE metrics and deserves attention. Improvements at training and evaluation time can substantially reduce it and make QE metrics fairer and more reliable.

Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
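The first bias, predicted error growing with length even for error-free text, can be checked with a plain correlation between translation length and a metric's output. The sketch below uses made-up numbers and a per-token division as one assumed form of length normalization. The abstract does not say how normalization is applied during training, so this only illustrates the diagnostic, not the proposed method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up QE outputs for error-free translations of increasing length.
lengths = [10, 20, 40, 80]                 # tokens per translation
predicted_errors = [1.0, 2.0, 4.0, 8.0]    # hypothetical predicted error scores

raw_bias = pearson(lengths, predicted_errors)  # 1.0: error tracks length exactly

# Assumed normalization: predicted error per token.
per_token = [e / n for e, n in zip(predicted_errors, lengths)]  # all 0.1
spread = max(per_token) - min(per_token)       # 0.0: the length trend is gone
```

In real data the correlation would be weaker and the normalized scores would not be constant, but the same two quantities give a quick before-and-after view of length dependence.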

[15] ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

Shayne Longpre,Sneha Kudugunta,Niklas Muennighoff,I-Hung Hsu,Isaac Caswell,Alex Pentland,Sercan Arik,Chen-Yu Lee,Sayna Ebrahimi

Main category: cs.CL

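TL;DR: Presents the largest multilingual scaling laws study to date and proposes the Adaptive Transfer Scaling Law (ATLAS), shedding light on multilingual learning dynamics, cross-lingual transfer, and the curse of multilinguality, and providing a scientific basis for scaling models efficiently beyond English.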

Motivation: Scaling laws research has focused mainly on English, yet mainstream AI models serve large numbers of non-English users, so scaling behavior in multilingual settings needs systematic study.

Method: The study runs 774 multilingual training experiments covering models from 10M to 8B parameters, more than 400 training languages, and 48 evaluation languages. It proposes the ATLAS scaling law and builds a cross-lingual transfer matrix to analyze interactions between languages and optimal scaling strategies.

Result: ATLAS clearly outperforms existing scaling laws in out-of-sample generalization (R² improves by more than 0.3). The study builds a transfer matrix over 38×38 language pairs, derives a language-agnostic scaling law, and identifies the computational crossover points between pretraining from scratch and fine-tuning multilingual checkpoints.

Conclusion: The study provides theoretical support and practical guidance for scaling multilingual models efficiently, and advances equal standing for non-English languages in AI scaling.

Abstract: Scaling laws research has focused overwhelmingly on English, yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws' out-of-sample generalization often by more than 0.3 R^2. Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 x 38 = 1444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models beyond English-first AI.

[16] Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models

Benjamin Reichman,Adar Avsian,Larry Heck

Main category: cs.CL

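TL;DR: Finds a low-dimensional emotional manifold inside large language models. Emotion is directionally encoded, distributed across layers, and aligned with interpretable dimensions, and it is stable across languages and datasets, which points to a universal emotional subspace that can be steered with an intervention module.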

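Motivation: To investigate how large language models represent emotion internally, and to reveal the geometric structure, universality, and controllability of these representations.

Method: The geometry of the hidden-state space is analyzed to identify an emotional manifold, linear probes and cross-domain alignment assess the stability and universality of the emotion representations, and an intervention module is designed to steer emotion.

Result: The study finds a stable low-dimensional emotional subspace that generalizes across languages. Emotion representations are directional, distributed, and interpretable; linear probes perform strongly; cross-domain alignment error is low; and basic emotions can be steered effectively.

Conclusion: Large language models contain a consistent and manipulable emotional geometry, which reveals how they process emotion internally and offers a new perspective on their emotion representations.

Abstract: This work investigates how large language models (LLMs) internally represent emotion by analyzing the geometry of their hidden-state space. The paper identifies a low-dimensional emotional manifold and shows that emotional representations are directionally encoded, distributed across layers, and aligned with interpretable dimensions. These structures are stable across depth and generalize to eight real-world emotion datasets spanning five languages. Cross-domain alignment yields low error and strong linear probe performance, indicating a universal emotional subspace. Within this space, internal emotion perception can be steered while preserving semantics using a learned intervention module, with especially strong control for basic emotions across languages. These findings reveal a consistent and manipulable affective geometry in LLMs and offer insight into how they internalize and process emotion.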

[17] Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds

Atij Mahesh

Main category: cs.CL

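TL;DR: Compares six control techniques for mitigating gender bias in large language models and finds that Supervised Fine-Tuning (SFT) satisfies compositional constraints best, while preference-based methods such as DPO perform poorly because they cannot model logical conjunctions, showing that explicit positive supervision is essential for fair and fluent generation.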

Details Motivation: 大语言模型在职业中性语境下仍表现出性别刻板印象,现有去偏方法的有效性和学习机制尚不明确,因此需要系统比较不同技术的性能与权衡。 Method: 评估六种去偏方法:仅提示、生成后过滤、基于DFA的Ctrl-G解码、监督微调(SFT)、直接偏好优化(DPO)和迭代零空间投影(INLP),在包含20个Winogender派生职业的复合约束任务上进行实验,衡量约束符合率、词汇多样性和流畅性。 Result: SFT达到99.87±0.15%的合规率且保持高词汇多样性;DPO尽管训练稳定,但仅4.53±0.82%合规;Ctrl-G虽保证完全合规,但显著降低流畅性和多样性;基于偏好的方法无法满足复合约束,因其偏好信号只能编码排序而非逻辑合取。 Conclusion: 显式正向监督(如SFT)是缓解复合偏见的关键,而基于偏好的对齐方法无法泛化逻辑结构,揭示了其在复杂约束任务中的局限性,强调了在可控生成中使用明确监督的必要性。 Abstract: Large Language Models (LLMs) still produce gender-stereotyped language even in occupation-neutral contexts that reflect deep societal biases (Rudinger et al., 2018). To address this, prior work has proposed prompting, constrained decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022). However, the comparative efficacy and learning dynamics remain little understood. We report a comparative analysis of six control techniques for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP). We evaluate each method on a compositional constraint task. This task requires generating sentences that contain at least one agentic and one communal descriptor for each of the twenty Winogender-derived occupations. We quantify trade-offs between control strength and naturalness with evaluations of constraint compliance, lexical diversity, and fluency. Our results reveal key contrasts among the methods: SFT achieves 99.87 +- 0.15% compliance and high lexical diversity, while DPO, despite similar training stability, fails at 4.53 +- 0.82%. Ctrl-G guarantees perfect compliance, but at the cost of severely reduced fluency and diversity. Preference-based learning fundamentally differs: it cannot satisfy compositional constraints, as binary preference signals encode ranking, not logical conjunctions. 
Only explicit positive supervision enables mitigation of compositional biases; preference-based alignment fails to generalize logical structures, underscoring the limitations of preference learning and the necessity of explicit supervision for fair and fluent controlled generation.
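
The compositional constraint in this abstract (each sentence must contain at least one agentic and one communal descriptor) can be illustrated with a minimal compliance check. This is only a sketch: the two word lists and the function names are hypothetical placeholders, since the paper's actual lexicons and scoring code are not given here.

```python
import re
from typing import Sequence

# Hypothetical descriptor lists for illustration only;
# the paper's actual lexicons are not reproduced here.
AGENTIC = {"assertive", "decisive", "confident"}
COMMUNAL = {"caring", "supportive", "warm"}

def is_compliant(sentence: str) -> bool:
    """True only if the sentence has >=1 agentic AND >=1 communal descriptor."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & AGENTIC) and bool(words & COMMUNAL)

def compliance_rate(sentences: Sequence[str]) -> float:
    """Fraction of sentences that satisfy the conjunction."""
    if not sentences:
        return 0.0
    return sum(is_compliant(s) for s in sentences) / len(sentences)
```

The point of the conjunction is that a sentence satisfying only one half counts as a failure, which is the kind of all-or-nothing target the abstract argues a ranking signal does not encode.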

[18] Generalization or Memorization: Dynamic Decoding for Mode Steering

Xuanming Zhang

Main category: cs.CL

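TL;DR: Proposes a unified framework based on the Information Bottleneck principle that uses Dynamic Mode Steering (DMS) to distinguish and control generalization versus memorization in large language models at inference time, improving logical consistency and factual accuracy.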

Details Motivation: Large language models show an unpredictable mix of generalization and verbatim memorization in practice, which undermines their reliability in high-stakes settings. Method: A theoretical model is built on the Information Bottleneck principle, defining generalization as learning a compressed, task-relevant representation and memorization as a failure to compress. On this basis, Dynamic Mode Steering (DMS) combines a lightweight linear probe that detects reliance on memorization with a dynamic activation steering mechanism that pushes the model toward generalization circuits. Result: Experiments on reasoning and faithfulness tasks show that DMS significantly improves logical consistency and factual accuracy. Conclusion: DMS offers a principled way to improve LLM reliability by identifying and controlling distinct reasoning modes. Abstract: Large Language Models (LLMs) exhibit a troubling duality, capable of both remarkable generalization and brittle, verbatim memorization of their training data. This unpredictability undermines their reliability in high-stakes applications. In this work, we propose a unified framework to understand, identify, and control these distinct reasoning modes. First, we introduce a theoretical model based on the Information Bottleneck (IB) principle, formalizing generalization as the learning of a compressed, task-relevant representation and memorization as a failure to compress. Building on this theory, we develop Dynamic Mode Steering (DMS), a novel inference-time algorithm which comprises two components: (1) a lightweight, causally-grounded linear probe that identifies the model's instantaneous reliance on memorization, and (2) a dynamic activation steering mechanism that nudges the model's computation towards pre-identified generalization circuits. We frame DMS as a form of adaptive, self-contrastive decoding. Experiments on reasoning and faithfulness tasks demonstrate that DMS significantly improves logical consistency and factual accuracy, thereby offering a principled approach to enhancing LLM reliability.

[19] Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows

Billy Dickson,Zoran Tiganj

Main category: cs.CL

TL;DR: Proposes a logarithmic compression of the input representation that strengthens a transformer's long-range memory without modifying its architecture.

Details Motivation: Existing long-context methods usually add model complexity (for example recurrence or auxiliary memory modules), which brings computational overhead and training difficulty; this paper explores a way to improve long-range dependency modeling without changing the transformer architecture. Method: Inspired by cognitive models of human memory, a scale-invariant logarithmic compression is applied to the input tokens, and the compressed sequence is fed to a standard, unmodified transformer. Result: The method is validated on the WikiText-103 and PG-19 language modeling benchmarks, where it reduces perplexity relative to uncompressed baselines, and performance keeps improving as the compressed context grows longer. Conclusion: Input-level logarithmic compression is a simple and effective way to extend a transformer's long-range memory while keeping the architecture simple. Abstract: Most approaches to long-context processing increase the complexity of the transformer's internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In this work, we introduce an alternative approach that modifies the input representation itself, rather than the transformer architecture. Inspired by cognitive models of human memory, our method applies a scale-invariant logarithmic compression to the input tokens. The resulting compressed representation is processed by a standard, unmodified transformer, preserving architectural simplicity. We evaluate this approach on the WikiText-103 and PG-19 language modeling benchmarks, showing a reduction in perplexity compared to uncompressed baselines. Moreover, performance improves consistently with longer compressed temporal contexts, showing that input-level logarithmic compression is a simple and effective way to extend a transformer's long-range memory.
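
One way to picture logarithmic compression of the input is to keep the most recent token at full resolution and pool older tokens into bins whose width grows geometrically with distance into the past. The sketch below assumes mean pooling and a doubling bin width; both choices, and the function names, are illustrative assumptions rather than the operator used in the paper, which the abstract does not specify.

```python
def log_bins(n: int, base: int = 2):
    """Split positions 0..n-1 into bins whose width grows geometrically with
    distance from the most recent token (position n-1). Returned oldest first."""
    bins, end, width = [], n, 1
    while end > 0:
        start = max(0, end - width)
        bins.append((start, end))
        end = start
        width *= base
    return bins[::-1]

def compress(vectors, base: int = 2):
    """Mean-pool each bin of token vectors (lists of floats).
    Mean pooling is an assumption made for this sketch."""
    out = []
    for start, end in log_bins(len(vectors), base):
        chunk = vectors[start:end]
        dim = len(chunk[0])
        out.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return out
```

With doubling bins, a context of 1,024 tokens collapses to 11 pooled vectors, so the sequence length seen by the transformer grows only logarithmically with the raw context length.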

[20] Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

Ling-Team,Ang Li,Ben Liu,Binbin Hu,Bing Li,Bingwei Zeng,Borui Ye,Caizhi Tang,Changxin Tian,Chao Huang,Chao Zhang,Chen Qian,Chenchen Ju,Chenchen Li,Chengfu Tang,Chili Fu,Chunshao Ren,Chunwei Wu,Cong Zhang,Cunyin Peng,Dafeng Xu,Daixin Wang,Dalong Zhang,Dingnan Jin,Dingyuan Zhu,Dongke Hu,Fangzheng Zhao,Feifan Wu,Feng Zhu,Gangshan Wang,Haitao Zhang,Hailin Zhao,Hanxiao Zhang,Hanzi Wang,Hao Qian,Haoyi Yu,Heng Zhang,Hongliang Zhang,Hongzhi Luan,Huirong Dong,Huizhong Li,Jia Li,Jia Liu,Jialong Zhu,Jian Sha,Jianping Wei,Jiaolong Yang,Jieyue Ma,Jiewei Wu,Jinjing Huang,Jingyun Tian,Jingyuan Zhang,Jinquan Sun,Juanhui Tu,Jun Liu,Jun Xu,Jun Zhou,Junjie Ou,Junpeng Fang,Kaihong Zhang,Kaiqin Hu,Ke Shi,Kun Tang,Kunlong Chen,Lanyin Mei,Lei Liang,Lei Xu,Libo Zhang,Lin Ju,Lin Yuan,Ling Zhong,Lintao Ma,Lu Liu,Lu Yu,Lun Cai,Meiqi Zhu,Mengying Li,Min Chen,Minghao Xue,Minghong Cai,Mingming Yin,Peijie Jiang,Peilong Zhao,Pingping Liu,Qian Zhao,Qing Cui,Qingxiang Huang,Qingyuan Yang,Quankun Yu,Shaowei Wei,Shijie Lian,Shoujian Zheng,Shun Song,Shungen Zhang,Shuo Zhang,Siyuan Li,Song Liu,Ting Guo,Tong Zhao,Wanli Gu,Weichang Wu,Weiguang Han,Wenjing Fang,Wubin Wang,Xiang Shu,Xiao Shi,Xiaoshun Lan,Xiaolu Zhang,Xiaqing Sun,Xin Zhao,Xingyu Lu,Xiong Xu,Xudong Wang,Xudong Wang,Xuemin Yang,Yajie Yang,Yang Xiang,Yanzhe Li,Yi Zhang,Yilong Wang,Yingxue Li,Yongzhen Guo,Yuzhuo Fu,Yuanyuan Wang,Yue Yang,Yue Yu,Yufeng Deng,Yun Zhang,Yunfei Xu,Yuqi Zhang,Yuxiao He,Zengke Gui,Zhaoxin Huan,Zhaoyang Wang,Zhibo Zhu,Zhihao Wang,Zhiqiang Zhang,Zhoufei Wang,Zihang Zeng,Ziqi Liu,Zitao Xuan,Zuoli Tang

Main category: cs.CL

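TL;DR: Ling 2.0 is a reasoning-oriented language foundation series built on a high-sparsity Mixture-of-Experts architecture. It scales from tens of billions to one trillion parameters and establishes a new Pareto frontier between computational efficiency and reasoning capability.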

Details Motivation: The goal is to build scalable, efficient language models focused on reasoning, addressing the poor computational efficiency of traditional dense models at large scale. Method: A unified MoE architecture with high sparsity and cross-scale consistency is combined with MTP, CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training to optimize reasoning efficiency and performance. Result: Three non-thinking models are released, Ling-mini-2.0, Ling-flash-2.0, and Ling-1T, ranging from 16B to 1T parameters, with up to 7-fold active-compute efficiency; at the trillion scale, Ling-1T reaches a new balance between reasoning accuracy and efficiency. Conclusion: Ling 2.0 provides a coherent, open, and efficient foundation for future reasoning and thinking models, and shows that sparse activation aligned with reasoning objectives enables scalable intelligence. Abstract: We introduce Ling 2.0, a reasoning-oriented language foundation series built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.

[21] OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue

Tianhong Gao,Jundong Shen,Bei Shi,Jiapeng Wang,Ying Ju,Junfeng Yao,Jiao Ran,Yong Zhang,Lin Dong,Huiyu Yu,Tingting Ye

Main category: cs.CL

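TL;DR: This paper proposes OlaMind, a human-like and hallucination-safe intelligent customer service framework for retrieval-augmented dialogue. It learns the reasoning processes and response strategies of human experts and self-refines through supervised fine-tuning combined with reinforcement learning, significantly raising the intelligent resolution rate and lowering the human takeover rate in real social customer service scenarios.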

Details Motivation: Existing intelligent customer service systems based on retrieval-augmented generation (RAG) are prone to hallucinations and mechanical replies, which hurts user experience and creates business risk, especially in Web-based customer service. Method: OlaMind has two stages: a Learn-to-Think stage that learns the reasoning and response strategies of human experts, and a Learn-to-Respond stage that applies cold-start supervised fine-tuning combined with reinforcement learning for basic-to-hard self-refinement. Result: In large-scale online A/B tests, OlaMind raises the intelligent resolution rate by 28.92% and 18.42% and lowers the human takeover rate by 6.08% and 7.12% in the community-support and livestream-interaction scenarios, respectively. Conclusion: OlaMind improves the naturalness and safety of intelligent customer service, markedly reduces hallucinations and business risk, and is consistently effective across several real-world scenarios. Abstract: Intelligent customer service (ICS) systems via retrieval-augmented generation (RAG) have been widely adopted in Web-based domains such as social platforms and e-commerce, achieving remarkable improvements in automation and efficiency. However, notable limitations still remain: these systems are prone to hallucinations and often generate rigid, mechanical responses, which can introduce business risks and undermine user experience, especially in Web-based customer service interactions under the RAG scenarios. In this paper, we introduce OlaMind, a human-like and hallucination-safe customer service framework for retrieval-augmented dialogue. Specifically, it first leverages a Learn-to-Think stage to learn the reasoning processes and response strategies from human experts, and then employs a Learn-to-Respond stage to perform cold-start supervised fine-tuning (SFT) combined with reinforcement learning (RL) for basic-to-hard self-refinement. Our method significantly enhances human-likeness and naturalness while effectively mitigating hallucinations and critical business risks. We have conducted large-scale online A/B experiments in an industry-level social customer service setting, and extensive experimental results show that OlaMind achieves significant cumulative relative improvements with intelligent resolution rates +28.92%/+18.42% and human takeover rate -6.08%/-7.12% in community-support/livestream-interaction scenarios, respectively, which highlights its consistent effectiveness across diverse real-world applications. The code and data will be publicly available.

[22] SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language

Rahul Ranjan,Mahendra Kumar Gurve,Anuj,Nitin,Yamuna Prasad

Main category: cs.CL

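TL;DR: This paper introduces the first Maithili dataset for explainable sentiment analysis. It contains 3,221 sentences annotated with sentiment polarity and natural-language justifications, is validated by linguistic experts, and supports research in interpretable multilingual NLP.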

Details Motivation: Maithili is a low-resource language that lacks adequate NLP resources, and sentiment analysis in particular lacks fine-grained and interpretable annotations, so a high-quality benchmark dataset is urgently needed. Method: A Maithili dataset is built with sentiment polarity labels and justifications written in the native language, and it is validated experimentally with classical machine learning and state-of-the-art transformer models. Result: The dataset has been reviewed by linguistic experts, offers high label reliability and contextual fidelity, and experiments show it is effective for interpretable sentiment analysis. Conclusion: This work establishes the first benchmark for explainable affective computing in Maithili and provides a valuable resource for multilingual NLP and explainable AI. Abstract: Developing benchmark datasets for low-resource languages poses significant challenges, primarily due to the limited availability of native linguistic experts and the substantial time and cost involved in annotation. Given these challenges, Maithili is still underrepresented in natural language processing research. It is an Indo-Aryan language spoken by more than 13 million people in the Purvanchal region of India, valued for its rich linguistic structure and cultural significance. While sentiment analysis has achieved remarkable progress in high-resource languages, resources for low-resource languages, such as Maithili, remain scarce, often restricted to coarse-grained annotations and lacking interpretability mechanisms. To address this limitation, we introduce a novel dataset comprising 3,221 Maithili sentences annotated for sentiment polarity and accompanied by natural language justifications. Moreover, the dataset is carefully curated and validated by linguistic experts to ensure both label reliability and contextual fidelity. Notably, the justifications are written in Maithili, thereby promoting culturally grounded interpretation and enhancing the explainability of sentiment models. Furthermore, extensive experiments using both classical machine learning and state-of-the-art transformer architectures demonstrate the dataset's effectiveness for interpretable sentiment analysis. Ultimately, this work establishes the first benchmark for explainable affective computing in Maithili, thus contributing a valuable resource to the broader advancement of multilingual NLP and explainable AI.

[23] DETECT: Determining Ease and Textual Clarity of German Text Simplifications

Maria Korobeynikova,Alessia Battisti,Lukas Fischer,Yingqiang Gao

Main category: cs.CL

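TL;DR: This paper proposes DETECT, the first holistic evaluation metric for German automatic text simplification (ATS). Trained on synthetic large language model responses, it substantially outperforms existing metrics on simplicity, meaning preservation, and fluency.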

Details Motivation: Evaluation of German ATS currently relies on general-purpose metrics (such as SARI, BLEU, and BERTScore) that do not adequately measure simplification quality, and the lack of human-annotated corpora has held back dedicated metrics. Method: Drawing on the English LENS framework and adapting it to German, DETECT uses large language models to generate synthetic quality scores as training data and adds an LLM-based refinement step that aligns grading criteria with simplification requirements. Result: DETECT correlates with human judgments substantially better than common ATS metrics, especially on meaning preservation and fluency; the authors also build what they describe as the largest German human evaluation dataset to validate it. Conclusion: DETECT is the first holistic automatic evaluation metric designed for German ATS; it demonstrates that evaluation models can be trained on synthetic data and provides transferable guidelines for language accessibility tasks. Abstract: Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.

[24] Estimating the Error of Large Language Models at Pairwise Text Comparison

Tianyi Li

Main category: cs.CL

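TL;DR: Proposes a ground-truth-free method for measuring LLM output error. Pairwise text comparisons are used to estimate the error rate of LLM preferences under either a uniform error rate or positional bias, and Copeland counting produces a ranking of the texts. The analysis reveals the poor scalability of LLM-based comparison, and Claude performs best in the experiments.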

Details Motivation: Existing methods struggle to assess LLM output error in pairwise text comparison accurately, and in particular lack reliable estimates of positional bias and of error when no ground truth is available, so a more robust, ground-truth-free error measure is needed. Method: The method is based on repeated pairwise comparisons in two scenarios: (i) a uniform, order-independent error rate, estimated from two comparisons per text pair with the order swapped; (ii) binary positional bias, with separate error rates for the two orders estimated from repeated comparisons. Copeland counting builds a ranking of the texts from which the error rates are derived, and the method is validated on six mainstream LLMs. Result: Consistent error estimates are obtained for six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) on five types of text input; the two positional bias terms are similar and close to the uniform error rate; Claude performs best in error rate and prompt robustness; the method outperforms the biased Bradley-Terry model and the commutativity score. Conclusion: The method estimates LLM output error and positional bias in pairwise comparison, exposes the limited scalability of current LLMs on this task, and offers a practical ground-truth-free evaluation tool; Claude shows the best overall performance in these experiments. Abstract: We measure LLMs' output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts. The Copeland counting constructs a ranking over the compared texts from pairwise preferences; the ranking reveals the poor scalability of LLM-based pairwise comparison and helps yield the estimates for LLMs' error rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five types of text input and obtain consistent estimates of LLMs' error. In general, the measured two positional bias terms are similar, close to the uniform error. Considering both the error rates and the robustness to the variation of prompts, Claude obtained the most desirable performance in this experiment. Our model outperforms the biased Bradley-Terry model and the commutativity score in indicating LLMs' error at this task.
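
Copeland counting, which the abstract uses to turn pairwise preferences into a ranking, can be sketched as follows. The `prefers` and `judge` callables are hypothetical stand-ins for an LLM comparison call, and `order_flip_rate` is a simple consistency diagnostic added for illustration; it is not the paper's error estimator.

```python
from itertools import combinations

def copeland_ranking(items, prefers):
    """Rank items by Copeland score: +1 per pairwise win, -1 per pairwise loss.
    `prefers(a, b)` must return whichever of a or b is preferred."""
    score = {x: 0 for x in items}
    for a, b in combinations(items, 2):
        winner = prefers(a, b)
        loser = b if winner == a else a
        score[winner] += 1
        score[loser] -= 1
    ranking = sorted(items, key=lambda x: score[x], reverse=True)
    return ranking, score

def order_flip_rate(pairs, judge):
    """Fraction of pairs whose verdict changes when presentation order is swapped.
    `judge(first, second)` returns the preferred text."""
    flips = sum(judge(a, b) != judge(b, a) for a, b in pairs)
    return flips / len(pairs)
```

If one assumes a uniform error rate ε with independent errors across the two orderings, the two verdicts disagree exactly when one of them is wrong, so the expected flip rate is 2ε(1 − ε). Note that the number of comparisons in `copeland_ranking` grows quadratically with the number of texts.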

[25] Evolution of the lexicon: a probabilistic point of view

Maurizio Serva

Main category: cs.CL

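TL;DR: This article analyzes the limits of the Swadesh approach for determining the temporal separation between languages and points out that gradual lexical modification, a second stochastic process, contributes substantially to lexical evolution; taking it into account significantly improves the precision of temporal separation estimates.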

Details Motivation: The basic assumptions of the Swadesh approach are often unrealistic because of various contamination phenomena and misjudgments, and its accuracy is limited on purely probabilistic grounds, so a more precise way to estimate the temporal separation between languages is needed. Method: A probabilistic analysis examines how two stochastic processes, word replacement and gradual lexical modification, shape lexical evolution, and assesses their effect on the precision of temporal separation estimates. Result: Even under ideal conditions the accuracy of the Swadesh approach is limited by probabilistic factors, and accounting for gradual lexical modification significantly improves the precision of the estimate. Conclusion: Lexical evolution is driven not only by word replacement but also by gradual modification, and considering both stochastic processes improves estimates of the temporal separation between languages. Abstract: The Swadesh approach for determining the temporal separation between two languages relies on the stochastic process of word replacement (when a completely new word emerges to represent a given concept). It is well known that the basic assumptions of the Swadesh approach are often unrealistic due to various contamination phenomena and misjudgments (horizontal transfers, variations over time and space of the replacement rate, incorrect assessments of cognacy relationships, presence of synonyms, and so on). All of this means that the results cannot be completely correct. More importantly, even in the unrealistic case that all basic assumptions are satisfied, simple mathematics places limits on the accuracy of estimating the temporal separation between two languages. These limits, which are purely probabilistic in nature and which are often neglected in lexicostatistical studies, are analyzed in detail in this article. Furthermore, in this work we highlight that the evolution of a language's lexicon is also driven by another stochastic process: gradual lexical modification of words. We show that this process also represents a major contribution to the reshaping of the vocabulary of languages over the centuries and we also show, from a purely probabilistic perspective, that taking into account this second random process significantly increases the precision in determining the temporal separation between two languages.
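
The purely probabilistic limit mentioned in the abstract can be illustrated with the textbook word-replacement model, which may differ from the article's exact formulation. The sketch assumes that each of N list items is replaced independently at a constant rate in each language, so an item survives in both with probability exp(-2 · rate · t); the function name and the numbers in the usage note are hypothetical.

```python
import math

def separation_estimate(shared: int, n_words: int, rate: float):
    """Estimate separation time t and its approximate standard error.

    Assumes independent replacement at a constant `rate` per unit time in each
    language, so the shared fraction c has expectation exp(-2 * rate * t).
    Requires 0 < shared <= n_words.
    """
    c = shared / n_words
    t_hat = -math.log(c) / (2 * rate)
    # Delta method on a binomial proportion:
    # Var(c) = c(1 - c) / N and dt/dc = -1 / (2 * rate * c).
    std_err = math.sqrt((1 - c) / (n_words * c)) / (2 * rate)
    return t_hat, std_err
```

For example, with 100 shared items out of 200 and a hypothetical rate of 0.14 per millennium, `separation_estimate(100, 200, 0.14)` gives an estimate of about 2.48 millennia with a standard error of about 0.25, even though every model assumption holds. The error shrinks only as the square root of the list length.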

[26] You Don't Need Prompt Engineering Anymore: The Prompting Inversion

Imran Khan

Main category: cs.CL

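TL;DR: This paper proposes "Sculpting," a constrained prompting method intended to improve on standard Chain-of-Thought (CoT) prompting by reducing semantic ambiguity and common-sense errors. Evaluated on GSM8K across several OpenAI models, it reveals a "Prompting Inversion": Sculpting beats CoT on gpt-4o but does worse on gpt-5, because the constraints make advanced models hyper-literal. The study argues that prompting strategies must co-evolve with model capabilities and that stronger models may be better served by simpler prompts.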

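Details Motivation: Standard Chain-of-Thought prompting improves LLM reasoning but remains vulnerable to semantic ambiguity and flawed common sense. The author aims to design a more reliable prompting method that improves reasoning accuracy further without introducing new errors. Method: Sculpting is a rule-based prompting method with structured constraints that restricts the model's generation path to avoid common reasoning errors. Three strategies (Zero Shot, standard CoT, and Sculpting) are compared on the GSM8K mathematical reasoning dataset across gpt-4o-mini, gpt-4o, and gpt-5. Result: Sculpting reaches 97% accuracy on gpt-4o versus 93% for standard CoT, but on gpt-5 it reaches 94.00% versus 96.36% for CoT, a "Prompting Inversion". Error analysis shows that the constraints induce hyper-literalism in high-end models: the model follows the rules too strictly and loses flexibility. Conclusion: The effectiveness of a prompting strategy changes with model capability, so optimal prompting should co-evolve with models. For more capable models, simple prompts may work better than complex constraints; prompt design should take the model tier into account rather than carrying strategies that work on mid-tier models directly over to advanced ones. Abstract: Prompt engineering, particularly Chain-of-Thought (CoT) prompting, significantly enhances LLM reasoning capabilities. We introduce "Sculpting," a constrained, rule-based prompting method designed to improve upon standard CoT by reducing errors from semantic ambiguity and flawed common sense. We evaluate three prompting strategies (Zero Shot, standard CoT, and Sculpting) across three OpenAI model generations (gpt-4o-mini, gpt-4o, gpt-5) using the GSM8K mathematical reasoning benchmark (1,317 problems). Our findings reveal a "Prompting Inversion": Sculpting provides advantages on gpt-4o (97% vs. 93% for standard CoT), but becomes detrimental on gpt-5 (94.00% vs. 96.36% for CoT on full benchmark). We trace this to a "Guardrail-to-Handcuff" transition where constraints preventing common-sense errors in mid-tier models induce hyper-literalism in advanced models. Our detailed error analysis demonstrates that optimal prompting strategies must co-evolve with model capabilities, suggesting simpler prompts for more capable models.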

[27] SteerX: Disentangled Steering for LLM Personalization

Xiaoyan Zhao,Ming Yan,Yilun Qiu,Haoting Ni,Yang Zhang,Fuli Feng,Hong Cheng,Tat-Seng Chua

Main category: cs.CL

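TL;DR: Proposes SteerX, a disentangled activation steering method grounded in causal inference that extracts preference-driven information from user history more precisely, improving personalized LLM generation.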

Details Motivation: Existing activation steering methods compute the steering vector from all historical data, but not all content reflects true user preferences, so the personalization signal is diluted or contaminated. Method: Grounded in causal inference theory, token-level causal effects are estimated to identify preference-driven tokens, which are transformed into a coherent description and used to produce disentangled activation steering vectors. Result: Experiments with two representative steering backbone methods on real-world datasets show that SteerX consistently improves steering vector quality and personalized generation. Conclusion: By focusing on information that is truly preference-driven, SteerX produces more accurate activation steering vectors and offers a practical solution for LLM personalization. Abstract: Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users' daily life and work. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. Activation steering, which directly leverages directions representing user preference in the LLM activation space to adjust its behavior, offers a cost-effective way to align the model's outputs with individual users. However, existing methods rely on all historical data to compute the steering vector, ignoring that not all content reflects true user preferences, which undermines the personalization signal. To address this, we propose SteerX, a disentangled steering method that isolates preference-driven components from preference-agnostic components. Grounded in causal inference theory, SteerX estimates token-level causal effects to identify preference-driven tokens, transforms these discrete signals into a coherent description, and then leverages them to steer personalized LLM generation. By focusing on the truly preference-driven information, SteerX produces more accurate activation steering vectors and enhances personalization. Experiments on two representative steering backbone methods across real-world datasets demonstrate that SteerX consistently enhances steering vector quality, offering a practical solution for more effective LLM personalization.

[28] PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

Iliass Ayaou,Denis Cavallucci

Main category: cs.CL

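TL;DR: This paper presents PatenTEB, a comprehensive patent text embedding benchmark with 15 tasks and 2.06 million examples covering retrieval, classification, paraphrase, and clustering, with domain-stratified splits and hard negative mining designed for patent-specific challenges. It also develops the patembed model family, which achieves state-of-the-art results on several tasks in external validation through multi-task training.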

Details Motivation: Existing text embedding benchmarks do not adequately reflect patent-specific challenges such as asymmetric fragment-to-document matching and specialized terminology, so a comprehensive benchmark dedicated to patent text is needed. Method: The PatenTEB benchmark comprises 15 tasks and uses domain-stratified splits and domain-specific hard negative mining; the patembed model family, spanning 67M to 344M parameters, is trained with multi-task and contrastive learning, and the effects of multi-task training and domain pretraining are analyzed systematically. Result: patembed-base reaches 0.494 V-measure on MTEB BigPatentClustering.v2 (previous best 0.445); patembed-large reaches 0.377 NDCG@100 on DAPFAM; ablations show that multi-task training and domain pretraining both improve external generalization. Conclusion: PatenTEB captures the key challenges of patent text embedding, and the patembed models generalize well through multi-task learning, confirming the value of domain-specific benchmarks and model design. Abstract: Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain-specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb. Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.

[29] From Slides to Chatbots: Enhancing Large Language Models with University Course Materials

Tu Anh Dinh,Philipp Nicolas Schumacher,Jan Niehues

Main category: cs.CL

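TL;DR: This study examines how university course materials (such as lecture slides and lecture transcripts) can improve the question-answering performance of large language models (LLMs) in university computer science courses. It compares Retrieval-Augmented Generation (RAG) with Continual Pre-Training (CPT) and finds that RAG is more efficient, and that multi-modal RAG (feeding slides as images) clearly outperforms text-only retrieval.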

Details Motivation: LLMs have great potential in higher education, but their ability to answer questions accurately in university computer science courses is still limited. Course materials are diverse and structurally complex (slides with figures, informal spoken transcripts), and conventional text processing does not exploit them well, so more effective ways to incorporate this knowledge are needed. Method: Two ways of adding course knowledge to an LLM are compared: Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT). For slides with visual elements, a multi-modal RAG approach passes the retrieved slide content to the generator as images, preserving formulas and layout. Performance is evaluated on data from real university computer science courses. Result: With relatively small course materials, RAG is more effective and efficient than CPT. For lecture slides in particular, multi-modal RAG with image input significantly outperforms conventional RAG based on extracted text and improves answer accuracy. Conclusion: For small, multi-modal university course materials, retrieval-augmented generation (especially multi-modal RAG) is the better strategy for improving LLM performance in educational applications. The study offers a practical recipe for building AI teaching assistants that may carry over to other educational domains. Abstract: Large Language Models (LLMs) have advanced rapidly in recent years. One application of LLMs is to support student learning in educational settings. However, prior work has shown that LLMs still struggle to answer questions accurately within university-level computer science courses. In this work, we investigate how incorporating university course materials can enhance LLM performance in this setting. A key challenge lies in leveraging diverse course materials such as lecture slides and transcripts, which differ substantially from typical textual corpora: slides also contain visual elements like images and formulas, while transcripts contain spoken, less structured language. We compare two strategies, Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT), to extend LLMs with course-specific knowledge. For lecture slides, we further explore a multi-modal RAG approach, where we present the retrieved content to the generator in image form. Our experiments reveal that, given the relatively small size of university course materials, RAG is more effective and efficient than CPT. Moreover, incorporating slides as images in the multi-modal setting significantly improves performance over text-only retrieval. These findings highlight practical strategies for developing AI assistants that better support learning and teaching, and we hope they inspire similar efforts in other educational contexts.

[30] Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER

Andrei Baroian

Main category: cs.CL

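TL;DR: This paper studies clinical Named Entity Recognition (NER) on the CADEC corpus and compares three approaches: BERT-style models, GPT-4o with few-shot in-context learning, and GPT-4o with supervised fine-tuning. BERT-style models show limited gains, simple in-context learning beats complex prompts, and supervised fine-tuning performs best (F1 ≈ 87.1%) at higher cost. LLMs also do better when the task is simplified to two labels.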

Details Motivation: The aim is to compare deep learning models and large language models on clinical-text NER and to identify the best approach and its cost-effectiveness. Method: Three families of methods are used: BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), GPT-4o with few-shot in-context learning under different prompt designs, and GPT-4o with supervised fine-tuning. Standard NER metrics are evaluated over the five entity types in CADEC. Result: RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base; ICL with a simple prompt beats a complex prompt; SFT achieves the highest performance (F1 ≈ 87.1%); LLMs are more accurate on the simplified two-label task. Conclusion: Supervised fine-tuning of GPT-4o performs best for clinical NER but costs more; prompts should be kept simple; simplifying the task improves LLM accuracy. Abstract: We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs. complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC's five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base, showing the limits of this family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 ≈ 87.1%), albeit with higher cost. We find that the LLMs achieve higher accuracy on simplified tasks, restricting classification to two labels.

[31] Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling

Antal van den Bosch,Ainhoa Risco Patón,Teun Buijse,Peter Berck,Maarten van Gompel

Main category: cs.CL

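TL;DR: Presents OLIFANT, a memory-based language model offered as an efficient, eco-friendly alternative, and compares it with GPT-2 and GPT-Neo on next-token prediction accuracy, estimated emissions, and speed.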

Details Motivation: To find an efficient and eco-friendly alternative to deep neural network language models. Method: A memory-based approach that implements fast approximations of k-nearest-neighbor classification and relies entirely on CPUs for training and inference. Result: OLIFANT shows log-linearly scalable next-token prediction performance, strong memorization, a small ecological footprint, low latency, and simple, transparent internal workings. Conclusion: Memory-based language modeling is a viable, efficient, and environmentally friendly approach with practical potential. Abstract: We present memory-based language modeling as an efficient, eco-friendly alternative to deep neural network-based language modeling. It offers log-linearly scalable next-token prediction performance and strong memorization capabilities. Implementing fast approximations of k-nearest neighbor classification, memory-based language modeling leaves a relatively small ecological footprint both in training and in inference mode, as it relies fully on CPUs and attains low token latencies. Its internal workings are simple and fully transparent. We compare our implementation of memory-based language modeling, OLIFANT, with GPT-2 and GPT-Neo on next-token prediction accuracy, estimated emissions and speeds, and offer some deeper analyses of the model.
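
The general idea of predicting the next token by looking up stored training contexts can be sketched with a toy model. This is not OLIFANT's implementation: exact-match lookup with back-off to shorter context suffixes is used here as a crude stand-in for approximate k-nearest-neighbor classification, and the class name `MemoryLM` is hypothetical.

```python
from collections import Counter, defaultdict

class MemoryLM:
    """Toy memory-based next-token predictor: store every training context and
    back off to shorter suffixes when the full context was never seen."""

    def __init__(self, context_size: int = 3):
        self.n = context_size
        self.memory = defaultdict(Counter)  # context tuple -> next-token counts

    def train(self, tokens):
        for i in range(len(tokens)):
            for k in range(self.n + 1):  # index contexts of length 0..n
                if i - k < 0:
                    break
                self.memory[tuple(tokens[i - k:i])][tokens[i]] += 1

    def predict(self, context):
        context = tuple(context[-self.n:]) if self.n else ()
        for k in range(len(context), -1, -1):  # longest matching suffix first
            counts = self.memory.get(context[len(context) - k:])
            if counts:
                return counts.most_common(1)[0][0]
        return None  # nothing stored yet
```

Training is a single pass that only increments counters, and prediction is a handful of dictionary lookups, which is why this family of models runs comfortably on CPUs; every prediction can also be traced back to the stored contexts that produced it.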

[32] Multilingual Target-Stance Extraction

Ethan Mines,Bonnie Dorr

Main category: cs.CL

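TL;DR: This paper introduces the first multilingual Target-Stance Extraction (TSE) benchmark, covering six languages, and extends the original TSE pipeline to a multilingual setting without a separate model per language. The pipeline reaches a modest F1 of 12.78, which highlights the difficulty of the task and shows that target prediction is the main bottleneck. The paper is also the first to show how sensitive the TSE F1 score is to target verbalization, and it provides a needed baseline for multilingual TSE.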

Details Motivation: Existing Target-Stance Extraction (TSE) research is limited to English and lacks multilingual benchmarks and evaluation methods, which restricts its use in broader contexts. Method: A multilingual TSE dataset is built covering six languages (Catalan, Estonian, French, Italian, Mandarin, and Spanish), and the original TSE pipeline is extended to handle multilingual input, using a single model for target identification and stance classification. Result: The pipeline achieves an F1 score of 12.78, showing that multilingual TSE is harder than the English-only setting, that target identification is the main bottleneck, and that the F1 score is highly sensitive to how targets are verbalized. Conclusion: The study establishes the first benchmark for multilingual TSE, exposes the limits of current methods, stresses the need for better target identification and evaluation criteria, and provides baseline resources and directions for future multilingual stance analysis. Abstract: Social media enables data-driven analysis of public opinion on contested issues. Target-Stance Extraction (TSE) is the task of identifying the target discussed in a document and the document's stance towards that target. Many works classify stance towards a given target in a multilingual setting, but all prior work in TSE is English-only. This work introduces the first multilingual TSE benchmark, spanning Catalan, Estonian, French, Italian, Mandarin, and Spanish corpora. It manages to extend the original TSE pipeline to a multilingual setting without requiring separate models for each language. Our model pipeline achieves a modest F1 score of 12.78, underscoring the increased difficulty of the multilingual task relative to English-only setups and highlighting target prediction as the primary bottleneck. We are also the first to demonstrate the sensitivity of TSE's F1 score to different target verbalizations. Together these serve as a much-needed baseline for resources, algorithms, and evaluation criteria in multilingual TSE.

[33] FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation

Mohammad Aghajani Asl,Majid Asgari-Bidhendi,Behrooz Minaei-Bidgoli

Main category: cs.CL

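TL;DR: This paper proposes FAIR-RAG, a retrieval-augmented generation framework built around Structured Evidence Assessment (SEA). Through an iterative refinement cycle and explicit information-gap analysis, it substantially improves performance on complex multi-hop question answering and sets new state-of-the-art results on benchmarks such as HotpotQA.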

Details Motivation: Existing RAG methods lack a systematic way to identify and fill evidence gaps when handling complex multi-hop queries; they tend to propagate noise or fail to gather complete context, which leads to inaccurate reasoning. Method: FAIR-RAG is built around an iterative refinement cycle containing a Structured Evidence Assessment (SEA) module. SEA decomposes the question into a checklist of required findings and audits the collected evidence to identify confirmed facts and explicit information gaps; these gaps guide an adaptive query refinement agent that generates targeted sub-queries to retrieve the missing information until the evidence is sufficient. Result: On multi-hop QA benchmarks including HotpotQA, 2WikiMultiHopQA, and MusiQue, FAIR-RAG significantly outperforms strong baselines in a unified experimental setup. On HotpotQA it reaches an F1 score of 0.453, an absolute improvement of 8.3 points over the strongest iterative baseline. Conclusion: A structured, evidence-driven refinement process with explicit gap analysis is crucial for reliable and accurate reasoning on complex knowledge-intensive tasks, and FAIR-RAG offers an effective pattern for such advanced RAG systems. Abstract: While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. 
On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
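The retrieve, audit, and re-query cycle described in the FAIR-RAG abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the checklist is passed in by hand rather than produced by an LLM, `assess` is a toy substring check standing in for SEA, the "details about ..." rewrite stands in for the Adaptive Query Refinement agent, and `toy_retrieve` with its tiny `CORPUS` is a hypothetical retriever.

```python
from dataclasses import dataclass, field


@dataclass
class Assessment:
    confirmed: list = field(default_factory=list)  # findings backed by evidence
    gaps: list = field(default_factory=list)       # findings still missing


def assess(checklist, evidence):
    """Toy stand-in for SEA: a finding counts as confirmed if any passage mentions it."""
    report = Assessment()
    for finding in checklist:
        if any(finding in passage for passage in evidence):
            report.confirmed.append(finding)
        else:
            report.gaps.append(finding)
    return report


def iterative_refinement(checklist, retrieve, max_rounds=3):
    """Retrieve, audit, and re-query until no gaps remain or the round budget is spent."""
    evidence = []
    queries = list(checklist)  # first round: one query per required finding
    report = assess(checklist, evidence)
    for _ in range(max_rounds):
        for query in queries:
            evidence.extend(retrieve(query))
        report = assess(checklist, evidence)
        if not report.gaps:
            break
        # Hypothetical refinement rule: turn each gap into a targeted sub-query.
        queries = [f"details about {gap}" for gap in report.gaps]
    return evidence, report


# Hypothetical two-entry corpus keyed by query string.
CORPUS = {
    "director": ["The film's director was born in Oslo."],
    "details about birthplace": ["Oslo is the birthplace mentioned in the record."],
}


def toy_retrieve(query):
    return CORPUS.get(query, [])


evidence, report = iterative_refinement(["director", "birthplace"], toy_retrieve)
```

In this toy run the first round leaves "birthplace" as a gap, and the refined sub-query closes it in the second round; a real system would replace each stand-in with model calls.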

[34] Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models

Fiaz Ahmad,Nisar Hussain,Amna Qasim,Momina Hafeez,Muhammad Usman,Grigori Sidorov,Alexander Gelbukh

Main category: cs.CL

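TL;DR: This study detects irony in Urdu by translating an English irony corpus into Urdu. Combining word embeddings with machine learning models and fine-tuned large transformer models (such as LLaMA 3) yields strong performance on Urdu irony detection, with LLaMA 3 (8B) reaching an F1 score of 94.61%.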

Details Motivation: Low-resource languages such as Urdu pose challenges for irony detection because of differences in syntax and cultural context, and related research and datasets are scarce, so effective irony detection methods need to be explored. Method: An English irony corpus is translated into Urdu; ten machine learning algorithms are evaluated with GloVe and Word2Vec embeddings, and transformer models including BERT, RoBERTa, LLaMA 2, LLaMA 3, and Mistral are fine-tuned for comparison. Result: Gradient Boosting performs best among the machine learning models with an F1 score of 89.18%; LLaMA 3 (8B) performs best among the transformer models with an F1 score of 94.61%. Conclusion: Combining transliteration techniques with modern NLP models effectively improves irony detection for low-resource languages such as Urdu, indicating that large pre-trained models have a clear advantage in cross-lingual irony detection. Abstract: Irony identification is a challenging task in Natural Language Processing, particularly when dealing with languages that differ in syntax and cultural context. In this work, we aim to detect irony in Urdu by translating an English Ironic Corpus into the Urdu language. We evaluate ten state-of-the-art machine learning algorithms using GloVe and Word2Vec embeddings, and compare their performance with classical methods. Additionally, we fine-tune advanced transformer-based models, including BERT, RoBERTa, LLaMA 2 (7B), LLaMA 3 (8B), and Mistral, to assess the effectiveness of large-scale models in irony detection. Among machine learning models, Gradient Boosting achieved the best performance with an F1-score of 89.18%. Among transformer-based models, LLaMA 3 (8B) achieved the highest performance with an F1-score of 94.61%. These results demonstrate that combining transliteration techniques with modern NLP models enables robust irony detection in Urdu, a historically low-resource language.

[35] GigaEmbeddings: Efficient Russian Language Embedding Model

Egor Kolodin,Daria Khomich,Nikita Savushkin,Anastasia Ianina,Fyodor Minkin

Main category: cs.CL

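TL;DR: Proposes GigaEmbeddings, an efficient text embedding framework for Russian that reaches state-of-the-art performance on the ruMTEB benchmark through three-stage training.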

Details Motivation: Existing Russian text embedding methods are limited in multitask performance and efficiency, and lack unified optimization objectives and high-quality training data. Method: A three-stage pipeline of large-scale contrastive pre-training, fine-tuning with hard negatives, and multitask generalization, combined with architectural innovations: bidirectional attention, latent attention pooling, and pruning of 25% of the layers. Result: An average score of 69.1 across the 23 tasks of the ruMTEB benchmark, surpassing strong baselines with more parameters. Conclusion: Through hierarchical instruction tuning and architectural optimization, GigaEmbeddings achieves high performance with fewer parameters and provides an efficient solution for Russian embeddings. Abstract: We introduce GigaEmbeddings, a novel framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning of a decoder-only LLM designed specifically for the Russian language (GigaChat-3B). Our three-stage pipeline, comprising large-scale contrastive pre-training on web-scale corpora, fine-tuning with hard negatives, and multitask generalization across retrieval, classification, and clustering tasks, addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. Architectural innovations include bidirectional attention for contextual modeling, latent attention pooling for robust sequence aggregation, and strategic pruning of 25% of transformer layers to enhance efficiency without compromising performance. Evaluated on the ruMTEB benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves state-of-the-art results (69.1 avg. score), outperforming strong baselines with a larger number of parameters.

[36] VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Yupeng Xie,Zhiyang Zhang,Yifan Wu,Sirong Lu,Jiayi Zhang,Zhaoyang Yu,Jinlin Wang,Sirui Hong,Bang Liu,Chenglin Wu,Yuyu Luo

Main category: cs.CL

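TL;DR: This paper proposes VisJudge-Bench, the first comprehensive benchmark for evaluating how well multimodal large language models (MLLMs) assess the aesthetics and quality of visualizations, and introduces VisJudge, a dedicated assessment model that significantly narrows the gap between model and human judgment.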

Details Motivation: Existing multimodal large language models perform well on aesthetic assessment of natural images, but there is no systematic benchmark for evaluating data visualizations. Visualization assessment must jointly consider data encoding accuracy, information expressiveness, and visual aesthetics, which existing models struggle to handle. Method: The authors build VisJudge-Bench, an expert-annotated dataset of 3,090 real-world samples covering 32 chart types across single visualizations, multiple visualizations, and dashboards; they test mainstream MLLMs on it and propose a dedicated model, VisJudge, to improve on them. Result: Even the most advanced MLLMs (such as GPT-5) remain far from human experts on this task (MAE of 0.551, correlation of only 0.429), whereas VisJudge lowers the MAE to 0.442 (a 19.8% reduction) and raises the correlation with human ratings to 0.681 (a 58.7% improvement). Conclusion: VisJudge-Bench provides a reliable benchmark for evaluating MLLMs' ability to judge visualization quality, and VisJudge significantly improves automated assessment, bringing it closer to human expert level. Abstract: Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.
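The two percentages quoted for VisJudge are relative changes against the GPT-5 figures, which a few lines of Python can confirm. The helper name `relative_change` is ours; the four input numbers come from the abstract.

```python
def relative_change(baseline, new):
    """Change from baseline to new, as a fraction of the baseline."""
    return (new - baseline) / baseline


mae_change = relative_change(0.551, 0.442)   # MAE: GPT-5 -> VisJudge
corr_change = relative_change(0.429, 0.681)  # correlation: GPT-5 -> VisJudge

print(f"{mae_change:.1%}", f"{corr_change:.1%}")  # prints "-19.8% 58.7%"
```

Both reported figures are therefore relative, not absolute: the MAE drops by about 0.109 and the correlation rises by about 0.252 in absolute terms.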

[37] Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Federica Gamba,Aman Sinha,Timothee Mickus,Raul Vazquez,Patanjali Bhamidipati,Claudio Savelli,Ahana Chattopadhyay,Laura A. Zanella,Yash Kankanampati,Binesh Arakkal Remesh,Aryan Ashok Chandramania,Rohit Agarwal,Chuyuan Li,Ioana Buhnila,Radhika Mamidi

Main category: cs.CL

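TL;DR: This paper introduces the CAP dataset, a multilingual resource for studying hallucinations produced by large language models in scientific text generation. It covers 900 scientific questions and more than 7,000 answers from 16 publicly available models, with hallucination and fluency annotations.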

Details Motivation: Because large language models lack true comprehension, have limited contextual awareness, and tend toward surface-level generalization in the scientific domain, they readily produce fact-distorting hallucinations, so a dedicated dataset is needed to study and detect the problem. Method: The authors build CAP, a cross-lingual dataset covering five high-resource and four low-resource languages, consisting of manually curated scientific questions and LLM-generated answers; every sample is labeled for factuality errors (hallucinations) and linguistic fluency. Result: CAP contains 900 scientific questions and more than 7,000 LLM-generated answers from 16 publicly available models, provides token sequences and the corresponding logits, and annotates every instance for hallucination and fluency. Conclusion: The release of CAP supports progress in hallucination detection, multilingual LLM evaluation, and more reliable scientific NLP systems. Abstract: We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7,000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.

[38] CHOIR: Collaborative Harmonization fOr Inference Robustness

Xiangjue Dong,Cong Wang,Maria Teleki,Millennium Bismay,James Caverlee

Main category: cs.CL

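TL;DR: Proposes CHOIR, a framework that collaboratively decodes reasoning signals from different persona assignments to make LLM reasoning more robust. It requires no training and yields significant gains across many settings.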

Details Motivation: To explore the reasoning differences triggered by subtle demographic changes in persona assignments and to treat them as an exploitable resource rather than a bias, in order to strengthen the reasoning robustness of large language models. Method: CHOIR is a test-time framework whose collaborative decoding mechanism fuses the reasoning paths of multiple counterfactual personas, dynamically balancing agreement and divergence to produce a unified prediction. Result: Across multiple reasoning benchmarks, CHOIR improves performance across demographics, model architectures, scales, and tasks, with gains of up to 26.4% for individual demographic groups and 19.2% on average across five demographics; it remains effective even when the base persona is suboptimal. Conclusion: By treating persona variation as a constructive signal, CHOIR offers a scalable and general approach to more reliable LLM reasoning. Abstract: Persona-assigned Large Language Models (LLMs) can adopt diverse roles, enabling personalized and context-aware reasoning. However, even minor demographic perturbations in personas, such as simple pronoun changes, can alter reasoning trajectories, leading to divergent sets of correct answers. Instead of treating these variations as biases to be mitigated, we explore their potential as a constructive resource to improve reasoning robustness. We propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a test-time framework that harmonizes multiple persona-conditioned reasoning signals into a unified prediction. CHOIR orchestrates a collaborative decoding process among counterfactual personas, dynamically balancing agreement and divergence in their reasoning paths. Experiments on various reasoning benchmarks demonstrate that CHOIR consistently enhances performance across demographics, model architectures, scales, and tasks, without additional training. Improvements reach up to 26.4% for individual demographic groups and 19.2% on average across five demographics. It remains effective even when base personas are suboptimal. By reframing persona variation as a constructive signal, CHOIR provides a scalable and generalizable approach to more reliable LLM reasoning.

[39] The Tonogenesis Continuum in Tibetan: A Computational Investigation

Siyu Liang,Zhaxi Zerong

Main category: cs.CL

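TL;DR: Proposes a computational method that quantifies the functional role of pitch during tonogenesis by measuring how pitch manipulation affects automatic speech recognition performance.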

Details Motivation: Tonogenesis has traditionally been studied through comparative reconstruction and acoustic phonetics, which do not offer a fine-grained quantification of the functional role of pitch. Method: Automatic speech recognition models are used to analyze how sensitive a range of Tibetan dialects are to pitch-flattening. Result: The analysis reveals a tonogenesis continuum: Amdo dialects tolerate pitch removal the most, U-Tsang varieties degrade severely, and Kham dialects fall in between. Conclusion: Computational methods can capture fine-grained stages of sound change, and traditional functional load metrics based on minimal pairs may overestimate pitch dependence in transitional systems. Abstract: Tonogenesis, the historical process by which segmental contrasts evolve into lexical tone, has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. By analyzing sensitivity to pitch-flattening across a set of closely related Tibetan languages, we find evidence of a tonogenesis continuum: atonal Amdo dialects tolerate pitch removal the most, while fully tonal U-Tsang varieties show severe degradation, and intermediate Kham dialects fall measurably between these extremes. These gradient effects demonstrate how ASR models implicitly learn the shifting functional load of pitch as languages transition from consonant-based to tone-based lexical contrasts. Our findings show that computational methods can capture fine-grained stages of sound change and suggest that traditional functional load metrics, based solely on minimal pairs, may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain phonetically intertwined.

[40] Frustratingly Easy Task-aware Pruning for Large Language Models

Yuanhe Tian,Junjie Liu,Xican Yang,Haishan Ye,Yan Song

Main category: cs.CL

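TL;DR: Proposes an LLM pruning method that combines general and task-specific calibration data and fuses their importance scores to preserve task-specific capabilities.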

Details Motivation: Existing pruning methods focus mainly on preserving a language model's ability to generate fluent sentences and neglect performance on specific tasks. Method: The authors analyze how conventional pruning minimizes loss perturbation under general-domain calibration and introduce task-specific feature distributions into the importance computation. They compute general and task-specific importance scores separately, partition parameters into groups according to activation-norm differences, and fuse the scores to guide pruning. Result: Experiments show that the method consistently outperforms baselines on multiple benchmarks and better preserves task-specific performance at the same pruning ratio. Conclusion: The framework integrates readily with foundation pruning techniques and preserves an LLM's specialized abilities under compression. Abstract: Pruning offers a practical way to reduce the resources required to run large language models (LLMs), retaining their capabilities while controlling training and inference costs. Research on LLM pruning often ranks the importance of LLM parameters using their magnitudes and calibration-data activations and removes (or masks) the less important ones, accordingly reducing LLMs' size. However, these approaches primarily focus on preserving the LLM's ability to generate fluent sentences, while neglecting performance on specific domains and tasks. In this paper, we propose a simple yet effective pruning approach for LLMs that preserves task-specific capabilities while shrinking their parameter space. We first analyze how conventional pruning minimizes loss perturbation under general-domain calibration and extend this formulation by incorporating task-specific feature distributions into the importance computation of existing pruning algorithms. Thus, our framework computes separate importance scores using both general and task-specific calibration data, partitions parameters into shared and exclusive groups based on activation-norm differences, and then fuses their scores to guide the pruning process. This design enables our method to integrate seamlessly with various foundation pruning techniques and preserve the LLM's specialized abilities under compression. Experiments on widely used benchmarks demonstrate that our approach is effective and consistently outperforms the baselines with identical pruning ratios and different settings.
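To make the score-fusion idea in this abstract concrete, here is a rough Python sketch. The abstract does not give the formulas, so every rule below is an assumption: importance is taken as weight magnitude times activation norm, the shared/exclusive split uses a hypothetical relative-difference threshold `tau`, shared scores are averaged, and exclusive scores take the larger of the two. The weights and norms are made-up toy values.

```python
def importance(weights, act_norms):
    """Assumed criterion: |weight| times the norm of its input activation."""
    return [abs(w) * n for w, n in zip(weights, act_norms)]


def fuse_scores(weights, general_norms, task_norms, tau=0.5):
    """Fuse general and task-specific scores (hypothetical fusion rule)."""
    general = importance(weights, general_norms)
    task = importance(weights, task_norms)
    fused = []
    for g, t, gn, tn in zip(general, task, general_norms, task_norms):
        rel_diff = abs(gn - tn) / max(gn, tn, 1e-12)
        if rel_diff <= tau:
            fused.append(0.5 * (g + t))  # "shared": both calibration sets agree
        else:
            fused.append(max(g, t))      # "exclusive": keep the stronger signal
    return fused


def prune_mask(scores, ratio):
    """True marks a weight to keep; the lowest-scoring `ratio` share is dropped."""
    k = int(len(scores) * ratio)
    drop = set(sorted(range(len(scores)), key=scores.__getitem__)[:k])
    return [i not in drop for i in range(len(scores))]


# Toy values: the third weight matters only on the task-specific data.
weights = [0.9, -0.2, 0.5, 0.1]
general_norms = [1.0, 1.0, 0.1, 1.0]
task_norms = [1.0, 0.8, 2.0, 1.0]

general_only = prune_mask(importance(weights, general_norms), 0.5)
task_aware = prune_mask(fuse_scores(weights, general_norms, task_norms), 0.5)
```

With general calibration alone, the third weight is pruned because its general-domain activation is small; the fused scores keep it and drop the second weight instead, which is the kind of behavior the abstract attributes to task-aware pruning.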

[41] The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR

Siyu Liang,Nicolas Ballier,Gina-Anne Levow,Richard Wright

Main category: cs.CL

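TL;DR: Studies Whisper's sub-token decoding behavior across 49 languages and finds that the number of discovered sub-tokens depends little on training data volume and more on a language's statistical, typological, and orthographic structure. The paper introduces the notion of acoustic saturation time (AST), a convergence threshold beyond which additional audio activates few new sub-tokens.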

Details Motivation: To examine whether differences in training data across languages affect how a multilingual ASR model uses sub-token units at inference time, and to understand how much audio is needed to fully observe the sub-token inventory the model has learned. Method: The authors analyze Whisper's inference on 49 languages, log candidate sub-tokens during decoding and track their cumulative discovery, fit sub-token discovery curves, and analyze metrics such as rank-frequency distributions and mean sub-token length. Result: The total number of sub-tokens is largely unaffected by pre-training data volume; discovery rates follow an exponential saturation trend, which motivates acoustic saturation time (AST); rank-frequency distributions are better described by a Zipf-Mandelbrot law; mean sub-token length correlates positively with resource level; and languages written in Latin script show more favorable patterns than those in scripts such as Cyrillic, CJK, and Semitic. Conclusion: Sub-token utilization in multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of speech than by training data scale, which provides an empirical basis for more equitable corpus construction and cross-lingual evaluation. Abstract: How much audio is needed to fully observe a multilingual ASR model's learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper's decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model's sub-token space. Results show that the total number of discovered tokens remains largely independent of a language's pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model's hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new sub-token activation. We refer to this convergence threshold as acoustic saturation time (AST). Further analyses of rank-frequency distributions reveal Zipf-like patterns better modeled by a Zipf-Mandelbrot law, and mean sub-token length shows a positive correlation with resource level. Additionally, those metrics show more favorable patterns for languages in the Latin script than those in scripts such as Cyrillic, CJK, and Semitic. Together, our study suggests that sub-token utilization during multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of the speech than by training data scale, providing an empirical basis for more equitable corpus construction and cross-lingual evaluation.
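The "exponential saturation" behind AST can be pictured with a simple curve. The sketch below assumes the common form N(t) = N_max (1 - exp(-t / tau)) and defines the saturation time as the point where 95% of N_max has been discovered; the abstract gives neither the exact functional form nor the threshold, so both are illustrative assumptions, as are the example values for `n_max` and `tau`.

```python
import math


def discovered(t, n_max, tau):
    """Cumulative sub-tokens seen after t units of audio (assumed saturation form)."""
    return n_max * (1.0 - math.exp(-t / tau))


def saturation_time(tau, coverage=0.95):
    """Time at which `coverage` of n_max has been discovered (assumed AST definition)."""
    return -tau * math.log(1.0 - coverage)


tau = 10.0  # hypothetical time constant
ast = saturation_time(tau)
seen_at_ast = discovered(ast, n_max=1000, tau=tau)
```

Under these assumptions the saturation time is roughly three time constants, and it depends only on `tau`, not on the inventory size `n_max`.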

[42] A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus

Michael Scott,Siyu Liang,Alicia Wassink,Gina-Anne Levow

Main category: cs.CL

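TL;DR: This study systematically evaluates racial bias in four major commercial speech recognition systems and finds that dialectal phonetic variation, especially differences in vowel quality, is the main reason for the higher recognition error rates observed for African American speakers.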

Details Motivation: To reveal performance differences of commercial automatic speech recognition systems across ethnic groups and their linguistic roots, in particular the effect of sociophonetic variation on recognition accuracy. Method: Using the Pacific Northwest English corpus, the authors analyze transcription accuracy for speakers from four ethnic backgrounds (African American, Caucasian American, ChicanX, and Yakama), introduce a Phonetic Error Rate (PER) metric based on sociophonetic annotation, and examine how eleven sociophonetic features affect recognition errors. Result: All systems show the highest error rates for African American speakers, and vowel quality variation (such as resistance to the low-back merger and pre-nasal merger patterns) is significantly associated with differences in error rates across ethnic groups. Conclusion: Inadequate acoustic modeling of dialectal phonetic variation is a primary source of racial bias in commercial ASR systems, and sociophonetic diversity needs to be better represented in training data to improve fairness. Abstract: This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems using the Pacific Northwest English (PNWE) corpus. We analyze transcription accuracy across speakers from four ethnic backgrounds (African American, Caucasian American, ChicanX, and Yakama) and examine how sociophonetic variation contributes to differential system performance. We introduce a heuristically-determined Phonetic Error Rate (PER) metric that links recognition errors to specific linguistically motivated variables derived from sociophonetic annotation. Our analysis of eleven sociophonetic features reveals that vowel quality variation, particularly resistance to the low-back merger and pre-nasal merger patterns, is systematically associated with differential error rates across ethnic groups, with the most pronounced effects for African American speakers across all evaluated systems. These findings demonstrate that acoustic modeling of dialectal phonetic variation, rather than lexical or syntactic factors, remains a primary source of bias in commercial ASR systems. The study establishes the PNWE corpus as a valuable resource for bias evaluation in speech technologies and provides actionable guidance for improving ASR performance through targeted representation of sociophonetic diversity in training data.

[43] Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection

Noshitha Padma Pratyusha Juttu,Sahithi Singireddy,Sravani Gona,Sujal Timilsina

Main category: cs.CL

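TL;DR: This study systematically evaluates full fine-tuning, parameter-efficient fine-tuning (LoRA/QLoRA), and zero-shot prompting for detecting unfair clauses in Terms of Service. Full fine-tuning gives the best balance of precision and recall, while LoRA-based methods offer competitive recall at substantially lower memory cost.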

Details Motivation: Adapting large language models to the legal domain is constrained by the high cost of full fine-tuning, so more efficient adaptation methods need to be explored. Method: BERT and DistilBERT are fully fine-tuned; 4-bit LoRA is applied to models such as TinyLlama, LLaMA 3B/7B, and SaulLM; and GPT-4o and O-versions are evaluated in a zero-shot setting. Result: Experiments on CLAUDETTE-ToS and the Multilingual Scraper Corpus show that full fine-tuning achieves the best precision-recall balance, while LoRA-based methods reach comparable recall with up to 3x lower memory cost. Conclusion: The study exposes the efficiency-performance trade-offs in adapting language models to the legal domain and provides open baselines that support fine-tuning research in legal text processing. Abstract: Large Language Models (LLMs) have transformed text understanding, yet their adaptation to specialized legal domains remains constrained by the cost of full fine-tuning. This study provides a systematic evaluation of fine-tuning, parameter-efficient adaptation (LoRA, QLoRA), and zero-shot prompting strategies for unfair clause detection in Terms of Service (ToS) documents, a key application in legal NLP. We fine-tune BERT and DistilBERT, apply 4-bit Low-Rank Adaptation (LoRA) to models such as TinyLlama, LLaMA 3B/7B, and SaulLM, and evaluate GPT-4o and O-versions in zero-shot settings. Experiments on the CLAUDETTE-ToS benchmark and the Multilingual Scraper Corpus show that full fine-tuning achieves the strongest precision-recall balance, while LoRA-based models provide competitive recall with up to 3x lower memory cost. These findings highlight practical design trade-offs for efficient and domain-adapted LLMs, contributing open baselines for fine-tuning research in legal text processing.

[44] LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?

Ziyuan He,Yuxuan Wang,Jiaqi Li,Kexin Liang,Muhan Zhang

Main category: cs.CL

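TL;DR: This paper presents LooGLE v2, a new benchmark for evaluating LLMs' long-context understanding in real-world scenarios. It spans law, finance, games, and code, with 10 types of domain-specific long-dependency tasks and 1,934 QA instances. Even the best-performing model scores only 59.2% overall, revealing significant limitations of current LLMs on practical long-dependency tasks.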

Details Motivation: Although LLM context windows keep growing, their long-dependency understanding in real application scenarios remains clearly inadequate, and targeted benchmarks are lacking. Method: The authors build a dataset of real long texts ranging from 16k to 2M tokens, design 10 types of domain-specific long-dependency tasks, generate 1,934 diverse and complex QA samples through a scalable data curation pipeline, and comprehensively evaluate 6 locally deployed and 4 API-based LLMs. Result: Current mainstream LLMs perform poorly overall on the benchmark; the best model scores only 59.2%, indicating that the context length models can effectively understand is far shorter than their advertised windows. Conclusion: Existing LLMs have serious limitations on real-world long-context tasks, and their long-range dependency understanding and practical capabilities need substantial improvement. Abstract: Large language models (LLMs) have recently been equipped with increasingly long context windows, yet their long context understanding capabilities over long dependency tasks remain fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that were rarely benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark designed to evaluate LLMs' long context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, encompassing domains in law, finance, game and code. Accordingly, we carefully design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances with various diversity and complexity in a scalable data curation pipeline for further practical needs. We conduct a comprehensive assessment of 6 locally deployed and 4 API-based LLMs. The evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark. Despite the extensive context windows, popular LLMs are only capable of understanding a much shorter length of context than they claim, revealing significant limitations in their ability to handle real-world tasks with long dependencies and highlighting substantial room for model improvement in practical long-context understanding.

[45] SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size

Jinhan Chen,Jianchun Liu,Hongli Xu,Xianjun Gao,Shilong Wang

Main category: cs.CL

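TL;DR: SABlock is a semantic-aware KV cache eviction framework that uses semantic segmentation and adaptive block sizes to balance compression efficiency with semantic integrity, significantly reducing memory usage and speeding up inference in long-context settings.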

Details Motivation: Existing KV cache compression methods struggle to balance semantic coherence and memory efficiency, and their compression boundaries often do not align with linguistic structure, which causes critical information to be lost. Method: SABlock first performs semantic segmentation to align with linguistic structure, then applies segment-guided token scoring to refine importance estimation, and finally uses a budget-driven search strategy to adaptively determine the optimal block size for each segment. Result: On long-context benchmarks, SABlock outperforms state-of-the-art methods under the same memory budget. On the NIAH task it reaches 99.9% accuracy with only 96 KV entries (the full cache retains up to 8K); at a 128K context length it reduces peak memory by 46.28% and speeds up decoding by up to 9.5x. Conclusion: SABlock effectively balances semantic preservation and memory efficiency in KV cache compression and significantly improves the scalability of long-context LLM inference. Abstract: The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks demonstrate that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding on a 128K context length.

[46] A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback

Zhifeng Wang,Xinyue Zheng,Chunyan Zeng

Main category: cs.CL

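TL;DR: This paper proposes EduLoop-Agent, an end-to-end personalized learning agent that integrates a neural cognitive diagnosis model, an adaptive testing strategy, and large language models into a closed-loop "Diagnosis-Recommendation-Feedback" framework that improves the personalization and efficiency of learning.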

Details Motivation: Existing personalized learning methods usually treat modeling, item selection, and feedback in isolation, which leads to coarse student models, limited adaptivity, and untargeted feedback; a closed-loop system is needed to deliver fine-grained, interpretable, and efficient personalized learning. Method: EduLoop-Agent has three core components: Neural Cognitive Diagnosis (NCD) for assessing mastery at the knowledge-point level; the BECAT strategy, which dynamically selects the most relevant items to optimize learning efficiency; and large language models (LLMs), which turn diagnostic results into structured, actionable feedback. Together they form a closed "Diagnosis-Recommendation-Feedback" loop. Result: Experiments on the ASSISTments dataset show that NCD performs well on response prediction and interpretability, BECAT improves item relevance and personalization, and LLM-generated feedback gives precise study guidance. The overall system generates personalized learning paths efficiently and is deployable. Conclusion: By integrating diagnosis, recommendation, and feedback modules, EduLoop-Agent builds an effective closed-loop personalized learning framework with good practical prospects and offers a feasible way to generate individualized learning trajectories in intelligent education. Abstract: As information technology advances, education is moving from one-size-fits-all instruction toward personalized learning. However, most methods handle modeling, item selection, and feedback in isolation rather than as a closed loop. This leads to coarse or opaque student models, assumption-bound adaptivity that ignores diagnostic posteriors, and generic, non-actionable feedback. To address these limitations, this paper presents an end-to-end personalized learning agent, EduLoop-Agent, which integrates a Neural Cognitive Diagnosis model (NCD), a Bounded-Ability Estimation Computerized Adaptive Testing strategy (BECAT), and large language models (LLMs). The NCD module provides fine-grained estimates of students' mastery at the knowledge-point level; BECAT dynamically selects subsequent items to maximize relevance and learning efficiency; and LLMs convert diagnostic signals into structured, actionable feedback. Together, these components form a closed-loop framework of "Diagnosis-Recommendation-Feedback." Experiments on the ASSISTments dataset show that the NCD module achieves strong performance on response prediction while yielding interpretable mastery assessments. The adaptive recommendation strategy improves item relevance and personalization, and the LLM-based feedback offers targeted study guidance aligned with identified weaknesses. Overall, the results indicate that the proposed design is effective and practically deployable, providing a feasible pathway to generating individualized learning trajectories in intelligent education.

[47] Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems

Kaushal Kumar Maurya,Ekaterina Kochmar

Main category: cs.CL

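TL;DR: This paper discusses the challenges of evaluating Intelligent Tutoring Systems (ITSs) in Artificial Intelligence in Education (AIED) and proposes three feasible research directions, grounded in learning science principles, toward fair, unified, and scalable evaluation methods.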

Details Motivation: Without reliable, widely accepted, pedagogy-driven evaluation frameworks, the progress and impact of LLM-powered intelligent tutoring systems are hard to track. Method: The authors review existing evaluation practices, draw on real-world case studies to analyze the problems in current evaluation of educational dialogue-based ITSs, and make recommendations based on prior interdisciplinary AIED research. Result: The review shows that current ITS evaluation relies on subjective protocols and non-standardized benchmarks, which leads to inconsistency and limited generalizability. Conclusion: The paper proposes three future research directions rooted in learning science principles, aimed at a fairer, more unified, and more scalable ITS evaluation methodology. Abstract: The interdisciplinary research domain of Artificial Intelligence in Education (AIED) has a long history of developing Intelligent Tutoring Systems (ITSs) by integrating insights from technological advancements, educational theories, and cognitive psychology. The remarkable success of generative AI (GenAI) models has accelerated the development of large language model (LLM)-powered ITSs, which have the potential to imitate human-like, pedagogically rich, and cognitively demanding tutoring. However, the progress and impact of these systems remain largely untraceable due to the absence of reliable, universally accepted, and pedagogy-driven evaluation frameworks and benchmarks. Most existing educational dialogue-based ITS evaluations rely on subjective protocols and non-standardized benchmarks, leading to inconsistencies and limited generalizability. In this work, we take a step back from mainstream ITS development and provide a comprehensive review of state-of-the-art evaluation practices, highlighting associated challenges through real-world case studies from careful and caring AIED research. Finally, building on insights from previous interdisciplinary AIED research, we propose three practical, feasible, and theoretically grounded research directions, rooted in learning science principles and aimed at establishing fair, unified, and scalable evaluation methodologies for ITSs.

[48] AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment

Dario Loi,Elena Maria Muià,Federico Siciliano,Giovanni Trappolini,Vincenzo Crisà,Peter Kruger,Fabrizio Silvestri

Main category: cs.CL

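TL;DR: AutoBench is an automated framework that evaluates large language models dynamically through reciprocal assessment among models, making it more adaptable and more resistant to contamination than static benchmarks.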

Details Motivation: Traditional static benchmarks suffer from test-set contamination and poor adaptability, so a more dynamic and sustainable evaluation method is needed. Method: In AutoBench, models take turns acting as question generators, contestants, and judges; an iterative weighting mechanism aggregates peer scores from multiple models into a consensus ranking. Result: Experiments show that AutoBench correlates well with benchmarks such as MMLU-Pro and GPQA (78% and 63%, respectively), and the multi-judge design outperforms single-judge baselines. Conclusion: AutoBench offers a scalable, contamination-resistant paradigm for dynamic evaluation, suited to continuously assessing evolving language models. Abstract: We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L. Unlike static benchmarks that suffer from test-set contamination and limited adaptability, AutoBench dynamically generates novel evaluation tasks while models alternately serve as question generators, contestants, and judges across diverse domains. An iterative weighting mechanism amplifies the influence of consistently reliable evaluators, aggregating peer judgments into consensus-based rankings that reflect collective model agreement. Our experiments demonstrate strong correlations with established benchmarks including MMLU-Pro and GPQA (78% and 63%, respectively), validating this peer-driven evaluation paradigm. The multi-judge design significantly outperforms single-judge baselines, confirming that distributed evaluation produces more robust and human-consistent assessments. AutoBench offers a scalable, contamination-resistant alternative to static benchmarks for the continuous evaluation of evolving language models.
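One way to picture an iterative weighting mechanism that "amplifies the influence of consistently reliable evaluators" is the small Python sketch below. It is an assumption-laden illustration, not AutoBench's actual algorithm: the reliability rule (inverse of one plus a judge's mean absolute deviation from the consensus), the fixed number of rounds, the inclusion of self-scores, and the toy score table are all hypothetical.

```python
def consensus_scores(scores, weights):
    """Weighted average of the judges' scores for each model."""
    models = next(iter(scores.values()))
    return {m: sum(weights[j] * scores[j][m] for j in scores) for m in models}


def consensus_ranking(scores, rounds=10):
    """scores[judge][model] is a numeric grade; returns (ranking, judge weights)."""
    weights = {j: 1.0 / len(scores) for j in scores}  # start from uniform trust
    for _ in range(rounds):
        consensus = consensus_scores(scores, weights)
        raw = {}
        for judge, given in scores.items():
            # Mean absolute deviation of this judge from the current consensus.
            mad = sum(abs(given[m] - consensus[m]) for m in given) / len(given)
            raw[judge] = 1.0 / (1.0 + mad)  # hypothetical reliability rule
        total = sum(raw.values())
        weights = {j: r / total for j, r in raw.items()}
    final = consensus_scores(scores, weights)
    ranking = sorted(final, key=final.get, reverse=True)
    return ranking, weights


# Toy table: models A, B, and C also act as judges; judge C is an outlier.
scores = {
    "A": {"A": 4.0, "B": 3.0, "C": 2.0},
    "B": {"A": 4.2, "B": 3.1, "C": 1.9},
    "C": {"A": 1.0, "B": 2.0, "C": 5.0},
}
ranking, weights = consensus_ranking(scores)
```

In this toy table a plain average would rank model C second because of its own inflated self-score; after reweighting, judge C carries the least weight and the consensus order follows the two judges that agree with each other.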

[49] Personal Care Utility (PCU): Building the Health Infrastructure for Everyday Insight and Guidance

Mahyar Abbasian,Ramesh Jain

Main category: cs.CL

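TL;DR: Proposes the Personal Care Utility (PCU), an AI-based system for lifelong health guidance that integrates multimodal data and real-time analysis to provide personalized health information, behavior guidance, and interpretation of treatment response.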

Details Motivation: Building on digital infrastructure and biomedical innovation, to overcome the limitations of traditional episodic care and enable continuous, personalized health management. Method: The PCU architecture combines multimodal agents, event-centric modeling, and contextual inference with personal sensing, experiential computing, and population-level analytics. Result: PCU can monitor, interpret, and guide health decisions in daily life in real time, improving individual health outcomes and providing a new foundation for public health and scientific discovery. Conclusion: PCU represents an emerging paradigm of health care; as an ambient, adaptive companion, it has the potential to transform personal health management and public health. Abstract: Building on decades of success in digital infrastructure and biomedical innovation, we propose the Personal Care Utility (PCU), a cybernetic system for lifelong health guidance. PCU is conceived as a global, AI-powered utility that continuously orchestrates multimodal data, knowledge, and services to assist individuals and populations alike. Drawing on multimodal agents, event-centric modeling, and contextual inference, it offers three essential capabilities: (1) trusted health information tailored to the individual, (2) proactive health navigation and behavior guidance, and (3) ongoing interpretation of recovery and treatment response after medical events. Unlike conventional episodic care, PCU functions as an ambient, adaptive companion, observing, interpreting, and guiding health in real time across daily life. By integrating personal sensing, experiential computing, and population-level analytics, PCU promises not only improved outcomes for individuals but also a new substrate for public health and scientific discovery. We describe the architecture, design principles, and implementation challenges of this emerging paradigm.

[50] PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion

Morteza Alikhani,Mohammadtaha Bagherifard,Erfan Zinvandi,Mehran Sarmadi

Main category: cs.CL

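TL;DR: PerCoR is the first large-scale Persian commonsense reasoning benchmark, with 106K multiple-choice sentence-completion problems. It uses a conjunction-based segmentation strategy and the DRESS-AF method to generate challenging distractors, which significantly raises the dataset's difficulty.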

Details Motivation: To fill the gap in Persian commonsense reasoning benchmarks and advance research on low-resource languages in this area. Method: The authors propose a conjunction-based segmentation strategy to generate sentence-completion pairs and design DRESS-AF, a generation-free adversarial filtering method that selects highly confusable distractors from real continuations. Result: Human annotators score 89%, OpenAI-o3 reaches 92.18%, and the strongest open-source model, DeepSeek-R1, reaches 82.51%; DRESS-AF also transfers to the English HellaSwag benchmark and increases its difficulty. Conclusion: PerCoR is a challenging Persian commonsense reasoning benchmark that exposes the limitations of existing models and provides a new resource for multilingual commonsense reasoning. Abstract: We introduce PerCoR (Persian Commonsense Reasoning), the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural, and other web sources. We introduce a novel conjunction-based segmentation strategy to generate coherent sentence-completion pairs, enabling broad topical and structural diversity. To create challenging distractors, we propose DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering), a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset's difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at https://huggingface.co/datasets/MCINext/PerCoR.

[51] Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal

Ambalika Guha,Sajal Saha,Debanjan Ballav,Soumi Mitra,Hritwick Chakraborty

Main category: cs.CL

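TL;DR: This paper proposes a sustainable model that combines traditional linguistic methods with AI to preserve the endangered Toto language. It develops a trilingual (Toto-Bangla-English) learning application and builds a Small Language Model and a translation engine on top of Unicode script support and a structured corpus.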

Details Motivation: To save the endangered Toto language, preserve the distinct worldview it carries, and improve its accessibility and usability in the digital age. Method: Language data are collected through fieldwork and used to build a morpheme-tagged trilingual corpus; the project adds Unicode script support, trains a Small Language Model (SLM) and a Transformer-based translation system, and designs a language learning application for native speakers and non-native learners. Result: The project produced a structured trilingual corpus for Toto, standardized the script, and developed language learning tools and a translation engine intended to increase digital use of the Toto script in the community. Conclusion: Combining traditional linguistic research with modern AI can support the digital archiving and revitalization of endangered languages and offers a replicable interdisciplinary model for community-driven language preservation. Abstract: Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is part of a project that aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by incorporating traditional linguistic methodology with AI. This bridge between linguistic research and technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.

[52] Culturally Grounded Physical Commonsense Reasoning in Italian and English: A Submission to the MRL 2025 Shared Task

Marco De Santis,Lisa Alazraki

Main category: cs.CL

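TL;DR: This paper introduces FormaMentis, a new physical commonsense reasoning benchmark grounded in Italian language and culture, intended to provide multilingual physical reasoning evaluation data for languages other than English.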

Details Motivation: To extend evaluation resources for physical commonsense reasoning to non-English languages, in particular with a dataset that reflects the Italian linguistic and cultural context. Method: Expert annotators who are native Italian speakers and familiar with local customs create the data samples, which are then translated into English while preserving culturally specific Italian elements. Result: A new benchmark, FormaMentis, containing the original Italian data and English translations that retain the cultural features. Conclusion: FormaMentis is a useful resource for multilingual physical commonsense reasoning and supports cross-lingual and cross-cultural research. Abstract: This paper presents our submission to the MRL 2025 Shared Task on Multilingual Physical Reasoning Datasets. The objective of the shared task is to create manually-annotated evaluation data in the physical commonsense reasoning domain, for languages other than English, following a format similar to PIQA. Our contribution, FormaMentis, is a novel benchmark for physical commonsense reasoning that is grounded in Italian language and culture. The data samples in FormaMentis are created by expert annotators who are native Italian speakers and are familiar with local customs and norms. The samples are additionally translated into English, while preserving the cultural elements unique to the Italian context.

[53] Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion

Zilong Wang,Qingtian Zeng,Hua Duan,Cheng Cheng,Minghao Zou,Ziyang Wang

Main category: cs.CL

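TL;DR: Proposes CR-FKGC, a new few-shot knowledge graph completion framework that models conjugate relations to better capture complex relational patterns and mitigate data sparsity.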

Details Motivation: Existing few-shot knowledge graph completion methods struggle to capture complex relational patterns and are strongly affected by data sparsity. Method: A neighborhood aggregation encoder integrates higher-order neighbor information; a conjugate relation learner combines an implicit conditional diffusion relation module with a stable relation module; and a manifold conjugate decoder infers missing triples in manifold space. Result: Experiments on three benchmark datasets show that the method outperforms current state-of-the-art methods. Conclusion: CR-FKGC improves few-shot knowledge graph completion, particularly in modeling complex relations and handling data sparsity. Abstract: Few-shot Knowledge Graph Completion (FKGC) infers missing triples from limited support samples, tackling long-tail distribution challenges. Existing methods, however, struggle to capture complex relational patterns and mitigate data sparsity. To address these challenges, we propose a novel FKGC framework for conjugate relation modeling (CR-FKGC). Specifically, it employs a neighborhood aggregation encoder to integrate higher-order neighbor information, a conjugate relation learner combining an implicit conditional diffusion relation module with a stable relation module to capture stable semantics and uncertainty offsets, and a manifold conjugate decoder for efficient evaluation and inference of missing triples in manifold space. Experiments on three benchmarks demonstrate that our method achieves superior performance over state-of-the-art methods.

[54] Rule-Based Explanations for Retrieval-Augmented LLM Systems

Joel Rorseth,Parke Godfrey,Lukasz Golab,Divesh Srivastava,Jarek Szlichta

Main category: cs.CL

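TL;DR: Proposes a new approach that uses if-then rules to explain large language models (LLMs) with retrieval-augmented generation (RAG), together with optimizations for generating the explanatory rules efficiently.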

Details Motivation: LLMs are increasingly complex and need explainability methods; because RAG introduces external information sources, it creates a new opportunity to explain output provenance through rules. Method: An optimization inspired by Apriori-style pruning avoids brute-force enumeration of all source combinations and efficiently generates if-then rules that explain LLM outputs. Result: Experiments show that the method produces meaningful explanatory rules and substantially speeds up rule generation. Conclusion: This is the first work to apply if-then rules to explain RAG-based LLMs, offering a practical and efficient explainability tool for understanding their behavior. Abstract: If-then rules are widely used to explain machine learning models; e.g., "if employed = no, then loan application = rejected." We present the first proposal to apply rules to explain the emerging class of large language models (LLMs) with retrieval-augmented generation (RAG). Since RAG enables LLM systems to incorporate retrieved information sources at inference time, rules linking the presence or absence of sources can explain output provenance; e.g., "if a Times Higher Education ranking article is retrieved, then the LLM ranks Oxford first." To generate such rules, a brute force approach would probe the LLM with all source combinations and check if the presence or absence of any sources leads to the same output. We propose optimizations to speed up rule generation, inspired by Apriori-like pruning from frequent itemset mining but redefined within the scope of our novel problem. We conclude with qualitative and quantitative experiments demonstrating our solutions' value and efficiency.
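The idea of probing source combinations and pruning redundant candidates can be illustrated with a minimal Python sketch. It is not the authors' algorithm: `toy_llm`, the source names, and the specific pruning rule (skip any candidate that is a superset of an already-found rule antecedent) are assumptions made for illustration only.

```python
from itertools import combinations

def toy_llm(sources):
    """Hypothetical stand-in for a RAG system: answers 'Oxford' iff source 'the' is retrieved."""
    return "Oxford" if "the" in sources else "MIT"

def always_yields(antecedent, all_sources, target, llm):
    """Brute-force check: every retrieval set that contains `antecedent` yields `target`."""
    rest = [s for s in all_sources if s not in antecedent]
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            if llm(frozenset(antecedent) | frozenset(extra)) != target:
                return False
    return True

def minimal_rules(all_sources, target, llm):
    """Find minimal source sets S such that 'if S is retrieved, then output = target'."""
    rules = []
    for size in range(1, len(all_sources) + 1):
        for cand in combinations(sorted(all_sources), size):
            cand = frozenset(cand)
            # Apriori-style pruning (assumed form): a superset of a known rule is redundant.
            if any(rule <= cand for rule in rules):
                continue
            if always_yields(cand, all_sources, target, llm):
                rules.append(cand)
    return rules

sources = {"the", "qs", "blog"}  # hypothetical source identifiers
rules = minimal_rules(sources, "Oxford", toy_llm)
print(rules)
```

Because candidates are enumerated by increasing size, any rule found is minimal, and the pruning step saves the LLM probes that its supersets would otherwise require.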

[55] SALSA: Single-pass Autoregressive LLM Structured Classification

Ruslan Berdichevsky,Shai Nahum-Gefen,Elad Ben Zaken

Main category: cs.CL

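TL;DR: SALSA is a classification method that combines structured prompting, class-to-token mapping, and parameter-efficient fine-tuning, substantially improving LLM performance on text classification tasks.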

Details Motivation: Instruction-tuned LLMs often underperform on text classification benchmarks, so a more effective classification method is needed. Method: Each class label is mapped to a distinct output token, structured prompts elicit a single-token response, and at inference the output is projected only onto the logits of the relevant class tokens. Result: SALSA achieves state-of-the-art results on multiple benchmarks and shows good robustness and scalability. Conclusion: SALSA improves the accuracy and efficiency of LLMs on text classification and applies to a broad range of classification tasks. Abstract: Despite their impressive generalization capabilities, instruction-tuned Large Language Models often underperform on text classification benchmarks. We introduce SALSA, a coherent pipeline that combines structured prompting, class-to-token mapping, and parameter-efficient fine-tuning, thereby avoiding cold-start training. Each class label is mapped to a distinct output token, and prompts are constructed to elicit a single-token response. During inference, the model's output is projected only onto the logits of the relevant class tokens, enabling efficient and accurate classification in a single forward pass. SALSA achieves state-of-the-art results across diverse benchmarks, demonstrating its robustness and scalability for LLM-based classification applications.
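The projection onto class-token logits can be sketched in a few lines of Python. This is a minimal illustration, not SALSA's implementation: the toy vocabulary, the logit values, and the `class_token_ids` mapping are all assumed.

```python
import math

def classify_from_logits(logits, class_token_ids):
    """Keep only the logits of class tokens, then apply a softmax over that subset."""
    selected = {label: logits[tid] for label, tid in class_token_ids.items()}
    peak = max(selected.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in selected.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical next-token logits over a 5-token vocabulary.
logits = [0.1, 2.0, -1.0, 5.0, 1.5]
# Hypothetical class-to-token mapping: token 1 means "positive", token 4 means "negative".
class_token_ids = {"positive": 1, "negative": 4}

label, probs = classify_from_logits(logits, class_token_ids)
print(label, probs)
```

Token 3 has the largest raw logit but is not a class token, so it is ignored; the decision is made only between the mapped tokens, which is what lets a single forward pass produce a valid label.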

[56] $\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker

Qi Liu,Yanzhao Zhang,Mingxin Li,Dingkun Long,Pengjun Xie,Jiaxin Mao

Main category: cs.CL

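TL;DR: This paper proposes E²Rank, a simple yet effective unified framework that extends a single text embedding model, through continued training under a listwise ranking objective, to perform both high-quality retrieval and listwise reranking, substantially improving reranking performance while remaining efficient.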

Details Motivation: Text embedding models are efficient for retrieval, but their ranking fidelity lags behind LLM-based listwise rerankers; the paper aims to close this gap with a unified framework that balances efficiency and accuracy. Method: E²Rank uses a listwise ranking prompt built from the query and its candidate documents as an enhanced query, applies cosine similarity as a unified ranking function, and continues training the base embedding model under a listwise ranking objective. Result: E²Rank achieves state-of-the-art results on the BEIR reranking benchmark and competitive results on BRIGHT with very low reranking latency; it also improves embedding performance on MTEB. Conclusion: A single embedding model can unify retrieval and reranking, achieving competitive ranking accuracy while remaining computationally efficient. Abstract: Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework, $\text{E}^2\text{Rank}$, which stands for Efficient Embedding-based Ranking (and also Embedding-to-Rank). It extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, $\text{E}^2\text{Rank}$ achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
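The scoring scheme described above, in which the listwise prompt acts as an enhanced query and cosine similarity ranks the candidates, can be sketched as follows. This is only an illustration: `toy_embed` is a bag-of-words stand-in for a trained embedding model, and the prompt format (plain concatenation) is an assumption, not the paper's template.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rerank(query, docs, top_k=2):
    """Embed the query together with the top-K candidates, then rank all docs by cosine."""
    listwise_prompt = " ".join([query] + docs[:top_k])  # assumed prompt format
    q_vec = toy_embed(listwise_prompt)
    return sorted(docs, key=lambda d: cosine(q_vec, toy_embed(d)), reverse=True)

docs = [
    "sorting lists in python",
    "python list sort method",
    "cooking pasta recipes",
]
ranked = rerank("python sorting", docs)
print(ranked)
```

Document embeddings are computed independently of the query, so they can be cached from the retrieval stage; only the enhanced query needs a new forward pass, which is where the low reranking latency would come from.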

[57] Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

Eeham Khan,Firas Saidani,Owen Van Esbroeck,Richard Khoury,Leila Kosseim

Main category: cs.CL

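TL;DR: This study examines whether continual pre-training (CPT) combined with low-rank adaptation (LoRA) can adapt large language models to a low-resource dialect (Québec French) under tight data and compute budgets. With fewer than 1% of parameters updated, the models improve on dialect tasks with minimal regression on the standard language. The study highlights the importance of corpus composition and releases the first Québec French LLMs.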

Details Motivation: LLM capabilities are concentrated in high-resource languages, while low-resource dialects suffer from data scarcity, which contributes to linguistic inequality; the paper explores how to improve LLM support for minority dialects at low cost. Method: Parameter-efficient adaptation with CPT and LoRA is applied to three LLMs using a very small Québec French dataset, and the models are evaluated on the COLE benchmark suite. Result: The approach improves performance on the Québec French dialect benchmarks with only minimal regression on the standard French benchmarks, while updating under 1% of model parameters; the gains depend strongly on the composition of the training corpus. Conclusion: CPT combined with parameter-efficient fine-tuning such as LoRA is a cost-effective and sustainable way to narrow the gap between mainstream languages and minority dialects and to extend high-quality LLM access to minority linguistic communities. Abstract: Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks with under 1% of model parameters updated. Analysis of the results demonstrates that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. We release the first Québec French LLMs on HuggingFace.
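To see why LoRA can stay under 1% of parameters, it helps to count them for a single weight matrix. The sketch below uses the standard LoRA parameterization (a rank-r update B·A added to a frozen d_out × d_in matrix); the dimensions and rank are assumed values, not the paper's configuration.

```python
def lora_fraction(d_in, d_out, rank):
    """Trainable-parameter fraction when a d_out x d_in matrix gets a rank-`rank` LoRA update."""
    full_params = d_in * d_out           # frozen base weight W
    lora_params = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return lora_params / full_params

# Hypothetical sizes: a 4096 x 4096 projection adapted with rank 8.
frac = lora_fraction(4096, 4096, 8)
print(f"{frac:.4%}")
```

The overall fraction for a full model also depends on which matrices receive adapters, so this per-matrix figure is only an upper-level intuition for the "under 1%" claim.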

[58] Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models

Anooshka Bajaj,Deven Mahesh Mistry,Sahaj Singh Maini,Yash Aggarwal,Zoran Tiganj

Main category: cs.CL

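TL;DR: This study examines how large language models (LLMs) differentiate and retrieve temporally separated events during in-context learning. When predicting the next token, models show a marked bias toward tokens near the beginning or end of the sequence; in transformers this is linked to induction heads, and the temporal bias appears across architectures.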

Details Motivation: To understand how LLMs, by analogy with human episodic memory, retrieve contextual information through temporal separation, and to clarify the roles of temporal and semantic factors in in-context learning. Method: Sequences are built with a repeated token at fixed positions while all other tokens are permuted, which removes semantic confounds and isolates temporal effects on next-token prediction; an ablation experiment is run, and the analysis is extended to semantic contexts with partial overlap. Result: Models generally assign the highest probability to tokens that follow a repeated token, but favor those nearest the start or end of the input; the ablation links this phenomenon in transformers to induction heads; memories in the middle of a prompt are retrieved less reliably; transformer and state-space models show comparable temporal biases. Conclusion: LLMs exhibit systematic temporal biases in in-context learning; these biases can enable temporal separation and episodic-like retrieval and are not tied to a specific architecture. Abstract: In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, where the retrieval of specific events is enabled by separating events that happened at different times, this work probes the ability of various pretrained LLMs, including transformer and state-space models, to differentiate and retrieve temporally separated events. Specifically, we prompted models with sequences containing multiple presentations of the same token, which reappears at the sequence end. By fixing the positions of these repeated tokens and permuting all others, we removed semantic confounds and isolated temporal effects on next-token prediction. Across diverse sequences, models consistently placed the highest probabilities on tokens following a repeated token, but with a notable bias for those nearest the beginning or end of the input. An ablation experiment linked this phenomenon in transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap further demonstrated that memories embedded in the middle of a prompt are retrieved less reliably. Despite architectural differences, state-space and transformer models showed comparable temporal biases. Our findings deepen the understanding of temporal biases in in-context learning and offer an illustration of how these biases can enable temporal separation and episodic retrieval.
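The probe construction described in the abstract (repeated token at fixed positions, all other tokens permuted, repeated token appended as the final cue) can be sketched as below. Token names, positions, and the helper `build_probe` are hypothetical; the paper's actual tokenization and sequence lengths are not given here.

```python
import random

def build_probe(filler_tokens, repeat_token, positions, seed=0):
    """Place `repeat_token` at fixed positions, shuffle fillers elsewhere, append the cue."""
    rng = random.Random(seed)
    fillers = list(filler_tokens)
    rng.shuffle(fillers)  # permuting fillers removes consistent semantic structure
    length = len(fillers) + len(positions)
    seq = [None] * length
    for p in positions:
        seq[p] = repeat_token
    remaining = iter(fillers)
    seq = [tok if tok is not None else next(remaining) for tok in seq]
    return seq + [repeat_token]  # the repeated token reappears at the sequence end

fillers = list("abcdefgh")  # hypothetical filler tokens
positions = [1, 5]          # fixed positions of the repeated token
seq = build_probe(fillers, "X", positions, seed=42)

# Tokens that directly follow each earlier presentation: the retrieval candidates.
followers = [seq[p + 1] for p in positions]
print(seq, followers)
```

Comparing the model's next-token probability for each follower across many seeds would then show whether earlier or later presentations are favored, independent of which filler happens to occupy the slot.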

[59] EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

Li Zhou,Lutong Yu,You Lyu,Yihang Lin,Zefeng Zhao,Junyi Ao,Yuhao Zhang,Benyou Wang,Haizhou Li

Main category: cs.CL

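TL;DR: EchoMind is the first interrelated, multi-level benchmark for evaluating the empathetic dialogue ability of speech language models, covering both semantic content and non-lexical vocal cues. It reveals the limits of current models in handling highly expressive vocal cues and generating empathetic responses.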

Details Motivation: Existing benchmarks mostly evaluate linguistic, acoustic, or reasoning abilities of speech models in isolation and do not assess the integrated skills needed for emotionally intelligent conversation, so a benchmark is needed that combines semantic understanding, vocal-cue perception, and empathetic response generation. Method: EchoMind uses semantically neutral scripts with controlled variations in vocal style to test perception of non-lexical vocal cues; its tasks cover spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation, evaluated with objective and subjective metrics across 3 coarse and 12 fine-grained empathy dimensions. Result: Tests on 12 advanced speech language models show that even state-of-the-art models struggle to capture highly expressive vocal cues and have clear weaknesses in instruction following, robustness to natural speech variability, and effective use of vocal cues, which limits empathetic response quality. Conclusion: Current speech language models still fall well short of integrating linguistic content with diverse vocal cues for truly empathetic dialogue; future work should strengthen their understanding of and response to non-lexical vocal features. Abstract: Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.

[60] Iterative Layer Pruning for Efficient Translation Inference

Yasmin Moslem,Muhammad Hazim Al Farouq,John D. Kelleher

Main category: cs.CL

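TL;DR: This paper studies iterative pruning guided by layer importance analysis to compress large language models such as Aya-Expanse-8B, substantially reducing model size and inference time while maintaining translation quality.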

Details Motivation: LLMs perform well in machine translation and other areas, but their high computational cost makes deployment difficult, so effective model compression methods are needed. Method: Iterative layer pruning guided by layer importance analysis is applied to the Aya-Expanse-8B model and evaluated on Czech-to-German and English-to-Egyptian-Arabic translation. Result: The method substantially reduces model size and inference time while maintaining translation quality comparable to the baseline models. Conclusion: Iterative layer pruning is an effective LLM compression strategy that improves deployment efficiency while preserving translation quality. Abstract: Large language models (LLMs) have transformed many areas of natural language processing, including machine translation. However, efficient deployment of LLMs remains challenging due to their intensive computational requirements. In this paper, we address this challenge and present our submissions to the Model Compression track at the Conference on Machine Translation (WMT 2025). In our experiments, we investigate iterative layer pruning guided by layer importance analysis. We evaluate this method using the Aya-Expanse-8B model for translation from Czech to German, and from English to Egyptian Arabic. Our approach achieves substantial reductions in model size and inference time, while maintaining the translation quality of the baseline models.
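A generic iterative pruning loop can be sketched as follows. The abstract does not say how layer importance is measured, so this sketch assumes importance is the quality lost when a layer is removed; `toy_quality` and the per-layer contributions are invented stand-ins for a real translation metric evaluated on a development set.

```python
def iterative_prune(layers, quality_fn, n_remove):
    """Repeatedly drop the layer whose removal hurts quality the least."""
    kept = list(layers)
    for _ in range(n_remove):
        # Re-evaluate importance each round, since it can change as the model shrinks.
        least_important = max(
            kept, key=lambda layer: quality_fn([k for k in kept if k != layer])
        )
        kept.remove(least_important)
    return kept

# Hypothetical per-layer contributions to translation quality.
toy_contrib = {0: 5.0, 1: 0.2, 2: 3.0, 3: 0.1, 4: 4.0}

def toy_quality(kept_layers):
    """Stand-in for a translation metric computed on a development set."""
    return sum(toy_contrib[layer] for layer in kept_layers)

kept = iterative_prune(range(5), toy_quality, n_remove=2)
print(kept)
```

In practice each round costs one evaluation per remaining layer, which is why the iteration count and the size of the evaluation set are the main cost levers.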

[61] MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion

Haoyi Qiu,Yilun Zhou,Pranav Narayanan Venkit,Kung-Hsiang Huang,Jiaxin Zhang,Nanyun Peng,Chien-Sheng Wu

Main category: cs.CL

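TL;DR: This paper presents MMPersuade, a framework for systematically studying the susceptibility of large vision-language models (LVLMs) to multimodal persuasive content and the effectiveness of that persuasion. Multimodal inputs substantially strengthen persuasion, especially in misinformation scenarios, and the effectiveness of persuasion strategies varies by context.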

Details Motivation: As LVLMs are deployed in shopping, health, and news, they are exposed to large amounts of multimodal persuasive content; understanding how and why models are persuaded is needed to avoid misleading beliefs, overridden user preferences, and unsafe outputs. Method: A multimodal dataset pairs images and videos with established persuasion principles across commercial, subjective and behavioral, and adversarial contexts; an evaluation framework quantifies persuasion effectiveness and model susceptibility using third-party agreement scoring and self-estimated token probabilities on conversation histories. Result: (i) Multimodal inputs substantially increase persuasion effectiveness and model susceptibility compared to text alone, especially in misinformation scenarios; (ii) stated prior preferences reduce susceptibility, but multimodal information keeps its persuasive advantage; (iii) strategy effectiveness varies by context, with reciprocity most potent in commercial and subjective contexts and credibility and logic prevailing in adversarial contexts. Conclusion: MMPersuade provides a principled foundation for developing models that are robust, preference-consistent, and ethically aligned when engaging with persuasive multimodal content. Abstract: As Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news, they are exposed to pervasive persuasive content. A critical question is how these models function as persuadees: how and why they can be influenced by persuasive multimodal inputs. Understanding both their susceptibility to persuasion and the effectiveness of different persuasive strategies is crucial, as overly persuadable models may adopt misleading beliefs, override user preferences, or generate unethical or unsafe outputs when exposed to manipulative messages. We introduce MMPersuade, a unified framework for systematically studying multimodal persuasion dynamics in LVLMs. MMPersuade contributes (i) a comprehensive multimodal dataset that pairs images and videos with established persuasion principles across commercial, subjective and behavioral, and adversarial contexts, and (ii) an evaluation framework that quantifies both persuasion effectiveness and model susceptibility via third-party agreement scoring and self-estimated token probabilities on conversation histories. Our study of six leading LVLMs as persuadees yields three key insights: (i) multimodal inputs substantially increase persuasion effectiveness (and model susceptibility) compared to text alone, especially in misinformation scenarios; (ii) stated prior preferences decrease susceptibility, yet multimodal information maintains its persuasive advantage; and (iii) different strategies vary in effectiveness across contexts, with reciprocity being most potent in commercial and subjective contexts, and credibility and logic prevailing in adversarial contexts. By jointly analyzing persuasion effectiveness and susceptibility, MMPersuade provides a principled foundation for developing models that are robust, preference-consistent, and ethically aligned when engaging with persuasive multimodal content.

[62] Scalable Supervising Software Agents with Patch Reasoner

Junjielong Xu,Boyin Tan,Xiaoyuan Liu,Chao Peng,Pengfei Gao,Pinjia He

Main category: cs.CL

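TL;DR: This paper proposes R4P, a scalable reasoning-based patch verifier for training and testing software engineering agents, which is faster than traditional test-based verification while remaining accurate.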

Details Motivation: Existing test-based supervision does not scale with data: building and running test sandboxes is heavy and fragile, and data with high-coverage tests is rare and vulnerable to test hacking via edge cases. Method: Patch verification is treated as a reasoning task; R4P is trained with reinforcement learning under a group-wise objective, which lets it verify multiple patches against each other's modifications and obtain a dense reward for stable training. Result: R4P reaches 72.2% accuracy in verifying patches from SWE-bench-verified, surpassing OpenAI o3; Mini-SE, trained with rewards from R4P, achieves 26.2% Pass@1, a 10.0% improvement over the original Qwen3-32B, and 32.8% with test-time scaling; verification is on average 50x faster than testing. Conclusion: R4P provides an efficient, stable, and scalable reward mechanism for training SWE agents and shows practical potential. Abstract: While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running a test sandbox is rather heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modification and gain a dense reward for stable training. R4P achieves 72.2% Acc. for verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy along with high efficiency reflect R4P's practicality.

[63] VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions

Thu Phuong Nguyen,Duc M. Nguyen,Hyotaek Jeon,Hyunwook Lee,Hyunmin Song,Sungahn Ko,Taehwan Kim

Main category: cs.CL

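TL;DR: This paper presents VEHME, a vision-language model for evaluating handwritten mathematics expressions. With two-phase training and an expression-aware visual prompting module, it achieves state-of-the-art accuracy among open-source models together with interpretable reasoning.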

Details Motivation: Automatically assessing handwritten mathematical solutions has important applications in educational technology, but diverse formats, unstructured layouts, and symbolic complexity in student work make accurate assessment difficult for existing methods. Method: VEHME combines a two-phase training pipeline (supervised fine-tuning followed by reinforcement learning) with an Expression-Aware Visual Prompting Module that improves spatial understanding of multi-line math expressions. Result: On the AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems. Conclusion: VEHME can assess open-form handwritten math answers accurately and is a scalable, accessible option for automated math assessment. Abstract: Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME (a Vision-Language Model for Evaluating Handwritten Mathematics Expressions), designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized multi-line math expressions dataset to robustly guide attention in visually heterogeneous inputs. Evaluated on AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.

[64] Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP

Poli Nemkova,Amrit Adhikari,Matthew Pearson,Vamsi Krishna Sadu,Mark V. Albert

Main category: cs.CL

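TL;DR: This study presents the first systematic comparison of commercial and open-weight large language models for detecting human rights violations across seven languages. It finds that alignment, not scale, determines cross-lingual stability, and offers practical guidance on the cost-reliability trade-off for resource-constrained humanitarian organizations.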

Details Motivation: Humanitarian organizations must choose between costly commercial APIs and open-weight models that lack empirical validation, especially for low-resource languages. Method: Six models (four instruction-aligned and two open-weight) are evaluated across 78,000 multilingual inferences using standard classification metrics and new cross-lingual reliability measures: Calibration Deviation, Decision Bias, Language Robustness Score, and Language Stability Score. Result: Instruction-aligned models maintain high accuracy and stable calibration across typologically distant and low-resource languages, while open-weight models show significant prompt-language sensitivity and calibration drift; alignment is the key factor in cross-lingual stability. Conclusion: Multilingual alignment enables language-agnostic reasoning, and humanitarian organizations are advised to prioritize well-aligned models for reliable multilingual deployment. Abstract: Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation, especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models, four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B), using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning and provide practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.

[65] Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Haowei Hua,Hong Jiao,Xinyi Wang

Main category: cs.CL

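TL;DR: This study explores automated scoring of long essays with generative language models via summarization and prompting, and reports a clear improvement in scoring accuracy.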

Details Motivation: Encoder models such as BERT are limited to 512 tokens and therefore fall short in automated scoring of long essays, so models that handle long texts more effectively are needed. Method: Generative language models are combined with text summarization and prompting to score long essays automatically. Result: On the Learning Agency Lab Automated Essay Scoring 2.0 dataset, QWK rises from 0.822 to 0.8878. Conclusion: Generative language models combined with summarization and prompting can overcome the length limit and improve automated scoring of long essays. Abstract: BERT and its variants have been extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a deficiency for automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results reveal a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
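For readers unfamiliar with the metric, QWK (quadratic weighted kappa) measures agreement between two sets of ordinal scores, penalizing disagreements by the squared distance between them. The sketch below is a plain-Python implementation of the standard definition; the example scores are invented and unrelated to the paper's data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w = (i - j)^2 / (n - 1)^2."""
    n = n_classes
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            denominator += weight * hist_a[i] * hist_b[j] / total
    return 1.0 - numerator / denominator

# Hypothetical scores on a 0-2 scale.
perfect = quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], 3)
reversed_scores = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3)
print(perfect, reversed_scores)
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement. The function is undefined (division by zero) if both raters assign one and the same single class to every item.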

[66] Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning

Prerna Ravi,Dong Won Lee,Beatriz Flamia,Jasmine David,Brandon Hanks,Cynthia Breazeal,Emma Anderson,Grace Lin

Main category: cs.CL

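TL;DR: This paper studies whether explicit conversational thread links improve the ability of large language models (LLMs) to identify topical threads and code relational moves in synchronous multi-party dialogue. It contributes a systematic annotation guidebook, benchmarks prompting strategies, and shows that clear thread information improves downstream analysis.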

Details Motivation: Understanding how ideas develop and flow in small-group conversation is critical for collaborative learning analytics, but detecting topical threads in synchronous spoken dialogue is hard because of overlapping turns and implicit cues; LLMs show promise for automated discourse analysis yet struggle with tasks that depend on long context and conversational links. Method: The paper provides a systematic guidebook for identifying threads in synchronous multi-party transcripts, compares several LLM prompting strategies for automated threading, and tests how thread information affects downstream coding of collaborative actions such as agreeing, building, and eliciting. Result: Providing clear conversational thread information improves LLM performance on relational-move coding; the results also show how heavily downstream analysis depends on well-structured dialogue, and the paper discusses practical trade-offs in time and cost. Conclusion: Combining explicit thread structure with LLMs improves the analysis of complex, real-time group interaction, and human-AI hybrid approaches can offer the best value in practice. Abstract: Understanding how ideas develop and flow in small-group conversations is critical for analyzing collaborative learning. A key structural feature of these interactions is threading, the way talk naturally organizes into interwoven topical strands that evolve over time. While threading has been widely studied in asynchronous text settings, detecting threads in synchronous spoken dialogue remains challenging due to overlapping turns and implicit cues. At the same time, large language models (LLMs) show promise for automating discourse analysis but often struggle with long-context tasks that depend on tracing these conversational links. In this paper, we investigate whether explicit thread linkages can improve LLM-based coding of relational moves in group talk. We contribute a systematic guidebook for identifying threads in synchronous multi-party transcripts and benchmark different LLM prompting strategies for automated threading. We then test how threading influences performance on downstream coding with conversational analysis frameworks that capture core collaborative actions such as agreeing, building, and eliciting. Our results show that providing clear conversational thread information improves LLM coding performance and underscores the heavy reliance of downstream analysis on well-structured dialogue. We also discuss practical trade-offs in time and cost, emphasizing where human-AI hybrid approaches can yield the best value. Together, this work advances methods for combining LLMs and robust conversational thread structures to make sense of complex, real-time group interactions.

[67] Once Upon an Input: Reasoning via Per-Instance Program Synthesis

Adam Stein,Neelay Velingker,Mayur Naik,Eric Wong

Main category: cs.CL

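TL;DR: This paper proposes Per-Instance Program Synthesis (PIPS), which generates and refines programs at the instance level using structural feedback, without task-specific guidance or explicit test cases, and improves LLM performance on complex multi-step reasoning tasks.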

Details Motivation: LLMs excel at zero-shot inference but still struggle with complex, multi-step reasoning; methods such as Chain of Thought (CoT) and Program of Thought (PoT) help but often produce undesirable solutions in algorithmic domains, so a more robust approach is needed. Method: PIPS generates programs at the instance level and refines them with structural feedback, without task-specific guidance or explicit test cases; a confidence metric chooses dynamically between direct inference and program synthesis. Result: Across three frontier LLMs and 30 benchmarks (including BBEH, visual question answering, relational reasoning, and mathematical reasoning), PIPS improves absolute harmonic mean accuracy by up to 8.6% over PoT and 9.4% over CoT, and reduces undesirable program generations by 65.1% on algorithmic tasks compared to PoT with Gemini-2.0-Flash. Conclusion: Through per-instance program synthesis with structural feedback, PIPS improves the accuracy and stability of LLMs on complex multi-step reasoning, and in particular reduces undesirable programs in algorithmic domains. Abstract: Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such as Chain of Thought (CoT) and Program of Thought (PoT) improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
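Two ingredients mentioned above are easy to illustrate: the per-instance choice between direct inference and program synthesis, and the harmonic mean used to aggregate accuracy across benchmarks. The sketch below is hypothetical throughout: the abstract does not define the confidence metric or any threshold, so `route`, its threshold, and the accuracy values are assumptions.

```python
from statistics import harmonic_mean

def route(confidence, threshold=0.5):
    """Hypothetical router: use program synthesis only when confidence in it is high enough."""
    return "program_synthesis" if confidence >= threshold else "direct_inference"

# Hypothetical per-benchmark accuracies.
accuracies = [0.9, 0.6, 0.3]
hm = harmonic_mean(accuracies)
am = sum(accuracies) / len(accuracies)

print(route(0.8), route(0.2))
print(round(hm, 4), round(am, 4))
```

The harmonic mean is pulled toward the weakest benchmark, so a method cannot score well by excelling on a few tasks while failing others; that makes it a stricter aggregate than the arithmetic mean.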

[68] Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement

Linyang He,Tianjun Zhong,Richard Antonello,Gavin Mischler,Micah Goldblum,Nima Mesgarani

Main category: cs.CL

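TL;DR: Proposes a residual disentanglement method that separates orthogonal representations of lexicon, syntax, meaning, and reasoning from large language models and uses them to predict neural activity, revealing a distinct neural signature for reasoning.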

Details Motivation: Conventional language model representations are highly entangled, which biases brain encoding analyses toward shallow linguistic features and makes it hard to identify the neural basis of deeper cognitive processes such as reasoning. Method: Probe a language model to identify feature-specific layers, iteratively regress out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and reasoning, and use these embeddings to model intracranial ECoG data. Result: 1) The isolated reasoning embedding has unique predictive power, explaining variance in neural activity that other linguistic features cannot, and it involves visual regions; 2) the neural signal for reasoning peaks later (~350-400ms), consistent with its high-level position in the processing hierarchy; 3) the predictive power of standard LLM embeddings comes mainly from shallow features, masking the contribution of deeper cognition. Conclusion: The disentanglement method reveals more accurately the neural mechanisms of the different levels of cognitive processing in language comprehension, in particular a distinct neural representation of reasoning. Abstract: Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large language models (LLMs) are increasingly used to model neural responses to language, their internal representations are highly "entangled," mixing information about lexicon, syntax, meaning, and reasoning. This entanglement biases conventional brain encoding analyses toward linguistically shallow features (e.g., lexicon and syntax), making it difficult to isolate the neural substrates of cognitively deeper processes. Here, we introduce a residual disentanglement method that computationally isolates these components. By first probing an LM to identify feature-specific layers, our method iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically, reasoning. We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas. 2) The neural signature for reasoning is temporally distinct, peaking later (~350-400ms) than signals related to lexicon, syntax, and meaning, consistent with its position atop a processing hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as their predictive success is primarily attributable to linguistically shallow features, masking the more subtle contributions of deeper cognitive processing.

[69] Interpreting and Mitigating Unwanted Uncertainty in LLMs

Tiasa Singha Roy,Ayush Rajesh Jhaveri,Ilias Triantafyllopoulos

Main category: cs.CL

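TL;DR: Investigates the mechanism behind "unwanted uncertainty" in LLMs, where a model changes a previously correct answer into an incorrect one when re-prompted. Adapting the Needle-in-a-Haystack retrieval framework with a Flip-style re-evaluation prompt, the study finds that non-retrieval attention heads, rather than retrieval heads, are the key cause. Masking these heads reduces answer flipping by up to 15% without introducing incoherence or overcorrection, though there are trade-offs on downstream tasks.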

Details Motivation: Address the problem of LLMs changing correct answers to incorrect ones under repeated questioning in high-stakes domains, and so improve model reliability and trust. Method: Adapt the Needle-in-a-Haystack retrieval framework and combine it with a Flip-style re-evaluation prompt to simulate realistic answer flipping; use attention-head analysis to identify the components responsible for the uncertainty, and verify their effect with masking experiments. Result: A small set of non-retrieval attention heads attends disproportionately to misleading tokens in uncertain contexts; masking these heads reduces answer flipping by up to 15% without introducing incoherence or overcorrection, but performance trade-offs appear on downstream tasks. Conclusion: Unwanted uncertainty is driven mainly by specific non-retrieval attention heads rather than by retrieval heads; masking these heads is a simple and effective way to mitigate uncertainty-related failures in LLMs and offers new insight for mechanistic interpretability research. Abstract: Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for avoiding uncertainty. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, when tested for downstream tasks, we observe trade-offs with flip behavior. Our findings contribute to the growing field of mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs.

[70] A Comprehensive Dataset for Human vs. AI Generated Text Detection

Rajarshi Roy,Nasrin Imanpour,Ashhar Aziz,Shashwat Bajpai,Gurpreet Singh,Shwetangshu Biswas,Kapil Wanaskar,Parth Patwa,Subhankar Ghosh,Shreyas Dixit,Nilesh Ranjan Pal,Vipula Rawte,Ritvik Garimella,Gaytri Jena,Amit Sheth,Vasu Sharma,Aishwarya Naresh Reganti,Vinija Jain,Aman Chadha,Amitava Das

Main category: cs.CL

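TL;DR: This paper presents a large-scale dataset of more than 58,000 text samples, combining authentic New York Times articles with synthetic text generated by several state-of-the-art LLMs, to advance AI-generated text detection and model attribution.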

Details Motivation: As LLM-generated text becomes increasingly close to human writing, concerns about content authenticity, misinformation, and trustworthiness grow, and reliable methods for detecting and attributing AI-generated content are urgently needed. Method: Build a dataset of authentic news articles and text generated by multiple state-of-the-art LLMs (such as Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o), and use it to benchmark text detection and model attribution. Result: The baselines reach 58.35% accuracy on distinguishing human-written from AI-generated text and 8.92% accuracy on attributing AI text to its generating model. Conclusion: By combining real news content with the outputs of modern generative models, the dataset supports the development of robust detection and attribution methods and promotes transparency and trust in the era of generative AI. Abstract: The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts alongside full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/gsingh1-py/train.

[71] Batch Speculative Decoding Done Right

Ranran Haoran Zhang,Soumik Dey,Ashirbad Mishra,Hansi Wu,Binbin Li,Rui Zhang

Main category: cs.CL

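TL;DR: This paper proposes EXSPEC, an efficient batch speculative decoding method that solves the "ragged tensor" problem caused by sequences in a batch accepting different numbers of draft tokens, maintaining 95% output equivalence while substantially improving throughput.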

Details Motivation: Speculative decoding accelerates LLM inference by having a small model propose candidate tokens, but in batched settings it causes the ragged tensor problem, which corrupts position IDs, attention masks, and the KV cache. Existing implementations often break output equivalence, so a correct and efficient batch scheme is needed. Method: First analyze the synchronization conditions that ensure correctness and propose the correctness-first EQSPEC scheme; then design EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups to reduce realignment overhead while preserving per-sequence speculative speedups. Result: On the SpecBench dataset, across several target/draft model pairs, throughput improves by up to 3× at batch size 8 compared with batch size 1 while maintaining 95% output equivalence, and the method integrates into existing inference stacks without custom kernels. Conclusion: EXSPEC addresses the ragged tensor problem in batch speculative decoding and substantially improves batched inference efficiency while largely preserving output correctness, making it practical and scalable. Abstract: Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, breaking right-alignment and corrupting position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence, the fundamental requirement that speculative decoding must produce identical token sequences to standard autoregressive generation. These violations occur precisely due to improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present a correctness-first batch speculative decoding EQSPEC that exposes realignment as consuming 40% of overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups, to reduce the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3× throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at https://github.com/eBay/spec_dec.
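To make the "same-length groups" idea concrete, here is a minimal sketch of bucketing a pool of in-flight sequences by their current length, so that each batch is rectangular and needs no padding or realignment. The function name, the pool layout, and the token values are hypothetical; this is not the API of the linked repository, and the actual scheduling policy in EXSPEC may differ.

```python
from collections import defaultdict


def form_same_length_groups(pool, batch_size):
    """Group sequence ids by current token count.

    pool: dict mapping a sequence id to its list of accepted tokens.
    Returns a list of (length, [sequence ids]) batches, each holding
    at most batch_size sequences of identical length.
    """
    by_length = defaultdict(list)
    for seq_id, tokens in pool.items():
        by_length[len(tokens)].append(seq_id)

    batches = []
    for length in sorted(by_length):
        ids = by_length[length]
        for start in range(0, len(ids), batch_size):
            batches.append((length, ids[start:start + batch_size]))
    return batches


# After one verification step, sequences have accepted different numbers
# of draft tokens, so their lengths have diverged (the ragged case).
pool = {
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9, 10, 11],
    "d": [12, 13, 14],
}
batches = form_same_length_groups(pool, batch_size=2)
print(batches)
```

A larger pool makes it more likely that enough sequences share a length to fill a batch, which is presumably why the abstract describes a sliding pool rather than a fixed batch.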

[72] Language Server CLI Empowers Language Agents with Process Rewards

Yifan Zhang,Lanser Contributors

Main category: cs.CL

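TL;DR: Lanser-CLI is a CLI-first orchestration layer that integrates the Language Server Protocol (LSP) to give coding agents and continuous integration deterministic, replayable workflows, using the precise code facts computed by language servers to reduce model hallucination and improve edit accuracy.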

Details Motivation: LLMs often hallucinate APIs and mislocalize edits, while language servers provide verified facts about code. The authors aim to combine the two to build a reliable, reproducible workflow system for coding agents. Method: Lanser-CLI comprises: (i) a robust code addressing mechanism based on a Selector DSL; (ii) standardized Analysis Bundles that normalize LSP responses; (iii) a safe execution environment with preview, workspace isolation, and transactional apply; and (iv) a computable, replayable process reward derived from language server facts. Result: The system delivers deterministic code operations, supports replay and counterfactual analysis under frozen snapshots, and defines a monotonic process-reward function usable for process supervision. It supports symbolic, AST-path, and content-anchored selectors and provides Git-aware transactional handling. Conclusion: Language servers supply not only structural information but also a source of process rewards that aligns agent planning with the real program state. Lanser-CLI gives coding agents a safe, verifiable, and reproducible operating framework. Abstract: Large language models routinely hallucinate APIs and mislocalize edits, while language servers compute verified, IDE-grade facts about real code. We present Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language Server Protocol (LSP) server for coding agents and CI, exposing deterministic, replayable workflows. Our position is that language servers provide not only structural information (definitions, references, types, diagnostics) but also an actionable process reward: machine-checked, step-wise signals that align an agent's planning loop with program reality. In this work, Lanser-CLI contributes: (i) a robust addressing scheme beyond brittle "file:line:col" via a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a principled relocation algorithm; (ii) deterministic Analysis Bundles that normalize Language Server responses and capture environment/capability metadata with stable content hashes; (iii) a safety envelope for mutating operations (rename, code actions) with preview, workspace jails, and Git-aware, transactional apply; and (iv) a process-reward functional derived from Language Server facts (diagnostic deltas, disambiguation confidence, and safe-apply checks) that is computable online and replayable offline. We formalize determinism under frozen snapshots and establish a monotonicity property for the process reward, making it suitable for process supervision and counterfactual analysis. Project Page: https://github.com/yifanzhang-pro/lanser-cli

[73] Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Liwei Jiang,Yuanjun Chai,Margaret Li,Mickel Liu,Raymond Fok,Nouha Dziri,Yulia Tsvetkov,Maarten Sap,Alon Albalak,Yejin Choi

Main category: cs.CL

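TL;DR: This paper introduces the Infinity-Chat dataset for systematically studying mode collapse and the "Artificial Hivemind" effect in open-ended LLM generation, revealing strong homogeneity across models and highlighting that the diversity of human preferences is insufficiently modeled.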

Details Motivation: Language models are limited in generating diverse, human-like creative content, which may homogenize human thought, yet scalable methods for evaluating diversity are lacking. Method: Build the Infinity-Chat dataset of 26K open-ended queries and a prompt taxonomy with 6 top-level categories and 17 subcategories, and combine them with 31,250 human annotations for a large-scale analysis of output diversity and human preferences. Result: Language models show a pronounced "Artificial Hivemind" effect, with repetition within a model and strong homogeneity across models; existing models are also poorly calibrated to human preferences that reflect individual differences. Conclusion: Infinity-Chat is the first large-scale resource for studying homogenization risks in open-ended generation and points to an AI safety direction that attends to diversity and individual preferences. Abstract: Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that further break down into 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and more so (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, Infinity-Chat presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.

[74] Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts

Anwesan Pal,Karen Hovsepian,Tinghao Guo,Mengnan Zhao,Somendra Tripathi,Nikos Kanakaris,George Mihaila,Sumit Nigam

Main category: cs.CL

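TL;DR: This paper proposes Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that improves LLM performance in long-context scenarios, with gains of up to 17% for 32K-token contexts and 2.9% on complex multi-hop question answering.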

Details Motivation: Current LLMs are significantly limited when handling long and complex contexts. Methods such as RAG try to mitigate this, but they are sensitive to chunking, embedding, and retrieval strategies and depend on extensive pre-processing. Method: Propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that improves model performance by tagging the context or by adding only tag definitions to the QA prompt. Result: The hypothesis is validated on two challenging QA benchmarks, NoLima and NovelQA; TAG improves over the baseline by up to 17% for 32K-token contexts and by 2.9% on multi-hop queries that require complex reasoning. Conclusion: TAG is an effective, lightweight method that markedly improves LLM performance in long-context scenarios without altering the integrity or composition of the retrieved documents. Abstract: Recent investigations into effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in effective question answering (QA) and reasoning over long and complex contexts for even the largest and most impressive cadre of models. While approaches like retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to mitigate this issue, they are sensitive to chunking, embedding and retrieval strategies and models, and furthermore, rely on extensive pre-processing, knowledge acquisition and indexing steps. In this paper, we propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that boosts LLM performance in long-context scenarios, without degrading and altering the integrity and composition of retrieved documents. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks (NoLima and NovelQA) and show that tagging the context or even just adding tag definitions into QA prompts leads to consistent performance gains over the baseline: up to 17% for 32K token contexts, and 2.9% in complex reasoning question-answering for multi-hop queries requiring knowledge across a wide span of text. Additional details are available at https://sites.google.com/view/tag-emnlp.

[75] MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs

Yucheng Ning,Xixun Lin,Fang Fang,Yanan Cao

Main category: cs.CL

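TL;DR: Proposes a systematic approach to evaluating and improving the factual reliability of long-form LLM outputs, constructing LongHalluQA, a Chinese long-form factuality dataset, and developing MAD-Fact, a debate-based multi-agent verification system.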

Details Motivation: Evaluation methods designed for short texts fail on long-form content because of complex reasoning chains, intertwined perspectives, and cumulative information, making it hard to ensure the factual accuracy of LLMs in high-risk domains such as biomedicine, law, and education. Method: Construct LongHalluQA, a large-scale Chinese long-form factuality dataset; design MAD-Fact, a verification mechanism based on multi-agent debate; and introduce a weighted fact importance hierarchy to capture the differing significance of claims. Result: Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency and that domestic models perform better on Chinese content. Conclusion: The work provides a structured framework for evaluating and enhancing the factual reliability of long-form LLM outputs and helps guide their safe deployment in sensitive domains. Abstract: The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset; and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.

[76] Measuring Teaching with LLMs

Michael Hardy

Main category: cs.CL

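TL;DR: This study proposes custom LLMs built on sentence embeddings for objectively assessing classroom teaching quality. Under data-efficient training they match or exceed human raters and align with teacher value-added measures, demonstrating a new approach to AI-driven instructional measurement.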

Details Motivation: General-purpose LLMs perform poorly when applying complex classroom observation instruments and struggle with long, interpretation-heavy classroom transcripts, so a better-suited model architecture is needed for reliable, scalable measurement of teaching quality. Method: Use custom LLMs built on sentence-level embeddings, systematically evaluate five sentence embedding methods under a data-efficient training regime that guards against overfitting, and analyze how different context windows affect score attribution. Result: The custom models reach expert-level performance on classroom quality scores (correlation > 0.65) and exceed the average human-human rater agreement; they extract signal better from lesson-level features than from isolated utterances, and aggregate scores align with teacher value-added measures, although generalization at the individual item level is incomplete. Conclusion: Custom LLMs built on sentence embeddings offer a viable and powerful new method for AI-driven assessment of teaching quality, with the potential to deliver scalable, reliable, and valid feedback to teachers. Abstract: Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited for the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance, with correlations to expert human ratings above 0.65 that surpass the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models (those better aligned with human judgments) attribute a larger share of score variation to lesson-level features rather than isolated utterances, challenging the sufficiency of single-turn annotation paradigms. Finally, to assess external validity, we find that aggregate model scores align with teacher value-added measures, indicating they are capturing features relevant to student learning. However, this trend does not hold at the individual item level, suggesting that while the models learn useful signals, they have not yet achieved full generalization. This work establishes a viable and powerful new methodology for AI-driven instructional measurement, offering a path toward providing scalable, reliable, and valid feedback for educator development.

[77] Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures

Shenran Wang,Timothy Tin-Long Tse,Jian Zhu

Main category: cs.CL

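TL;DR: This paper evaluates LLMs of different architectures in depth on knowledge-based in-context learning tasks and finds that, although the models perform similarly, their internal mechanisms differ. Function vectors (FVs) are located mainly in the self-attention and Mamba layers and play a key role in parametric knowledge retrieval but not in contextual understanding. The authors also speculate that Mamba2 performs in-context learning through a mechanism other than FVs.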

Details Motivation: Reveal how the internal mechanisms of in-context learning differ across LLM architectures, especially between knowledge retrieval and contextual understanding tasks. Method: Combine behavioral probing with intervention-based methods to analyze transformer, state-space, and hybrid LLMs, identifying where function vectors (FVs) are located and what role they play. Result: FVs reside mainly in the self-attention and Mamba layers; they are crucial for parametric knowledge retrieval but matter little for contextual understanding; Mamba2 may use an ICL mechanism different from FVs. Conclusion: LLMs of different architectures perform similarly on tasks but differ fundamentally in their internals; combining behavioral and mechanistic analyses gives a more complete understanding of model capabilities. Abstract: We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. Using a combination of behavioral probing and intervention-based methods, we have discovered that, while LLMs of different architectures can behave similarly in task performance, their internals could remain different. We discover that function vectors (FVs) responsible for ICL are primarily located in the self-attention and Mamba layers, and speculate that Mamba2 uses a different mechanism from FVs to perform ICL. FVs are more important for ICL involving parametric knowledge retrieval, but not for contextual knowledge understanding. Our work contributes to a more nuanced understanding across architectures and task types. Methodologically, our approach also highlights the importance of combining both behavioural and mechanistic analyses to investigate LLM capabilities.

[78] LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models

Sammriddh Gupta,Sonit Singh,Aditya Joshi,Mira Kim

Main category: cs.CL

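TL;DR: This paper presents LangLingual, a conversational agent built with the LangChain framework and LLMs that gives language learners real-time grammar feedback, context-aware exercises, and progress tracking.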

Details Motivation: Language educators want to offer learners a rich experience but are limited in the amount of feedback and practice they can provide. Method: Design and develop LangLingual, a conversational agent built on the LangChain framework and LLMs that provides real-time grammar feedback, generates context-aware exercises, and tracks learner proficiency. Result: The evaluation indicates good usability, positive learning outcomes, and high learner engagement. Conclusion: LangLingual effectively supports language learning and helps compensate for the limits on the feedback and practice educators can provide. Abstract: Language educators strive to create a rich experience for learners, while they may be restricted in the extent of feedback and practice they can provide. We present the design and development of LangLingual, a conversational agent built using the LangChain framework and powered by Large Language Models. The system is specifically designed to provide real-time, grammar-focused feedback, generate context-aware language exercises and track learner proficiency over time. The paper discusses the architecture, implementation and evaluation of LangLingual in detail. The results indicate strong usability, positive learning outcomes and encouraging learner engagement.

[79] Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Ran Xu,Jingjing Chen,Jiayu Ye,Yu Wu,Jun Yan,Carl Yang,Hongkun Yu

Main category: cs.CL

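TL;DR: TIR-Judge is an end-to-end reinforcement learning framework based on tool-integrated reasoning that strengthens LLM judges by integrating a code executor, outperforming existing methods across a range of evaluation tasks.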

Details Motivation: Existing LLM judges rely mainly on text-based reasoning and struggle to verify complex constraints or compute accurately, which limits evaluation accuracy. Inspired by the success of tool-integrated reasoning on other tasks, the paper aims to improve the verification and computation abilities of LLM judges. Method: Propose the TIR-Judge framework, which uses a code executor for precise evaluation; it combines diverse training across verifiable and non-verifiable domains, flexible judgment formats (pointwise, pairwise, listwise), and iterative reinforcement learning that bootstraps without distillation. Result: On seven public benchmarks, TIR-Judge improves over strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and at 8B parameters its listwise performance is comparable to Claude-Opus-4; TIR-Judge-Zero matches the distilled variants without any distillation data. Conclusion: Tool-augmented LLM judges can self-evolve through iterative reinforcement learning, without human annotation or distillation data, and substantially improve evaluation accuracy and generalization. Abstract: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.

[80] Knocking-Heads Attention

Zhanchao Zhou,Xiaodong Chen,Haoxing Chen,Zhenzhong Lan,Jianguo Li

Main category: cs.CL

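TL;DR: Proposes knocking-heads attention (KHA), a mechanism that strengthens the representational capacity of multi-head attention by introducing feature-level interactions between attention heads, improving training stability and downstream performance with only a small number of added parameters.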

Details Motivation: In existing attention mechanisms (such as MHA, GQA, and GTA) the heads operate independently without strong interaction, and adding heads weakens the representational capacity of each one. Method: Introduce a shared, diagonally-initialized projection matrix so that attention heads interact at the feature level before attention is computed (the "knocking"), preserving initial specialization while gradually learning integrated cross-head representations. Result: KHA is validated by training a 6.1B-parameter MoE model on 1T tokens; compared with baseline methods it shows better and more stable training dynamics and performs better on downstream tasks. Conclusion: KHA strengthens cooperation between attention heads and improves model performance, is compatible with MHA and its variants, and adds little overhead while scaling well. Abstract: Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms, whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA), simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.

[81] Quality-Aware Translation Tagging in Multilingual RAG system

Hoyeon Moon,Byeolhee Kim,Nikhil Verma

Main category: cs.CL

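TL;DR: Proposes QTT-RAG, a quality-aware translation tagging method for multilingual retrieval-augmented generation that evaluates translation quality while leaving the original content intact, improving response generation in low-resource language settings.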

Details Motivation: For low-resource languages, existing methods rely on translating English documents, but poor translation quality degrades generation, and rewriting-based methods tend to introduce factual errors and hallucinations. Method: QTT-RAG explicitly evaluates translation quality along three dimensions (semantic equivalence, grammatical accuracy, and naturalness and fluency) and attaches the scores as metadata without modifying the original text. Result: On two open-domain QA benchmarks, XORQA and MKQA, evaluated with six instruction-tuned LLMs, QTT-RAG outperforms the CrossRAG and DKM-RAG baselines in Korean and Finnish (low-resource) and Chinese (high-resource). Conclusion: QTT-RAG preserves factual integrity while letting the generator make better decisions based on translation reliability, offering a practical and robust way to use cross-lingual documents in multilingual settings. Abstract: Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English documents and translates them into the query language for low-resource settings. However, poor translation quality degrades response generation performance. Existing approaches either assume sufficient translation quality or utilize the rewriting method, which introduces factual distortion and hallucinations. To mitigate these problems, we propose Quality-Aware Translation Tagging in mRAG (QTT-RAG), which explicitly evaluates translation quality along three dimensions (semantic equivalence, grammatical accuracy, and naturalness and fluency) and attaches these scores as metadata without altering the original content. We evaluate QTT-RAG against CrossRAG and DKM-RAG as baselines in two open-domain QA benchmarks (XORQA, MKQA) using six instruction-tuned LLMs ranging from 2.4B to 14B parameters, covering two low-resource languages (Korean and Finnish) and one high-resource language (Chinese). QTT-RAG outperforms the baselines by preserving factual integrity while enabling generator models to make informed decisions based on translation reliability. This approach allows for effective usage of cross-lingual documents in low-resource settings with limited native language documents, offering a practical and robust solution across multilingual domains.
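The tagging step described above can be sketched as attaching scores next to a translated passage instead of rewriting it. The class, field names, score scale (0 to 1), header format, and example passage below are all hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPassage:
    text: str        # translated passage, kept verbatim
    semantic: float  # semantic equivalence score (hypothetical 0-1 scale)
    grammar: float   # grammatical accuracy score
    fluency: float   # naturalness and fluency score


def render_for_generator(passage):
    """Prefix the passage with its quality tags; the text itself is untouched."""
    header = (
        f"[translation quality | semantic={passage.semantic:.2f} "
        f"grammar={passage.grammar:.2f} fluency={passage.fluency:.2f}]"
    )
    return header + "\n" + passage.text


# Hypothetical passage and scores, for illustration only.
passage = TaggedPassage("Helsinki is the capital of Finland.", 0.95, 0.90, 0.70)
prompt_block = render_for_generator(passage)
print(prompt_block)
```

Keeping the scores as metadata means a low-quality translation is flagged rather than paraphrased, which is the property the abstract contrasts with rewriting-based methods.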

[82] A Survey on LLM Mid-training

Chengying Tu,Xuemiao Zhang,Rongxiang Weng,Rumei Li,Chen Zhang,Yang Bai,Hongfei Yan,Jingang Wang,Xunliang Cai

Main category: cs.CL

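TL;DR: This survey reviews the role of mid-training in LLMs, gives a formal definition, examines frameworks covering data curation, training strategies, and model architecture optimization, and explains how mid-training contributes to capabilities such as mathematics, coding, and reasoning.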

Details Motivation: Multi-stage training of foundation models has shown clear benefits in recent years, and mid-training is emerging as a key stage bridging pre-training and post-training, yet it lacks a systematic definition and analysis. Method: Survey existing work, propose a formal definition of mid-training, build an optimization framework spanning data curation, training strategies, and model architecture optimization, and analyze how mainstream models implement objective-driven interventions. Result: The survey clarifies how mid-training enhances specific capabilities while maintaining foundational competencies, establishes a comprehensive taxonomy of mid-training, and offers actionable insights. Conclusion: Mid-training is a distinct and critical stage in LLM development, and advancing it systematically will support future capability gains and innovation in language models. Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.

[83] MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

Suchan Lee,Jihoon Choi,Sohyeon Lee,Minseok Song,Bong-Gyu Jang,Hwanjo Yu,Soyeon Caren Han

Main category: cs.CL

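TL;DR: Proposes MAP4TS, a multi-aspect prompting framework that brings classical time-series analysis into prompt design by combining global, local, statistical, and temporal prompt components, outperforming existing LLM-based time-series forecasting methods on multiple datasets.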

Details Motivation: Existing LLM-based multimodal time-series forecasting methods overlook the statistical properties and temporal dependencies specific to time-series data, which limits performance. Method: Design four specialized prompt components: a global domain prompt, a local domain prompt, a statistical prompt (based on ACF and PACF), and a temporal prompt (based on Fourier analysis); combine these prompts with raw time-series embeddings and pass them through a cross-modality alignment module into an LLM for forecasting. Result: Experiments on eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based forecasting methods; ablations show that prompt design markedly improves stability and that GPT-2 with structured prompts outperforms larger models such as LLaMA in long-term forecasting. Conclusion: Incorporating classical time-series analysis into prompt design improves the performance and stability of LLMs in time-series forecasting, and structured prompts matter more than model scale. Abstract: Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.
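As a rough illustration of a statistical prompt, the sketch below computes the sample autocorrelation function (ACF) of a series and renders it as text. The prompt wording, function names, and example series are assumptions made for illustration; the paper's actual prompt templates are not given in this listing.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return num / denom


def statistical_prompt(series, max_lag=3):
    """Render ACF values as a short text fragment (hypothetical format)."""
    parts = [
        f"lag {k}: {autocorrelation(series, k):.2f}"
        for k in range(1, max_lag + 1)
    ]
    return "Autocorrelation of the input series: " + ", ".join(parts) + "."


# A toy series with period 6, for illustration only.
series = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2]
print(statistical_prompt(series, max_lag=6))
```

For this toy series the ACF is highest at lag 6, which is the kind of seasonality cue a statistical prompt could pass to the model in words.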

[84] Leveraging Hierarchical Organization for Medical Multi-document Summarization

Yi-Li Hsu,Katelyn X. Mei,Lucy Lu Wang

Main category: cs.CL

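TL;DR: This paper studies whether adding hierarchical structure to medical multi-document summarization (MDS) improves a model's ability to organize and contextualize information across documents. The results show that hierarchical approaches preserve factuality, coverage, and coherence while markedly improving summary clarity and human preference.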

Details Motivation: Medical multi-document summarization requires effective handling of cross-document relationships, and traditional flat approaches are limited in organizing information and integrating context, so the paper explores hierarchical structure as a way to improve summary quality. Method: Explore two ways of incorporating hierarchical structure across three LLMs, and evaluate the results with automated metrics, model-based metrics, and domain experts along several dimensions (such as understandability, clarity, and relevance). Result: Human experts prefer model-generated summaries over human-written ones; hierarchical approaches perform well on factuality, coverage, and coherence and markedly increase human preference for the summaries; simulated judgments from GPT-4 agree well with human judgments on the more objective dimensions. Conclusion: Hierarchical structure improves the clarity of, and user preference for, medical multi-document summaries while maintaining content completeness, offering a practical route to high-quality medical summaries. Abstract: Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.

[85] Flexing in 73 Languages: A Single Small Model for Multilingual Inflection

Tomáš Sourada,Jana Straková

Main category: cs.CL

TL;DR: Proposes a lightweight single-model approach to multilingual inflection, trained jointly on 73 languages, that outperforms monolingual baselines in most languages; the code is publicly released.

Details Motivation: To address the lack of an open-source, general-purpose multilingual morphological inflection system that handles unseen words across many languages, including Czech. Method: A compact single model is trained jointly on data from 73 languages, and a frequency-weighted, lemma-disjoint train-dev-test resampling procedure is introduced to make the data splits more realistic. Result: The model performs well on the standard SIGMORPHON shared task benchmarks and on 73 UD treebanks, is robust to unseen words, and removes the need to maintain a separate model per language, which simplifies deployment. Conclusion: Multilingual modeling is effective and practical for morphological inflection: it substantially reduces model-management cost while maintaining high performance. Abstract: We present a compact, single-model approach to multilingual inflection, the task of generating inflected word forms from base lemmas to express grammatical categories. Our model, trained jointly on data from 73 languages, is lightweight, robust to unseen words, and outperforms monolingual baselines in most languages. This demonstrates the effectiveness of multilingual modeling for inflection and highlights its practical benefits: simplifying deployment by eliminating the need to manage and retrain dozens of separate monolingual models. In addition to the standard SIGMORPHON shared task benchmarks, we evaluate our monolingual and multilingual models on 73 Universal Dependencies (UD) treebanks, extracting lemma-tag-form triples and their frequency counts. To ensure realistic data splits, we introduce a novel frequency-weighted, lemma-disjoint train-dev-test resampling procedure. Our work addresses the lack of an open-source, general-purpose, multilingual morphological inflection system capable of handling unseen words across a wide range of languages, including Czech. All code is publicly released at: https://github.com/tomsouri/multilingual-inflection.
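
To make the lemma-disjoint idea concrete, here is a minimal sketch that assigns every form of a lemma to the same split, so test lemmas are never seen in training. It is not the authors' resampling procedure: the frequency weighting is omitted, and the function name, fractions, and seed are assumptions for illustration.

```python
import random


def lemma_disjoint_split(triples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (lemma, tag, form) triples so that no lemma spans two splits."""
    lemmas = sorted({lemma for lemma, _tag, _form in triples})
    random.Random(seed).shuffle(lemmas)
    n_test = max(1, round(len(lemmas) * test_frac))
    n_dev = max(1, round(len(lemmas) * dev_frac))
    test_lemmas = set(lemmas[:n_test])
    dev_lemmas = set(lemmas[n_test:n_test + n_dev])
    splits = {"train": [], "dev": [], "test": []}
    for triple in triples:
        lemma = triple[0]
        if lemma in test_lemmas:
            splits["test"].append(triple)
        elif lemma in dev_lemmas:
            splits["dev"].append(triple)
        else:
            splits["train"].append(triple)
    return splits


# Toy data: 10 made-up lemmas, each with a singular and a plural form.
triples = [
    (f"lemma{i}", tag, f"lemma{i}-{tag}")
    for i in range(10)
    for tag in ("SG", "PL")
]
splits = lemma_disjoint_split(triples)
```

Splitting by lemma rather than by triple is what forces the evaluation to measure generalization to unseen words.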

[86] Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation

Shiwei Li,Xiandi Luo,Haozhao Wang,Xing Tang,Ziqiang Cui,Dugang Liu,Yuhua Li,Xiuqiang He,Ruixuan Li

Main category: cs.CL

TL;DR: Proposes TopLoRA, a new low-rank adaptation method that dynamically adjusts LoRA weights for each input token, enabling finer-grained parameter-efficient fine-tuning; it outperforms standard LoRA and its variants across multiple models and datasets.

Details Motivation: In standard LoRA all tokens share the same weights, so the method cannot capture semantic differences between tokens, which limits its expressiveness. A projection that adapts to each input token is needed to better model token-specific information. Method: The authors propose Token-wise Projected Low-Rank Adaptation (TopLoRA), which expresses the LoRA weights as $B\Sigma_X A$, where $A$ and $B$ are low-rank matrices and $\Sigma_X$ is a diagonal matrix generated from the input token $X$, yielding token-level adaptive projections without increasing the rank of the weights. Result: Experiments across multiple models and datasets show that TopLoRA remains parameter-efficient while significantly outperforming standard LoRA and its variants, supporting its effectiveness at capturing token-specific information. Conclusion: By introducing token-dependent dynamic weight adjustment, TopLoRA increases the expressiveness of LoRA without raising the rank, making it a finer-grained and more effective parameter-efficient fine-tuning method. Abstract: Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). LoRA essentially describes the projection of an input space into a low-dimensional output space, with the dimensionality determined by the LoRA rank. In standard LoRA, all input tokens share the same weights and undergo an identical input-output projection. This limits LoRA's ability to capture token-specific information due to the inherent semantic differences among tokens. To address this limitation, we propose Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts LoRA weights according to the input token, thereby learning token-wise input-output projections in an end-to-end manner. Formally, the weights of TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA weights but achieves more granular adaptation by learning token-wise LoRA weights (i.e., token-wise input-output projections). Extensive experiments across multiple models and datasets demonstrate that TopLoRA consistently outperforms LoRA and its variants. The code is available at https://github.com/Leopold1423/toplora-neurips25.
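
The expression $B\Sigma_X A$ can be illustrated with a small plain-Python sketch. How $\Sigma_X$ is actually generated from the token is not described in this entry, so `diag_fn` below is a hypothetical stand-in, and the toy matrices are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def toplora_update(x, A, B, diag_fn):
    """Return B @ diag(diag_fn(x)) @ A @ x for a single token vector x."""
    hidden = matvec(A, x)                    # project into the rank-r space
    scale = diag_fn(x)                       # r diagonal entries of Sigma_X (assumed form)
    scaled = [s * h for s, h in zip(scale, hidden)]
    return matvec(B, scaled)                 # project back to the output space


A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # r x d_in, with r = 2
B = [[1.0, 0.0], [0.0, 2.0]]                # d_out x r
x = [1.0, 2.0, 3.0]

# An all-ones diagonal reduces to the standard LoRA update B @ A @ x.
lora_out = toplora_update(x, A, B, lambda _x: [1.0, 1.0])
# A token-dependent diagonal rescales each rank-r direction separately.
token_out = toplora_update(x, A, B, lambda _x: [2.0, 0.5])
```

Because the diagonal matrix is only r by r, the product still has rank at most r, which is consistent with the abstract's statement that the rank of the LoRA weights does not increase.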

[87] Corpus Frequencies in Morphological Inflection: Do They Matter?

Tomáš Sourada,Jana Straková

Main category: cs.CL

TL;DR: This paper explores how to incorporate corpus frequency information into morphological inflection, considering word frequency along three dimensions (data splitting, evaluation metrics, and training-data sampling) to better reflect real-world language use.

Details Motivation: Traditional approaches to morphological inflection ignore the frequency distribution of words, whereas in real deployments user inputs tend to follow the frequency distribution of natural text, so frequency information should be taken into account during system development. Method: (1) A train-dev-test split that combines a lemma-disjoint approach with frequency weighting; (2) token accuracy, which weights items by word frequency, as an additional evaluation metric; (3) frequency-aware training, which explicitly incorporates word frequency into training-data sampling. Result: Frequency-aware training outperforms uniform sampling in 26 of 43 languages. Conclusion: Incorporating frequency information into morphological inflection helps improve model performance in realistic settings, especially on running text dominated by frequent words. Abstract: The traditional approach to morphological inflection (the task of modifying a base word (lemma) to express grammatical categories) has been, for decades, to consider lexical entries of lemma-tag-form triples uniformly, lacking any information about their frequency distribution. However, in production deployment, one might expect the user inputs to reflect a real-world distribution of frequencies in natural texts. With future deployment in mind, we explore the incorporation of corpus frequency information into the task of morphological inflection along three key dimensions during system development: (i) for train-dev-test split, we combine a lemma-disjoint approach, which evaluates the model's generalization capabilities, with a frequency-weighted strategy to better reflect the realistic distribution of items across different frequency bands in training and test sets; (ii) for evaluation, we complement the standard type accuracy (often referred to simply as accuracy), which treats all items equally regardless of frequency, with token accuracy, which assigns greater weight to frequent words and better approximates performance on running text; (iii) for training data sampling, we introduce a method novel in the context of inflection, frequency-aware training, which explicitly incorporates word frequency into the sampling process. We show that frequency-aware training outperforms uniform sampling in 26 out of 43 languages.
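
The difference between type accuracy and token accuracy, and the idea of frequency-aware sampling, can be sketched as follows. The toy items and frequencies are invented, and sampling proportionally to raw frequency is an assumption: the abstract does not state the exact weighting the authors use.

```python
import random


def type_accuracy(items):
    """Every (gold, predicted, frequency) item counts equally."""
    return sum(gold == pred for gold, pred, _freq in items) / len(items)


def token_accuracy(items):
    """Each item counts in proportion to its corpus frequency."""
    total = sum(freq for _gold, _pred, freq in items)
    return sum(freq for gold, pred, freq in items if gold == pred) / total


def frequency_aware_sample(items, k, seed=0):
    """Draw k training items with probability proportional to frequency (assumed scheme)."""
    weights = [freq for _gold, _pred, freq in items]
    return random.Random(seed).choices(items, weights=weights, k=k)


# (gold form, predicted form, corpus frequency): made-up numbers.
items = [("went", "went", 90), ("walked", "walked", 9), ("clove", "cleaved", 1)]

print(type_accuracy(items), token_accuracy(items))
batch = frequency_aware_sample(items, k=5)
```

A single error on a rare form costs a third of the type accuracy here but only one percent of the token accuracy, which is why the two metrics complement each other.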

[88] ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix

Zile Yang,Ling Li,Na Di,Jinlong Pang,Yao Zhou,Hao Cheng,Bo Han,Jiaheng Wei

Main category: cs.CL

TL;DR: The ENTP framework enhances low-quality supervised fine-tuning data through symbolic purification and neural reconstruction; experiments show that it outperforms existing methods on multiple instruction-following benchmarks.

Details Motivation: Existing methods for selecting high-quality data overlook the valuable information in low-quality data and depend on imperfect quality filters. Method: The authors propose ENTP, a framework that combines a symbolic module (which removes noisy samples based on statistical priors) with a neural module (which synthesizes enriched instruction-response pairs using latent representations and model knowledge). Result: Datasets built with ENTP exclusively from low-quality data outperform 13 existing data-selection baselines on five instruction-following benchmarks and surpass fine-tuning on the full original dataset (approximately 300K examples). Conclusion: Low-quality data has untapped potential, and intelligent purification and synthesis are important for efficient instruction alignment. Abstract: Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pairs, typically drawn from a larger dataset that often contains many low-quality or noisy samples. However, existing quality-first paradigms often overlook valuable signals in discarded low-quality data and rely on imperfect quality filters. We introduce ENTP (Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a framework that revitalizes low-quality corpora through symbolic purification and neural reconstruction. The symbolic module identifies and prunes noisy samples based on statistical priors, while the neural component synthesizes enriched instruction-response pairs by leveraging latent representations and model knowledge. This neural-symbolic synergy enhances data informativeness and diversity. Experiments show that ENTP-augmented datasets, constructed exclusively from low-quality data, outperform 13 established data-selection baselines across five instruction-following benchmarks, and even surpass fine-tuning on the full original dataset (approximately 300K examples). Our results highlight the untapped potential of low-quality data and underscore the importance of intelligent purification and synthesis for efficient instruction alignment.

[89] Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs

Hang Lei,Shengyi Zong,Zhaoyan Li,Ziren Zhou,Hao Liu

Main category: cs.CL

TL;DR: This paper proposes Dual-Stage Refinement (DSR), a decomposed framework that separates creative narrative generation from format conversion to improve the ability of large language models to generate high-quality screenplays.

Details Motivation: Direct end-to-end screenplay generation often fails to satisfy both creative storytelling and strict formatting requirements, producing outputs that lack structural integrity and narrative depth. Method: A two-stage framework is used: the first stage turns a brief outline into novel-style narrative prose, and the second stage refines that prose into a professionally formatted screenplay. A hybrid data synthesis method addresses the scarcity of training data. Result: In blind evaluations by professional screenwriters, DSR achieves a 75% win rate against strong baselines such as Gemini-2.5-Pro and reaches 82.7% of human-level performance. Conclusion: A decomposed generation architecture combined with tailored data synthesis can effectively specialize large language models for complex creative tasks. Abstract: The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential in creative writing, direct end-to-end generation approaches often fail to produce well-crafted screenplays. We argue this failure stems from forcing a single model to simultaneously master two disparate capabilities: creative narrative construction and rigid format adherence. The resulting outputs may mimic superficial style but lack the deep structural integrity and storytelling substance required for professional use. To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage Refinement (DSR), a decomposed framework that decouples creative narrative generation from format conversion. The first stage transforms a brief outline into rich, novel-style prose. The second stage refines this narrative into a professionally formatted screenplay. This separation enables the model to specialize in one distinct capability at each stage. A key challenge in implementing DSR is the scarcity of paired outline-to-novel training data. We address this through hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, while forward synthesis leverages these inputs to generate high-quality narrative texts as training targets. Blind evaluations by professional screenwriters show that DSR achieves a 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance. Our work demonstrates that decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains.

[90] MATCH: Task-Driven Code Evaluation through Contrastive Learning

Marah Ghoummaid,Vladimir Tchuiev,Ofek Glick,Michal Moschkovitz,Dotan Di Castro

Main category: cs.CL

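TL;DR: This paper proposes MATCH, a new reference-free metric for evaluating code generation that uses contrastive learning to produce meaningful embeddings of code and natural-language task descriptions, measuring how well generated code aligns with developer intent.

Details Motivation: Existing evaluation methods for code generation have clear drawbacks: unit tests are costly, syntactic similarity metrics fail to capture functional equivalence, and many metrics require reference code, leaving few effective reference-free options. Method: MATCH uses contrastive learning to embed code and natural-language task descriptions, enabling functional similarity scoring without reference code. Result: Across multiple programming languages, MATCH correlates more strongly with functional correctness and human preference than existing metrics. Conclusion: MATCH is an effective reference-free metric for code generation that outperforms traditional methods at capturing functional alignment and predicting human preference. Abstract: AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.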

[91] SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations

Shuai Huang,Wenxuan Zhao,Jun Gao

Main category: cs.CL

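TL;DR: This paper introduces SI-Bench, a new benchmark built from real social-app conversations for evaluating the social intelligence of large language models in complex social interactions. Experiments show that although state-of-the-art models surpass human experts in process reasoning under complex situations, they still lag behind humans in reply quality, and introducing chain-of-thought (CoT) reasoning may degrade their performance on social dialogue tasks.

Details Motivation: Prior work mostly relies on simulated agent-to-agent interaction data, which fails to capture the linguistic styles and relational dynamics of real human conversations, so there is no reliable evaluation of LLM social intelligence in real social interactions. A benchmark grounded in real human interactions is therefore needed. Method: Grounded in broad social science theories, the authors collect 2,221 authentic multi-turn dialogues from a social networking application to build SI-Bench, manually annotate a subset of 312 dialogues, and evaluate 8 major large language models on social intelligence. Result: State-of-the-art models surpass human experts in process reasoning under complex social situations but still fall behind humans in reply quality; in addition, introducing chain-of-thought (CoT) reasoning may harm model performance in social dialogue. Conclusion: SI-Bench provides a more realistic and reliable benchmark for evaluating the social intelligence of large language models. It exposes remaining limitations of current models in social dialogue, especially in generating high-quality replies, and suggests that complex reasoning mechanisms such as CoT do not always improve social interaction performance. Abstract: As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, which fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that SOTA models have surpassed the human expert in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.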

[92] DREaM: Drug-Drug Relation Extraction via Transfer Learning Method

Ali Fata,Hossein Rahmani,Parinaz Soltanzadeh,Amirhossein Derakhshan,Behrouz Minaei Bidgoli

Main category: cs.CL

TL;DR: Proposes DREAM, a method that uses a trained relation extraction model, validated by a large language model, to construct an ontology of drug relationships from medical text.

Details Motivation: Datasets for drug-drug relation extraction are limited, so transfer learning is needed to apply machine learning methods in this domain. Method: A trained relation extraction model is first used to discover relations between entities and is then applied to a corpus of medical texts to construct an ontology of drug relationships; the extracted relations are validated with a large language model. Result: Quantitatively, the LLM agreed with 71 of the relations extracted from a subset of PubMed abstracts; qualitatively, the approach can uncover ambiguities in the medical domain. Conclusion: DREAM can effectively extract drug relations and reveals challenges in medical relation extraction, giving it potential practical value. Abstract: Relation extraction between drugs plays a crucial role in identifying drug-drug interactions and predicting side effects. The advancement of machine learning methods in relation extraction, along with the development of large medical text databases, has enabled the low-cost extraction of such relations compared to other approaches that typically require expert knowledge. However, to the best of our knowledge, there are limited datasets specifically designed for drug-drug relation extraction currently available. Therefore, employing transfer learning becomes necessary to apply machine learning methods in this domain. In this study, we propose DREAM, a method that first employs a trained relation extraction model to discover relations between entities and then applies this model to a corpus of medical texts to construct an ontology of drug relationships. The extracted relations are subsequently validated using a large language model. Quantitative results indicate that the LLM agreed with 71 of the relations extracted from a subset of PubMed abstracts. Furthermore, our qualitative analysis indicates that this approach can uncover ambiguities in the medical domain, highlighting the challenges inherent in relation extraction in this field.

[93] Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports

Alois Thomas,Maya Varma,Jean-Benoit Delbrouck,Curtis P. Langlotz

Main category: cs.CL

TL;DR: Proposes a lightweight, context-aware sentence-level Process Reward Model (PRM) for detecting hallucinations in generated radiology reports; it generalizes well across models and improves the quality of clinical text generation.

Details Motivation: Large Vision-Language Models (LVLMs) tend to produce clinically critical hallucinations when generating radiology reports, and existing detection methods lack sentence-level granularity or the ability to generalize across models. Method: The authors design and fine-tune a 0.5B-parameter sentence-level Process Reward Model (PRM) that predicts the factual correctness of each sentence conditioned on clinical context and preceding text, trained on MIMIC-CXR with weakly-supervised labels. Result: The PRM outperforms existing verification methods on several metrics, for example relative improvements of 7.5% in MCC and 1.8% in AUROC on outputs from one LVLM; it generalizes to an unseen LVLM; filtering out the worst 10% of reports improves F1-CheXbert by 4.5%; and weighted best-of-N selection yields relative improvements of 7.4% in F1-CheXbert and 0.6% in BERTScore. Conclusion: A lightweight, context-aware PRM can serve as a model-agnostic safety layer that needs no access to internal activations, mitigating hallucinations in clinical LVLMs and improving the quality and safety of generated reports. Abstract: Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations.
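
The report-filtering use case can be sketched as below. This is only an illustration under stated assumptions: the abstract does not say how sentence-level PRM scores are combined into a report-level score, so the mean used here is an assumption, and the field names and toy scores are hypothetical.

```python
def report_score(sentence_scores):
    """Aggregate sentence-level scores into one value (mean is an assumption)."""
    return sum(sentence_scores) / len(sentence_scores)


def drop_worst(reports, drop_frac=0.1):
    """Return reports sorted by score, without the lowest `drop_frac` fraction."""
    ranked = sorted(reports, key=lambda r: report_score(r["sentence_scores"]))
    n_drop = int(len(ranked) * drop_frac)
    return ranked[n_drop:]


# Ten toy reports whose two sentence scores rise with the report id.
reports = [
    {"id": i, "sentence_scores": [i / 10, i / 10]}
    for i in range(10)
]
kept = drop_worst(reports)
```

Other aggregations, such as taking the minimum sentence score, would penalize a single hallucinated sentence more heavily; which one suits a clinical setting is a design choice this sketch does not settle.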

[94] Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?

Tawsif Tashwar Dipto,Azmol Hossain,Rubayet Sabbir Faruque,Md. Rezuwan Hassan,Kanij Fatema,Tanmoy Shome,Ruwad Naswan,Md. Foriduzzaman Zihad,Mohaymen Ul Anam,Nazia Tasnim,Hasan Mahmud,Md Kamrul Hasan,Md. Mehedi Hasan Shawon,Farig Sadeque,Tahsin Reasat

Main category: cs.CL

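TL;DR: This paper introduces Ben-10, a 78-hour annotated Bengali speech-to-text corpus for studying how dialectal variation affects automatic speech recognition (ASR) for low-resource languages. The study shows that existing speech foundation models perform poorly on regional-dialect ASR in both zero-shot and fine-tuned settings. Deep learning methods generally struggle with dialectal variation, but training dialect-specific models alleviates the problem. The dataset can also serve as an out-of-distribution (OOD) resource for ASR under constrained resources.

Details Motivation: To study how dialectal variation affects ASR performance for low-resource languages and to explore the limitations of existing speech foundation models on regional dialects. Method: The authors build Ben-10, a 78-hour annotated Bengali speech-to-text corpus, analyze the effect of dialectal variation on ASR from both linguistic and data-driven perspectives, and evaluate several deep learning methods in zero-shot and fine-tuned settings. Result: Current speech foundation models degrade severely on regional-dialect ASR whether or not they are fine-tuned; all deep learning methods struggle to model dialectal variation, but training on dialect-specific data alleviates the issue. Ben-10 can serve as an out-of-distribution test resource for low-resource ASR research. Conclusion: Dialectal variation poses a major challenge for ASR systems that general or fine-tuned models handle poorly; training dialect-specific models is a more effective solution, and Ben-10 is a valuable public resource for low-resource dialect ASR research. Abstract: Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available.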

[95] Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding

Mohammed Aljafari,Ismail Alturki,Ahmed Mori,Yehya Kadumi

Main category: cs.CL

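TL;DR: Mubeen is a proprietary Arabic large language model focused on understanding the Arabic language, Islamic studies, and cultural heritage. It uses native Arabic data and a distinctive Practical Closure Architecture to address the shortcomings of conventional models in intent recognition and practical usefulness, in support of Saudi Vision 2030.

Details Motivation: Existing Arabic models mostly rely on data translated from English and handle user intent, cultural context, and practical needs poorly, producing answers that are accurate but not useful, a situation the authors call the "Utility Gap Crisis". Method: The model is trained on a large collection of authentic Arabic sources (historical manuscripts, linguistics, jurisprudence, hadith, Quranic exegesis, academic papers, and more), with the dataset expanded using a proprietary Arabic OCR engine; a deep linguistic engineering framework and the Practical Closure Architecture improve its understanding of classical texts, contemporary writing, and dialects as well as the decisiveness of its responses. Result: Mubeen is reported to perform well at understanding the eloquence of Arabic, user intent, and contextual relevance, delivering robust performance across cultural preservation and general knowledge domains, reducing repeated re-prompting, and giving clear, actionable responses. Conclusion: Through native Arabic data and an innovative architecture, Mubeen shifts from information provider to decision guide, moving Arabic AI toward cultural authenticity and practical utility in line with Saudi Vision 2030. Abstract: Mubeen is a proprietary Arabic language model developed by MASARAT SA, optimized for deep understanding of Arabic linguistics, Islamic studies, and cultural heritage. Trained on an extensive collection of authentic Arabic sources significantly expanded by digitizing historical manuscripts via a proprietary Arabic OCR engine, the model incorporates seminal scholarly works in linguistics, jurisprudence, hadith, and Quranic exegesis, alongside thousands of academic theses and peer-reviewed research papers. Conditioned through a deep linguistic engineering framework, Mubeen masters not just the meaning but the eloquence of Arabic, enabling precise understanding across classical texts, contemporary writing, and regional dialects with focus on comprehending user intent and delivering accurate, contextually relevant responses. Unlike other Arabic models relying on translated English data that often fail in intent detection or retrieval-augmented generation (RAG), Mubeen uses native Arabic sources to ensure cultural authenticity and accuracy. Its core innovation is the Practical Closure Architecture, designed to solve the "Utility Gap Crisis" where factually correct answers fail to resolve users' core needs, forcing them into frustrating cycles of re-prompting. By prioritizing clarity and decisive guidance, Mubeen transforms from an information repository into a decisive guide, aligning with Saudi Vision 2030.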
The model's architecture combines deep heritage specialization with multi-disciplinary expert modules, enabling robust performance across both cultural preservation and general knowledge domains.

[96] Code Aesthetics with Agentic Reward Feedback

Bang Xiao,Lingjie Jiang,Shaohan Huang,Tengchao Lv,Yupan Huang,Xun Wu,Lei Cui,Furu Wei

Main category: cs.CL

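TL;DR: This paper proposes a new approach to improving the aesthetic quality of LLM-generated code: it builds the large-scale dataset AesCode-358K, designs a multi-agent reward feedback mechanism and the optimization algorithm GRPO-AR, and develops the OpenDesign benchmark, significantly improving both the aesthetics and the functionality of generated code.

Details Motivation: Large language models perform well on traditional programming tasks but produce poor aesthetics on visually oriented coding tasks, so the aesthetic quality of generated code needs to improve. Method: The authors build the AesCode-358K dataset, propose a multi-agent reward feedback mechanism (evaluating executability, static aesthetics, and interactive aesthetics), integrate these signals into the GRPO algorithm for joint optimization of functionality and aesthetics, and develop OpenDesign as an evaluation benchmark. Result: Combining supervised fine-tuning with reinforcement learning significantly improves performance on OpenDesign and on existing benchmarks such as PandasPlotBench; AesCoder-4B surpasses GPT-4o and GPT-4.1 and is comparable to large open-source models with 480B-685B parameters. Conclusion: The proposed approach effectively improves the aesthetic quality of LLM-generated code while maintaining or even improving functionality, pointing to a new direction for code generation models. Abstract: Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.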

[97] A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results

Thai-Binh Nguyen,Katerina Zmolikova,Pingchuan Ma,Ngoc Quan Pham,Christian Fuegen,Alexander Waibel

Main category: cs.CL

TL;DR: This paper introduces the Multi-Modal Context-Aware Recognition (MCoRec) task of the ninth CHiME Challenge, which tackles the cocktail-party problem of overlapping conversations in a single room using audio, visual, and contextual cues.

Details Motivation: To handle the extreme speech overlap found in real, informal multi-party conversations and to improve speech recognition in complex environments. Method: Systems combine audio, visual, and contextual cues to transcribe each speaker's speech and cluster speakers into their respective conversations. Result: Audio-only baseline systems exceed 100% word error rate, while adding visual information yields a 50% improvement. Conclusion: Multimodal fusion is essential for recognizing highly overlapped, natural multi-speaker conversations. Abstract: We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for the MCoRec.
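
A word error rate above 100% may look surprising, but it follows from the standard definition: substitutions, deletions, and insertions are all counted against the number of reference words. The sketch below is a generic word-level edit-distance implementation, not the challenge's scoring tool, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between the processed ref prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Words picked up from an overlapping speaker count as insertions.
wer = word_error_rate("hello there", "hi hello there my friend")
```

Here three inserted words against a two-word reference give a rate of 1.5, which is how a system that transcribes overlapping speakers into the wrong stream can exceed 100%.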

[98] DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model

Yuanzhen Xie,Liu Ye,Jiqun Chu,Mochi Gao,Hehuan Liu,Yunzhi Tan,Bo Hu,Zang Li

Main category: cs.CL

TL;DR: This paper proposes a fully automated data-centric pipeline for text-to-SQL, comprising adaptive data repair and error data augmentation, combined with multi-model collaborative training and an ensemble strategy. It significantly improves the performance of lightweight models and ranks first among models within 70B parameters.

Details Motivation: Although agent-based frameworks are widely used for text-to-SQL, the impact of data-centric strategies has rarely been explored; the paper systematically studies how data-centric methods can improve the task. Method: The authors design a data-centric pipeline with adaptive data repair and error data augmentation, train multiple models collaboratively with each model using different augmented data, and use an ensemble strategy that combines the models' capabilities by solving a multiple-choice question. Result: Experiments and ablation studies show that the data-centric pipeline and the multi-model interactive iterative strategy improve text-to-SQL accuracy, achieving first place among lightweight models (within 70B). Conclusion: Data-centric strategies combined with multi-model collaborative training can significantly improve text-to-SQL performance, especially for lightweight models, and provide a useful template for future work. Abstract: Text-to-SQL tasks have seen notable improvements since the release of ChatGPT. Among them, agent-based frameworks have been widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systematically design a fully automated data-centric pipeline for text-to-SQL tasks, including adaptive data repair, which can automatically find and fix errors in the training dataset; and error data augmentation, where we specifically diffuse and enhance erroneous data predicted by the initially trained models. Meanwhile, we propose a Multi-Model collaboration training schema, aiming to train multiple models with different augmented data, enabling them to possess distinct capabilities and work together to complement each other, because it has been found that the capability of a single fine-tuned model is very limited. Furthermore, we utilize an ensemble strategy to integrate the capabilities of multiple models to solve a multiple-choice question, aiming to further improve the accuracy of text-to-SQL tasks. The experimental results and ablation study have demonstrated the effectiveness of the data-centric pipeline and Multi-Model (MM) interactive iterative strategies, achieving first place in lightweight text-to-SQL models (within 70B).

[99] Arabic Little STT: Arabic Children Speech Recognition Dataset

Mouhand Alkadri,Dania Desouki,Khloud Al Jallad

Main category: cs.CL

TL;DR: This paper introduces Arabic Little STT, a dataset of Arabic child speech, and evaluates Whisper models on it. Existing ASR models perform poorly on child speech, underscoring the need for dedicated child speech benchmarks and inclusive training data.

Details Motivation: Low-resource languages such as Arabic lack child-specific speech corpora, so speech recognition systems perform poorly for children; this gap needs to be filled. Method: The authors build Arabic Little STT, a dataset of 355 Levantine Arabic utterances from 288 children aged 6 to 13, and systematically evaluate the word error rate (WER) of eight Whisper variants on it, comparing against adult Arabic benchmarks. Result: Even the best-performing model, Whisper Large_v3, reaches a WER of 0.66 on child speech, far above its sub-0.20 WER on adult data, showing that current ASR models face significant challenges on child speech. Conclusion: Dedicated child speech benchmarks and more inclusive training data are needed, governed by strict ethical and privacy frameworks that protect sensitive child information, to advance equitable speech technologies for Arabic-speaking children. Abstract: The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6-13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, starkly contrasting with its sub-0.20 WER on adult datasets. These results align with other research on English speech. Results highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development. We emphasize that such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step for future work on equitable speech technologies for Arabic-speaking children. We hope that our publicly available dataset enriches the representation of children in ASR datasets.

[100] Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models

Mohammad Atif Quamar,Mohammad Areeb,Nishant Sharma,Ananth Shreekumar,Jonathan Rosenthal,Muslum Ozgur Ozmen,Mikhail Kuznetsov,Z. Berkay Celik

Main category: cs.CL

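TL;DR: Proposes AdaSearch, a new inference-time alignment method that adaptively allocates compute to improve the generation of critical initial tokens, significantly improving LLM performance across multiple tasks.

Details Motivation: Existing inference-time alignment methods spread compute uniformly, paying too little attention to the critical initial tokens and yielding suboptimal alignment. Method: AdaSearch uses a blockwise search strategy with a sampling schedule that adaptively concentrates the compute budget on the critical initial tokens of a response; it is applied to sequential decoding and extended to tree search (AdaBeam). Result: Experiments on eight large language models show that AdaSearch improves win rates by over 10% relative to Best-of-N on harmlessness generation, controlled sentiment generation, and mathematical reasoning, outperforming strong baselines. Conclusion: By dynamically allocating compute to the critical stage of generation, AdaSearch offers an efficient and general solution for inference-time LLM alignment. Abstract: LLM alignment remains a critical challenge. Inference-time methods provide a flexible alternative to fine-tuning, but their uniform computational effort often yields suboptimal alignment. We hypothesize that for many alignment tasks, the initial tokens of a response are disproportionately more critical. To leverage this principle, we introduce AdaSearch, a novel blockwise search strategy. It adaptively allocates a fixed computational budget using a sampling schedule, focusing search effort on these critical tokens. We apply AdaSearch to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our comprehensive evaluation across eight LLMs demonstrates that AdaSearch outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates improve by over 10% for harmlessness generation, controlled sentiment generation, and for mathematical reasoning tasks relative to Best-of-N.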
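
The idea of spending a fixed sample budget unevenly across blocks can be sketched as follows. This is an illustration under stated assumptions rather than the AdaSearch algorithm: the geometric decay schedule, the `propose_block` generator, and the `reward` function are all hypothetical stand-ins, and a real system would call an LLM and a reward model instead of the toy stubs used here.

```python
from itertools import cycle


def sampling_schedule(total_budget, num_blocks, decay=0.5):
    """Split a fixed sample budget across blocks, favoring early blocks."""
    assert total_budget >= num_blocks, "need at least one sample per block"
    weights = [decay ** i for i in range(num_blocks)]
    counts = [max(1, int(total_budget * w / sum(weights))) for w in weights]
    counts[0] += total_budget - sum(counts)  # rounding remainder goes to the first block
    return counts


def blockwise_search(prompt, propose_block, reward, schedule):
    """Greedy blockwise search: sample candidates per block, keep the best one."""
    text = prompt
    for num_samples in schedule:
        candidates = [text + propose_block(text) for _ in range(num_samples)]
        text = max(candidates, key=reward)
    return text


# Toy stubs: the "generator" alternates between two blocks, the "reward" counts b's.
toy_blocks = cycle(["a", "b"])
schedule = sampling_schedule(total_budget=16, num_blocks=4)
result = blockwise_search(
    "", lambda _text: next(toy_blocks), lambda text: text.count("b"), schedule
)
```

With uniform allocation each block would receive four samples; the decaying schedule instead gives the first block more than half of the budget, reflecting the hypothesis that early tokens matter most.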

[101] BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning

Siyuan Zheng,Pai Liu,Xi Chen,Jizheng Dong,Sihan Jia

Main category: cs.CL

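TL;DR: Presents the first BaZi-based QA dataset and the BaZi-LLM system, which combines symbolic reasoning with large language models to generate dynamic virtual personas and achieves substantially higher accuracy than mainstream LLMs.

Details Motivation: Existing methods for generating virtual personas rely on annotated data or handcrafted prompts, which makes them hard to scale and weak on contextual coherence; a more scalable and culturally grounded solution is needed. Method: The authors build a QA dataset of life events based on BaZi and combine symbolic reasoning with large language models to generate temporally dynamic, fine-grained virtual personas. Result: Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, accuracy improves by 30.3%-62.6%; when incorrect BaZi information is supplied, accuracy drops by 20%-45%, supporting the value of integrating the cultural symbolic system. Conclusion: Combining culturally grounded symbolic systems with LLMs can improve the realism and contextual consistency of virtual personas, offering a new paradigm for character generation. Abstract: Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi-LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%-45%, showing the potential of culturally grounded symbolic-LLM integration for realistic character simulation.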

[102] LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data

Teng Lin

Main category: cs.CL

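TL;DR: Proposes LightKGG, a framework that uses small-scale language models to extract knowledge graphs from text efficiently; two technical innovations, context-integrated graph extraction and topology-enhanced relationship inference, enable accurate and practical knowledge graph construction in low-resource settings.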

Details Motivation: Existing knowledge graph extraction methods rely on error-prone pattern matching or on compute-hungry large language models, which limits their use in low-resource environments; a more efficient, lightweight solution is needed. Method: Proposes the LightKGG framework with two key techniques: (1) context-integrated graph extraction, which merges contextual information with nodes and edges into a unified graph structure; (2) topology-enhanced relationship inference, which uses the graph's inherent topology to discover relationships. Result: LightKGG builds accurate knowledge graphs with low hardware requirements, substantially reducing the dependence on large language models while retaining key information. Conclusion: The work narrows the gap between automated knowledge extraction and practical deployment and provides a rigorous method for using small-scale language models efficiently in structured NLP tasks. Abstract: The scarcity of high-quality knowledge graphs (KGs) remains a critical bottleneck for downstream AI applications, as existing extraction methods rely heavily on error-prone pattern-matching techniques or resource-intensive large language models (LLMs). While recent tools leverage LLMs to generate KGs, their computational demands limit accessibility for low-resource environments. Our paper introduces LightKGG, a novel framework that enables efficient KG extraction from textual data using small-scale language models (SLMs) through two key technical innovations: (1) Context-integrated Graph extraction integrates contextual information with nodes and edges into a unified graph structure, reducing the reliance on complex semantic processing while retaining more key information; (2) Topology-enhanced relationship inference leverages the inherent topology of the extracted graph to efficiently infer relationships, enabling relationship discovery without relying on complex language understanding capabilities of LLMs. By enabling accurate KG construction with minimal hardware requirements, this work bridges the gap between automated knowledge extraction and practical deployment scenarios while introducing scientifically rigorous methods for optimizing SLM efficiency in structured NLP tasks.

[103] How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes

Sheri Osborn,Rohit Valecha,H. Raghav Rao,Dan Sass,Anthony Rios

Main category: cs.CL

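TL;DR: Proposes a benchmark for evaluating how well large language models (LLMs) can forecast AI's impact on employment. It combines high-frequency U.S. job-posting data with global projections of occupational change, defines forecasting tasks with clear temporal splits, and compares several prompting strategies. Structured task prompts improve forecast stability, while persona prompts do better on short-term trends; the results underline the need for domain-aware prompt design and rigorous evaluation.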

Details Motivation: There are no systematic tools for forecasting AI's effects on the labor market, and existing research has not adequately assessed LLMs on forward-looking employment prediction. Method: Builds a benchmark from a U.S. sector-level job-postings index and a global dataset of projected AI-driven occupational changes, designs forecasting tasks with temporal splits, and evaluates different LLMs under task-scaffolded, persona-driven, and hybrid prompting strategies. Result: Structured task prompts improve forecast stability and persona prompts have an edge on short-term trends, but overall performance varies by sector and forecast horizon, showing the importance of domain-aware prompting and rigorous evaluation. Conclusion: The benchmark supports research on LLM-based labor market forecasting and provides a reproducible testbed for prompt design and AI economic reasoning. Abstract: Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning.
This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.

[104] Detecting Religious Language in Climate Discourse

Evy Beijen,Pien Pieterse,Yusuf Çelik,Willem Th. van Peursen,Sandjai Bhulai,Meike Morren

Main category: cs.CL

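TL;DR: Examines explicit and implicit religious language in climate change texts from secular and religious NGOs, and compares a rule-based model with large language models (in a zero-shot setting) at detecting it.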

Details Motivation: Religious language persists in contemporary secular domains such as environmental activism, but identifying and defining it computationally remains a methodological challenge. Method: Proposes a dual approach: a rule-based model built on a hierarchical tree of religious terms derived from ecotheology literature, and large language models in a zero-shot setting. The dataset contains more than 880,000 sentences. Result: The rule-based method labels more sentences as religious than the LLMs do, and the two approaches diverge substantially, reflecting the tension between vocabulary and context in defining religious language. Conclusion: The study shows both the limits and the potential of computational detection of religious language, argues that vocabulary and context must both be considered, and advances digital methods in religious studies. Abstract: Religious language continues to permeate contemporary discourse, even in ostensibly secular domains such as environmental activism and climate change debates. This paper investigates how explicit and implicit forms of religious language appear in climate-related texts produced by secular and religious nongovernmental organizations (NGOs). We introduce a dual methodological approach: a rule-based model using a hierarchical tree of religious terms derived from ecotheology literature, and large language models (LLMs) operating in a zero-shot setting. Using a dataset of more than 880,000 sentences, we compare how these methods detect religious language and analyze points of agreement and divergence. The results show that the rule-based method consistently labels more sentences as religious than LLMs. These findings highlight not only the methodological challenges of computationally detecting religious language but also the broader tension over whether religious language should be defined by vocabulary alone or by contextual meaning. This study contributes to digital methods in religious studies by demonstrating both the potential and the limitations of approaches for analyzing how the sacred persists in climate discourse.

[105] EMTSF: Extraordinary Mixture of SOTA Models for Time Series Forecasting

Musleh Alharthi,Kaleel Mahmood,Sarosh Patel,Ausif Mahmood

Main category: cs.CL

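TL;DR: Proposes a Transformer-based Mixture of Experts (MoE) framework that combines state-of-the-art time series forecasting models, including xLSTM, enhanced Linear, PatchTST, and minGRU, and outperforms existing methods on standard benchmarks.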

Details Motivation: Recent work has questioned the effectiveness of Transformers and large language models for time series forecasting; in addition, time series data favors the recent past and is subject to unpredictable events, so stronger and more robust models are needed. Method: Builds a Transformer-based MoE gating network that integrates several complementary SOTA forecasting models, including xLSTM, enhanced Linear, PatchTST, and minGRU. Result: The proposed model outperforms existing time series forecasting models on multiple standard benchmarks, including the latest MoE-based approaches. Conclusion: By fusing diverse and complementary models, the proposed MoE framework substantially improves forecasting performance and points to a new SOTA direction. Abstract: The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Series Forecasting (TSF), where superior performance has been shown. However, a recent important paper questioned their effectiveness by demonstrating that a simple single-layer linear model outperforms Transformer-based models. This conclusion was soon challenged by a better Transformer-based model termed PatchTST. More recently, TimeLLM demonstrated even better results by repurposing a Large Language Model (LLM) for the TSF domain. Again, a follow-up paper challenged this by demonstrating that removing the LLM component or replacing it with a basic attention layer in fact yields better performance. One of the challenges in forecasting is the fact that TSF data favors the more recent past, and is sometimes subject to unpredictable events. Based upon these recent insights in TSF, we propose a strong Mixture of Experts (MoE) framework. Our method combines state-of-the-art (SOTA) models including xLSTM, enhanced Linear, PatchTST, and minGRU, among others. This set of complementary and diverse models for TSF is integrated in a Transformer-based MoE gating network. Our proposed model outperforms all existing TSF models on standard benchmarks, surpassing even the latest approaches based on MoE frameworks.
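To make the mixture-of-experts idea concrete, the sketch below blends per-expert forecasts using softmax gate weights. It is a minimal illustration only: in the paper the gate is a Transformer-based network, whereas here the gate logits are simply passed in as numbers, and the function names `softmax` and `combine_forecasts` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_forecasts(expert_forecasts, gate_logits):
    """Blend expert forecasts with softmax gate weights.

    expert_forecasts: one list of horizon values per expert
                      (e.g. stand-ins for xLSTM, PatchTST, ...).
    gate_logits:      one score per expert; assumed to come from a gating
                      network, which is not modelled in this sketch.
    """
    weights = softmax(gate_logits)
    horizon = len(expert_forecasts[0])
    return [
        sum(w * forecast[t] for w, forecast in zip(weights, expert_forecasts))
        for t in range(horizon)
    ]


if __name__ == "__main__":
    forecasts = [[1.0, 2.0], [3.0, 4.0]]
    print(combine_forecasts(forecasts, [0.0, 0.0]))
```

With equal gate logits the result is the plain average of the experts; a trained gate would instead shift weight toward whichever expert suits the current input window.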

[106] Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

Zhuoran Jin,Hongbang Yuan,Kejian Zhu,Jiachun Li,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao

Main category: cs.CL

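TL;DR: Proposes Omni-Reward to address reward models' limited modality support and weak modeling of personalized preferences. It comprises a new benchmark (Omni-RewardBench), a dataset (Omni-RewardData), and Omni-RewardModel, which offers both discriminative and generative reward modeling.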

Details Motivation: Existing reward models are largely limited to text and image modalities, and their rigid preference training struggles to capture complex, diverse personalized preferences, so a more general multimodal reward model is needed. Method: Proposes Omni-Reward, with three parts: (1) Omni-RewardBench, the first omni-modal benchmark supporting free-form preferences; (2) Omni-RewardData, a multimodal dataset with 248K preference pairs and 69K instruction-tuning pairs; (3) Omni-RewardModel, which supports discriminative and generative reward modeling. Result: Omni-RewardModel performs strongly on Omni-RewardBench and on other widely used reward modeling benchmarks, supporting its effectiveness for multimodal and free-form preference modeling. Conclusion: Omni-Reward advances generalist multimodal reward modeling and offers an effective approach to free-form preferences and multimodal alignment. Abstract: Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.

[107] BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents

Litu Ou,Kuan Li,Huifeng Yin,Liwen Zhang,Zhongwang Zhang,Xixi Wu,Rui Ye,Zile Qiao,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

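TL;DR: Studies whether LLM-based search agents can express their own confidence through verbalized confidence scores in multi-turn interactions, and proposes confidence-guided test-time scaling that lowers compute cost while keeping performance competitive.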

Details Motivation: Existing research focuses mainly on model confidence in single-turn scenarios; confidence in complex multi-turn interactions has received little study. Method: Runs experiments on open-source agentic models to measure task accuracy at different confidence levels, and proposes confidence-based Test-Time Scaling (TTS) methods that dynamically adjust the inference process. Result: Task accuracy is much higher at high confidence and close to zero at low confidence; the proposed TTS methods significantly reduce token consumption while remaining competitive with fixed-budget baselines. Conclusion: LLM-based agents can express their confidence in multi-turn interactions, and that confidence can be used to design more efficient, adaptive inference strategies. Abstract: Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encouraging the model to try again until it reaches a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed-budget TTS methods.
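The retry-until-confident loop described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' implementation: `run_agent` is a hypothetical callable returning an `(answer, confidence)` pair with confidence in [0, 1], and the `threshold` and `max_attempts` defaults are made up for the example.

```python
def answer_with_confidence(run_agent, question, threshold=0.8, max_attempts=4):
    """Re-run an agent until its verbalized confidence reaches a threshold.

    run_agent(question, attempt) -> (answer, confidence) is a hypothetical
    interface; the real agent, prompts, and threshold are not specified in
    the abstract. Returns the most confident answer seen, its confidence,
    and the number of attempts actually used.
    """
    best_answer, best_confidence = None, -1.0
    attempts_used = 0
    for attempt in range(max_attempts):
        answer, confidence = run_agent(question, attempt)
        attempts_used += 1
        if confidence > best_confidence:
            best_answer, best_confidence = answer, confidence
        if confidence >= threshold:
            break  # confident enough: stop early and save tokens
    return best_answer, best_confidence, attempts_used


if __name__ == "__main__":
    # Scripted stand-in for a real search agent.
    scripted = [("a", 0.2), ("b", 0.9), ("c", 0.95)]
    print(answer_with_confidence(lambda q, i: scripted[i], "example question"))
```

Compared with a fixed budget that always runs `max_attempts` rollouts, the early exit is where the token savings would come from: easy questions stop after one confident attempt.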

[108] Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts

Nikesh Gyawali,Doina Caragea,Alex Vasenkov,Cornelia Caragea

Main category: cs.CL

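TL;DR: Presents a sentence-level stance detection corpus targeting financial metrics (debt, earnings per share, and sales), labeled with the ChatGPT-o3-pro model and validated by humans. LLMs are evaluated with zero-shot, few-shot, and Chain-of-Thought prompting; few-shot combined with CoT works best, showing that target-specific stance analysis in finance is feasible without large labeled datasets.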

Details Motivation: SEC filings and earnings call transcripts are long, jargon-heavy, and linguistically nuanced, and traditional sentiment analysis depends on large labeled datasets, which makes fine-grained sentence-level stance analysis difficult; a method that does not need extensive labeled data is required. Method: Extracts sentences from Form 10-K annual reports and ECTs to build a stance detection corpus for three core financial metrics, labeled with the ChatGPT-o3-pro model under rigorous human validation; modern LLMs are then systematically evaluated with zero-shot, few-shot, and Chain-of-Thought prompting. Result: Few-shot prompting combined with CoT performs best and beats supervised baselines; LLM performance differs between the SEC and ECT datasets. Conclusion: LLMs are a practical option for target-specific stance analysis in the financial domain without extensive labeled data. Abstract: Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain required a large, expensive labeled dataset, making sentence-level stance detection towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot with CoT prompting performs best compared to supervised baselines, and LLMs' performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance in the financial domain without requiring extensive labeled data.

[109] MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

Tengchao Yang,Sichen Guo,Mengzhao Jia,Jiaming Su,Yuanyang Liu,Zhihan Zhang,Meng Jiang

Main category: cs.CL

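TL;DR: Introduces MMTutorBench, the first benchmark for AI math tutoring. It contains 685 problems built around pedagogically significant key steps and uses fine-grained rubrics to evaluate multimodal large language models on three tasks.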

Details Motivation: Existing benchmarks do not adequately evaluate the diagnostic and guiding skills that math tutoring requires, and they lack fine-grained assessment of pedagogically key steps. Method: Builds MMTutorBench with 685 problems, each paired with a problem-specific rubric covering six evaluation dimensions, and defines three tasks: Insight Discovery, Operation Formulation, and Operation Execution. Twelve leading MLLMs are evaluated, with a rubric-based LLM-as-a-Judge whose reliability is verified. Result: Proprietary models outperform open-source ones, but all remain behind human tutors; OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and the LLM-as-a-Judge evaluation is highly reliable. Conclusion: MMTutorBench is both challenging and diagnostically useful, providing an effective evaluation tool for advancing AI math tutoring. Abstract: Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room for improvement compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.

[110] M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset

Jiahui Geng,Jonathan Tonglet,Iryna Gurevych

Main category: cs.CL

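TL;DR: Introduces M4FC, a new multimodal fact-checking dataset with 4,982 images and 6,980 claims covering ten languages and six tasks. It addresses several limitations of existing datasets and provides baseline results for every task.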

Details Motivation: Existing multimodal automated fact-checking datasets have few instances, cover only one or two languages and tasks, suffer from evidence leakage, or depend on external news sources; a more comprehensive, realistic, and diverse dataset is needed. Method: Builds M4FC, a dataset of real images verified by professional fact-checkers from 22 organizations and paired with multilingual claims, spanning six multimodal fact-checking tasks; baseline results are provided for all tasks, along with an analysis of how intermediate tasks affect final verdict prediction. Result: M4FC contains 4,982 images and 6,980 claims in ten languages across six tasks; experiments indicate that incorporating intermediate tasks can improve final verdict prediction. Conclusion: M4FC is a diverse, realistic, and challenging multimodal fact-checking dataset that supports cross-lingual and multitask research in the field. Abstract: Existing real-world datasets for multimodal automated fact-checking have multiple limitations: they contain few instances, focus on only one or two languages and tasks, suffer from evidence leakage, or depend on external sets of news articles for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent diverse cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks influences downstream verdict prediction performance. We make our dataset and code available.

[111] IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering

Jieyong Kim,Maryam Amirizaniani,Soojin Yoon,Dongha Lee

Main category: cs.CL

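TL;DR: Introduces the concept of core intent identification and builds the IPQA benchmark to evaluate it in personalized question answering, finding that existing models perform poorly on the task.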

Details Motivation: Existing benchmarks do not directly measure intent identification, yet understanding which intents users prioritize is essential for generating answers that meet individual information needs. Method: Grounded in satisficing theory, core intents are derived from users' observable answer-selection behavior; the dataset is built through systematic filtering, LLM-based annotation, and quality control that combines automated verification with human validation. Result: Experiments show that state-of-the-art language models perform poorly on core intent identification in personalized contexts, and performance degrades as question complexity increases. Conclusion: Core intent identification is a key challenge in personalized question answering, and the IPQA benchmark is an important resource for future research. Abstract: Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because without understanding which intents users prioritize, systems cannot generate responses satisfying individual information needs. To address this, we introduce the concept of core intents: intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory where users choose answers meeting their acceptance thresholds. We construct a dataset with various domains through systematic filtering, LLM-based annotation, and rigorous quality control combining automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. Models fail to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.

[112] LimRank: Less is More for Reasoning-Intensive Information Reranking

Tingyu Song,Yilun Zhao,Siyue Zhang,Chen Zhao,Arman Cohan

Main category: cs.CL

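TL;DR: Proposes fine-tuning LLMs for information reranking with a small amount of high-quality supervision. A purpose-built synthetic data pipeline, LIMRANK-SYNTHESIZER, is used to train LIMRANK, an efficient reranker that generalizes well and reaches competitive performance with very little data.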

Details Motivation: Existing approaches rely on large-scale fine-tuning to adapt LLMs to information reranking, which is computationally expensive, so a more efficient, lower-cost adaptation method is needed. Method: Designs LIMRANK-SYNTHESIZER, a reusable open-source pipeline that generates diverse, challenging, and realistic reranking examples, and fine-tunes the LIMRANK model on this synthetic data. Result: On the challenging BRIGHT and FollowIR benchmarks, LIMRANK reaches competitive performance with less than 5% of the data used in prior work; ablation studies support the effectiveness of the synthetic data pipeline and the model's strong generalization. Conclusion: Modern LLMs can be adapted to information reranking with very little high-quality synthetic data, and LIMRANK and its data pipeline offer a workable approach to fine-tuning in low-resource settings. Abstract: Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.

[113] Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models

Luis Ramos,Hiram Calvo,Olga Kolesnikova

Main category: cs.CL

TL;DR: Compares traditional machine learning models with fine-tuned Transformer models on hope speech detection; the Transformer models perform better on a small dataset.

Details Motivation: Hope speech detection matters for identifying motivational expressions on social media, but existing methods are limited in capturing subtle semantics. Method: Evaluates traditional machine learning models (such as SVM, logistic regression, and Naïve Bayes) and fine-tuned Transformer models on a hope speech dataset with predefined splits. Result: On the development set, the traditional models reach a macro-F1 of up to 0.78; the best Transformer model reaches a weighted F1 of 0.79, a macro F1 of 0.79, and an accuracy of 0.80. Conclusion: Traditional models perform well, but Transformer models capture the subtle semantics of hope speech better, suggesting that larger Transformers and LLMs still have potential on small datasets. Abstract: The identification of hope speech has become a promising NLP task, considering the need to detect motivational expressions of agency and goal-directed behaviour on social media platforms. This work evaluates traditional machine learning models and fine-tuned transformers on a hope speech dataset previously split into train, development, and test sets. On the development set, a linear-kernel SVM and logistic regression both reached a macro-F1 of 0.78; SVM with RBF kernel reached 0.77, and Naïve Bayes hit 0.75. Transformer models delivered better results: the best model achieved a weighted precision of 0.82, weighted recall of 0.80, weighted F1 of 0.79, macro F1 of 0.79, and 0.80 accuracy. These results suggest that while optimally configured traditional machine learning models remain agile, transformer architectures detect some subtle semantics of hope to achieve higher precision and recall in hope speech detection, suggesting that larger transformers and LLMs could perform better on small datasets.
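Because the abstract reports both macro and weighted F1, a short sketch of how the two averages differ may help when reading the numbers. The function name `f1_report` and the toy labels are assumptions for illustration; they are not taken from the paper or its dataset.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Return (macro_f1, weighted_f1) computed from per-class F1 scores.

    Macro F1 averages classes equally; weighted F1 weights each class by
    its support (number of true instances), so frequent classes dominate.
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class[label] = 2 * precision * recall / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


if __name__ == "__main__":
    # Toy labels: 1 = hope speech, 0 = not hope speech.
    print(f1_report([1, 1, 1, 0], [1, 1, 0, 0]))
```

When the two values are close, as in the reported 0.79 and 0.79, per-class performance is probably fairly balanced; a large gap would point to a minority class being handled much worse.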

[114] Think Twice: Branch-and-Rethink Reasoning Reward Model

Yizhu Jiao,Jiaqi Zeng,Julien Veron Vialard,Oleksii Kuchaiev,Jiawei Han,Olivier Delalleau

Main category: cs.CL

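TL;DR: Proposes the branch-and-rethink reward model (BR-RM), which brings the "think twice" mechanism into reward modeling. Two turns of reasoning reduce judgment diffusion and improve sensitivity to subtle errors, giving state-of-the-art results on several benchmarks.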

Details Motivation: Existing reward models usually compress many quality dimensions into a single scalar score, which leads to judgment diffusion and shallow analysis. Inspired by the think-twice strategies of large models, the authors bring step-by-step reasoning into reward modeling to improve evaluation quality. Method: Proposes BR-RM. Turn 1 adaptively selects the critical evaluation dimensions and generates hypotheses; turn 2 performs a targeted, branch-conditioned reread and verification. Training uses GRPO-style reinforcement learning with a binary reward and format constraints, and is compatible with standard RLHF pipelines. Result: Achieves state-of-the-art performance on three reward modeling benchmarks across domains, with markedly better detection of consequential errors while remaining scalable and practical. Conclusion: BR-RM applies the think-twice mechanism to reward modeling, mitigates judgment diffusion, and delivers deeper, more focused evaluation, offering a new paradigm for high-quality alignment. Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.

cs.CV [Back]

[115] Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

Alexa R. Tartaglini,Satchel Grant,Daniel Wurgaft,Christopher Potts,Judith E. Fan

Main category: cs.CV

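TL;DR: Introduces the FUGU task suite to diagnose why vision-language models (VLMs) fail at understanding data visualizations. Errors stem mainly from the transfer of information between the vision and language modules, and the models show architectural limitations.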

Details Motivation: Current vision-language models perform poorly on data visualization understanding, but the causes of failure are unclear; this work aims to pinpoint where the problems arise. Method: Develops the FUGU task suite and uses activation patching and linear probes to analyze information flow in three widely used VLMs under different prompting strategies. Result: Some models fail to generate data point coordinates correctly, and these initial errors lead to wrong final answers; the correct coordinates can be read out from the vision encoder, indicating that the problem lies in the vision-language handoff; supplying correct coordinates helps on tasks with few data points but worsens performance on statistical tasks over many points; fine-tuning also fails to reach ceiling performance. Conclusion: Current VLMs have architectural limitations for data visualization understanding, particularly in vision-language information transfer and complex statistical reasoning, and need targeted improvements. Abstract: Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff.
We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points. Fine-tuning models on FUGU also fails to yield ceiling performance. These findings point to architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.

[116] Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries

Mihir Gupta,Pratik Desai,Ross Greer

Main category: cs.CV

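TL;DR: Proposes a low-cost self-consistency framework that uses semantic clustering and a consensus mechanism to make vision-language models more reliable at agricultural image captioning. Combined with human feedback, it clearly outperforms standard decoding on the PlantVillage dataset.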

Details Motivation: In developing countries such as India, Kenya, and Nigeria, a shortage of plant pathology experts, unreliable internet connectivity, and cost constraints make it hard to deploy large-scale AI systems for crop disease management. Method: Uses a lightweight (80MB) pre-trained embedding model for semantic clustering to group multiple candidate outputs, then selects, via cosine similarity, the most coherent caption containing a diagnosis, symptoms, analysis, treatment, and prevention recommendations; a human-feedback step in which the user confirms the crop type filters out erroneous generations. Result: Tested on the PlantVillage dataset with a fine-tuned 3B-parameter PaliGemma model, the single-cluster consensus method reaches 83.1% accuracy with 10 candidate generations, versus 77.5% for greedy decoding; when the top four clusters are considered, accuracy rises to 94.0%, versus 88.5% for the baseline. Conclusion: The framework improves the accuracy and reliability of agricultural image captioning in resource-constrained settings and shows practical potential. Abstract: Agricultural disease management in developing countries such as India, Kenya, and Nigeria faces significant challenges due to limited access to expert plant pathologists, unreliable internet connectivity, and cost constraints that hinder the deployment of large-scale AI systems. This work introduces a cost-effective self-consistency framework to improve vision-language model (VLM) reliability for agricultural image captioning. The proposed method employs semantic clustering, using a lightweight (80MB) pre-trained embedding model to group multiple candidate responses. It then selects the most coherent caption -- containing a diagnosis, symptoms, analysis, treatment, and prevention recommendations -- through a cosine similarity-based consensus. A practical human-in-the-loop (HITL) component is incorporated, wherein user confirmation of the crop type filters erroneous generations, ensuring higher-quality input for the consensus mechanism. Applied to the publicly available PlantVillage dataset using a fine-tuned 3B-parameter PaliGemma model, our framework demonstrates improvements over standard decoding methods. Evaluated on 800 crop disease images with up to 21 generations per image, our single-cluster consensus method achieves a peak accuracy of 83.1% with 10 candidate generations, compared to the 77.5% baseline accuracy of greedy decoding. The framework's effectiveness is further demonstrated when considering multiple clusters; accuracy rises to 94.0% when a correct response is found within any of the top four candidate clusters, outperforming the 88.5% achieved by a top-4 selection from the baseline.
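The clustering-plus-consensus step can be sketched in a few lines. The abstract names semantic clustering and cosine-similarity consensus but not the exact algorithm, so the greedy threshold clustering below, the `threshold` value, and the function names are assumptions; the toy 2-D vectors stand in for real caption embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(embeddings, threshold=0.8):
    """Greedy clustering (an assumed stand-in for the paper's method):
    join the first cluster whose seed is similar enough, else start a new one."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(emb, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def consensus_index(embeddings, threshold=0.8):
    """Pick the candidate most similar, on average, to the rest of the largest cluster."""
    largest = max(cluster(embeddings, threshold), key=len)

    def mean_similarity(i):
        others = [j for j in largest if j != i]
        if not others:
            return 1.0
        return sum(cosine(embeddings[i], embeddings[j]) for j in others) / len(others)

    return max(largest, key=mean_similarity)


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
    print(cluster(toy), consensus_index(toy))
```

In practice the embeddings would come from the lightweight embedding model applied to each generated caption, and the human-in-the-loop crop-type check would remove candidates before clustering.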

[117] Proportion and Perspective Control for Flow-Based Image Generation

Julien Boudier,Hugo Caselles-Dupré

Main category: cs.CV

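TL;DR: Presents two ControlNets for artistic control: a proportion ControlNet that uses bounding boxes to control object position and scale, and a perspective ControlNet that uses vanishing lines to control the scene's 3D geometry. Training is supported by data pipelines built on vision-language models.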

Details Motivation: Modern text-to-image diffusion models generate high-fidelity images but offer limited control over the spatial and geometric structure of the output, so finer-grained control methods are needed. Method: Designs and trains two specialized ControlNets, a proportion ControlNet and a perspective ControlNet, using vision-language models for annotation and specialized algorithms for conditioning image synthesis. Result: Experiments show that both modules control image structure effectively but have limitations with complex constraints. The models are released on HuggingFace. Conclusion: The proposed ControlNets strengthen control over the spatial and geometric structure of generated images and give artists more controllability. Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: https://huggingface.co/obvious-research

[118] H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows

Harry Zhang,Luca Carlone

Main category: cs.CV

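TL;DR: Proposes H2OFlow, a framework that learns 3D human-object interaction (HOI) affordances covering contact, orientation, and spatial occupancy from synthetic 3D data without human annotation; a dense diffusion process on point clouds lets it generalize to real objects.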

Details Motivation: Existing methods depend on costly hand-labeled datasets and are mostly limited to contact-based analysis, ignoring important interaction factors such as orientation and spatial occupancy. Method: Proposes the H2OFlow framework, which uses a dense 3D-flow-based representation learned through dense diffusion on point clouds to acquire 3D HOI affordances automatically from synthetic data. Result: Experiments show that H2OFlow models 3D affordances better than existing methods that rely on manual annotations or mesh-based representations, and that it generalizes to real-world objects. Conclusion: H2OFlow achieves comprehensive 3D HOI affordance understanding without human annotation, with gains in contact, orientation, and spatial occupancy, and shows strong practical potential. Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce H2OFlow, a novel framework that comprehensively learns 3D HOI affordances -- encompassing contact, orientation, and spatial occupancy -- using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.

[119] OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment

Yulong Zhang

Main category: cs.CV

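TL;DR: This paper introduces OCR-Quality, a human-annotated dataset of 1,000 PDF pages for evaluating and developing OCR quality assessment methods; it covers diverse real-world scenarios and provides a public benchmark.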

Details Motivation: To address the lack of reliable OCR quality assessment in real-world applications and to advance OCR verification systems. Method: Diverse PDF documents are collected and converted to PNG images at 300 DPI, processed with state-of-the-art vision-language models, and manually annotated for quality on a 4-level scale. Result: The result is a high-quality dataset with detailed source information, annotation guidelines, and samples across difficulty levels, publicly released on Hugging Face. Conclusion: OCR-Quality fills a gap in OCR quality assessment and provides an important resource and benchmark for related research. Abstract: We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at https://huggingface.co/datasets/Aslan-mingye/OCR-Quality .

[120] Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation

Dawei Dai,Yinxiu Zhou,Chenghang Li,Guolai Jiang,Chengfang Zhang

Main category: cs.CV

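TL;DR: Proposes Face-MakeUpV2, which adds a 3D facial rendering channel and a global facial feature channel to generate controllable face images while keeping identity and physical characteristics consistent.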

Details Motivation: Existing text-to-image models suffer from facial attribute leakage and physical inconsistency when handling local semantic instructions. Method: The authors build the large-scale FaceCaptionMask-1M dataset, use a pretrained text-to-image model as the backbone, add a 3D facial rendering channel and a global facial feature channel, and design semantic alignment and perceptual loss objectives. Result: Experiments show that Face-MakeUpV2 performs best at preserving face ID and physical consistency. Conclusion: Face-MakeUpV2 has practical potential for reliable, controllable facial editing in diverse applications. Abstract: In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset, FaceCaptionMask-1M, comprising approximately one million image-text-mask pairs that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model's embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that our Face-MakeUpV2 achieves the best overall performance in terms of preserving face ID and maintaining physical consistency of the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.

[121] Ageing Drift in Binary Face Templates: A Bits-per-Decade Analysis

Abdelilah Ganmati,Karim Afdel,Lahcen Koutti

Main category: cs.CV

TL;DR: Studies the longitudinal stability of compact binary face templates and quantifies ageing drift in bits per decade.

Details Motivation: Face recognition systems degrade over long-term use as individuals age, so this ageing drift needs to be quantified and its effect on system stability assessed. Method: Float embeddings from a modern face CNN are compressed with PCA-ITQ into 64- and 128-bit binary codes; on AgeDB, for each identity with at least three distinct ages, a linear model of Hamming distance versus age gap is fitted over all genuine pairs. Result: Across 566 identities, the median drift is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates. The distributions are predominantly positive, indicating a small but systematic increase in intra-class distance over time; drift also grows with code length, so shorter codes are more stable at a fixed decision threshold. Conclusion: Longer binary codes, although more accurate, are less stable than shorter codes under ageing drift; periodic re-enrolment or targeted parity on unstable bits is suggested as mitigation. Abstract: We study the longitudinal stability of compact binary face templates and quantify ageing drift directly in bits per decade. Float embeddings from a modern face CNN are compressed with PCA-ITQ into 64- and 128-bit codes. For each identity in AgeDB with at least three distinct ages, we form all genuine pairs and fit a per-identity linear model of Hamming distance versus absolute age gap. Across 566 identities, the median slope is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates, with tight non-parametric 95 percent bootstrap confidence intervals. The distributions are predominantly positive, indicating a small but systematic increase in intra-class distance over time. Because drift scales with code length, shorter codes are inherently more age-stable at a fixed decision threshold. We connect these slopes to operating characteristics by reporting EER and TPR at FAR = 1 percent in three age bins. We discuss implications for smart-card and match-on-card deployments, including simple mitigations such as periodic re-enrolment and targeted parity on empirically unstable bit positions. Code and CSV artifacts are provided to support reproducibility.
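
The per-identity fit described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the ordinary least-squares fit, and the toy 4-bit templates are all assumptions made for the example.

```python
from itertools import combinations
from statistics import median


def hamming(a: int, b: int) -> int:
    """Count differing bits between two binary templates stored as ints."""
    return bin(a ^ b).count("1")


def slope_bits_per_decade(samples):
    """Least-squares slope of Hamming distance vs. absolute age gap.

    samples: list of (age_in_years, template_as_int) for ONE identity.
    Returns None when the slope is undefined (too few pairs or equal gaps).
    """
    gaps, dists = [], []
    for (age_a, code_a), (age_b, code_b) in combinations(samples, 2):
        gaps.append(abs(age_a - age_b) / 10.0)  # years -> decades
        dists.append(hamming(code_a, code_b))
    if len(gaps) < 2:
        return None
    mean_gap = sum(gaps) / len(gaps)
    mean_dist = sum(dists) / len(dists)
    sxx = sum((g - mean_gap) ** 2 for g in gaps)
    if sxx == 0:
        return None
    sxy = sum((g - mean_gap) * (d - mean_dist) for g, d in zip(gaps, dists))
    return sxy / sxx


def median_drift(identities):
    """Median per-identity slope over a dict {identity: samples}."""
    slopes = [slope_bits_per_decade(s) for s in identities.values()]
    return median(s for s in slopes if s is not None)


# Toy, made-up 4-bit templates (hypothetical data, not AgeDB).
toy = {
    "id_a": [(20, 0b0000), (30, 0b0001), (40, 0b0011)],
    "id_b": [(25, 0b1111), (35, 0b1111), (55, 0b1111)],
}
print(median_drift(toy))
```

Each identity contributes one slope, so the median summarizes drift across people rather than across image pairs, which keeps identities with many photos from dominating.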

[122] Bridging Accuracy and Interpretability: Deep Learning with XAI for Breast Cancer Detection

Bishal Chhetri,B. V. Rathish Kumar

Main category: cs.CV

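TL;DR: Proposes an interpretable deep learning framework for early breast cancer detection from FNA images of breast masses, reaching an accuracy of 0.992 and using SHAP and LIME to improve interpretability.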

Details Motivation: To improve both the accuracy of early breast cancer diagnosis and model interpretability, supporting the clinical adoption of deep learning. Method: A deep neural network with ReLU activations, the Adam optimizer, and binary cross-entropy loss is combined with explainable-AI techniques such as SHAP and LIME for feature attribution. Result: The model reaches an accuracy of 0.992, precision of 1.000, recall of 0.977, and F1 score of 0.988, outperforming traditional machine learning methods; the concave points feature is identified as the most important predictor. Conclusion: The framework improves interpretability while keeping high performance, which helps build clinician trust and supports use in real clinical settings. Abstract: In this study, we present an interpretable deep learning framework for the early detection of breast cancer using quantitative features extracted from digitized fine needle aspirate (FNA) images of breast masses. Our deep neural network, using ReLU activations, the Adam optimizer, and a binary cross-entropy loss, delivers state-of-the-art classification performance, achieving an accuracy of 0.992, precision of 1.000, recall of 0.977, and an F1 score of 0.988. These results substantially exceed the benchmarks reported in the literature. We evaluated the model under identical protocols against a suite of well-established algorithms (logistic regression, decision trees, random forests, stochastic gradient descent, K-nearest neighbors, and XGBoost) and found the deep model consistently superior on the same metrics. Recognizing that high predictive accuracy alone is insufficient for clinical adoption due to the black-box nature of deep learning models, we incorporated model-agnostic Explainable AI techniques such as SHAP and LIME to produce feature-level attributions and human-readable visualizations. These explanations quantify the contribution of each feature to individual predictions, support error analysis, and increase clinician trust, thus bridging the gap between performance and interpretability for real-world clinical use. The concave points feature of the cell nuclei is found to be the most influential feature positively impacting the classification task. This insight can be very helpful in improving the diagnosis and treatment of breast cancer by highlighting the key characteristics of breast tumors.
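
As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The helper below is an illustrative sketch (the function name is an assumption); it only re-derives the F1 value from the precision and recall quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract.
print(round(f1_score(1.000, 0.977), 3))  # 0.988
```

The rounded value agrees with the F1 score of 0.988 stated in the abstract.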

[123] EdgeSync: Accelerating Edge-Model Updates for Data Drift through Adaptive Continuous Learning

Runchu Donga,Peng Zhao,Guiqin Wang,Nan Qi,Jie Lin

Main category: cs.CV

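TL;DR: This paper proposes EdgeSync, an efficient edge-model update method that uses timeliness and inference results to improve sample filtering and training scheduling, raising the accuracy and responsiveness of real-time video analytics systems.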

Details Motivation: Changes in lighting, weather, and other factors shift the data distribution and lower the accuracy of lightweight models on edge devices; existing cloud-assisted update methods suffer from high compute cost, long update delays, and new models that do not match the current data distribution. Method: EdgeSync improves sample filtering by combining timeliness and inference results, and adds a dynamic training management module that optimizes the timing and ordering of model updates, improving update efficiency and model adaptability. Result: Experiments on several complex real-world datasets show accuracy gains of about 3.4% over existing methods and about 10% over traditional approaches. Conclusion: EdgeSync mitigates update delay and distribution shift in edge-model updating and improves the performance and practicality of video analytics systems. Abstract: Real-time video analytics systems typically deploy lightweight models on edge devices to reduce latency. However, the distribution of data features may change over time due to various factors such as changing lighting and weather conditions, leading to decreased model accuracy. Recent frameworks try to address this issue by leveraging remote servers to continuously train and adapt lightweight edge models using more complex models in the cloud. Despite these advancements, existing methods face two key challenges: first, the retraining process is compute-intensive, causing significant delays in model updates; second, the new model may not align well with the evolving data distribution of the current video stream. To address these challenges, we introduce EdgeSync, an efficient edge-model updating approach that enhances sample filtering by incorporating timeliness and inference results, thus ensuring training samples are more relevant to the current video content while reducing update delays. Additionally, EdgeSync features a dynamic training management module that optimizes the timing and sequencing of model updates to improve their timeliness. Evaluations on diverse and complex real-world datasets demonstrate that EdgeSync improves accuracy by approximately 3.4% compared to existing methods and by about 10% compared to traditional approaches.

[124] Promptable Fire Segmentation: Unleashing SAM2's Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance

Emmanuel U. Ugwu,Zhang Xinming

Main category: cs.CV

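TL;DR: This paper presents the first systematic evaluation of SAM2 variants for fire segmentation, focusing on bounding-box prompting strategies; Box+MP prompting works best, and lightweight models such as TinySAM and MobileSAM are better suited to edge deployment.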

Details Motivation: Fire segmentation remains challenging because flames have irregular boundaries, translucent edges, and highly variable intensity; although SAM models generalize well across domains, their use for fire segmentation, especially under mobile deployment, is underexplored. Method: Four SAM2.1 variants (tiny, small, base_plus, large) and mobile-oriented variants (TinySAM, MobileSAM) are evaluated on three fire datasets with several prompting strategies: automatic, single positive point, single positive point plus single negative point, multiple positive points, bounding box, and hybrid prompts (Box+SP, Box+MP). Result: Bounding-box prompts outperform automatic and single-point prompts overall, with Box+MP reaching the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset; lightweight models substantially reduce memory and compute cost and suit edge deployment better. Conclusion: The study provides key insights for deploying promptable segmentation models in fire monitoring systems and establishes benchmarks for domain-specific SAM applications. Abstract: Fire segmentation remains a critical challenge in computer vision due to flames' irregular boundaries, translucent edges, and highly variable intensities. While the Segment Anything Models (SAM and SAM2) have demonstrated impressive cross-domain generalization capabilities, their effectiveness in fire segmentation, particularly under mobile deployment constraints, remains largely unexplored. This paper presents the first comprehensive evaluation of SAM2 variants for fire segmentation, focusing on bounding box prompting strategies to enhance deployment feasibility. We systematically evaluate four SAM2.1 variants (tiny, small, base_plus, large) alongside mobile-oriented variants (TinySAM, MobileSAM) across three fire datasets using multiple prompting strategies: automatic, single positive point (SP), single positive point + single negative point (SP+SN), multiple positive points (MP), bounding box (Box), and hybrid variants (Box+SP and Box+MP). Our experimental results demonstrate that bounding box prompts consistently outperform automatic and single point-based approaches, with Box+MP achieving the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset. Lightweight variants such as TinySAM and MobileSAM further reduce memory and computational costs, making them more suitable for latency-tolerant edge scenarios. Overall, this work provides critical insights for deploying promptable segmentation models in fire monitoring systems and establishes benchmarks for future research in domain-specific SAM applications.
Code is available at: https://github.com/UEmmanuel5/ProFSAM
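
The two metrics quoted above, IoU and the Dice coefficient, can be computed per image from binary masks. The sketch below is illustrative only: the function name, the list-of-lists mask format, and the convention of scoring two empty masks as a perfect match are assumptions, not details taken from the paper or its repository.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equally sized 2D masks of 0/1 ints."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    pred_area = sum(map(sum, pred))
    target_area = sum(map(sum, target))
    union = pred_area + target_area - inter
    if union == 0:
        # Assumed convention: both masks empty counts as a perfect match.
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + target_area)


# Toy 2x2 masks (made-up values).
pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

A dataset-level score such as a mean IoU would then average these per-image values over the test set.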

[125] Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models

Guo Li,Yuyang Yu,Xuemiao Xu

Main category: cs.CV

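TL;DR: Proposes an efficient membership inference attack on diffusion models that injects slight noise and analyzes how tightly the predicted noise aggregates to decide whether a sample was in the training set.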

Details Motivation: Diffusion models excel at generating high-quality images, but their wide use brings privacy risks; in particular, membership inference attacks can leak information about training data. Method: Slight noise is injected at a specific time step and the aggregation degree of the model's predicted noise distribution is evaluated, exploiting differences in noise prediction patterns between member and non-member samples. Result: The method performs well on multiple datasets, achieves better ASR and AUC against large-scale text-to-image diffusion models, and needs fewer queries. Conclusion: The method is efficient and scalable, mounts effective membership inference attacks on diffusion models, and highlights their potential privacy risks. Abstract: Diffusion models have demonstrated powerful performance in generating high-quality images. A typical example is a text-to-image generator such as Stable Diffusion. However, their widespread use also poses potential privacy risks. A key concern is membership inference attacks, which attempt to determine whether a particular data sample was used in the model training process. We propose an efficient membership inference attack method against diffusion models. This method is based on the injection of slight noise and the evaluation of the aggregation degree of the noise distribution. The intuition is that the noise prediction patterns of diffusion models for training set samples and non-training set samples exhibit distinguishable differences. Specifically, we suppose that member images exhibit higher aggregation of predicted noise around a certain time step of the diffusion process. In contrast, the predicted noises of non-member images are more dispersed around that time step. Compared with other existing methods, our proposed method requires fewer visits to the target diffusion model. We inject slight noise into the image under test and then determine its membership by analyzing the aggregation degree of the noise distribution predicted by the model. Empirical findings indicate that our method achieves superior performance across multiple datasets. At the same time, our method also shows better attack effects in ASR and AUC when facing large-scale text-to-image diffusion models, demonstrating its scalability.

[126] Multi-Agent Pose Uncertainty: A Differentiable Rendering Cramér-Rao Bound

Arun Muthukkumar

Main category: cs.CV

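TL;DR: Derives a closed-form lower bound on the covariance of camera pose estimates using a differentiable renderer: linearizing image formation under small pose perturbations on the manifold gives a render-aware Cramér-Rao bound, applicable to tasks such as multi-camera fusion and cooperative perception.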

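Details Motivation: Pose estimation under dense or learned models lacks rigorous uncertainty quantification, calling for a statistical framework that connects differentiable rendering with classical vision theory. Method: The differentiable renderer is treated as a measurement function, image formation is linearized on the pose manifold to derive a render-aware Cramér-Rao lower bound, and the bound is extended to multi-camera settings by fusing Fisher information. Result: The bound is theoretically consistent with classical bundle-adjustment uncertainty and extends naturally to multi-agent settings, supporting cooperative perception and novel view synthesis without explicit keypoint correspondences. Conclusion: The framework provides a reliable uncertainty quantification tool for learning-based pose estimation, with both theoretical continuity and practical extensibility. Abstract: Pose estimation is essential for many applications within computer vision and robotics. Despite its uses, few works provide rigorous uncertainty quantification for poses under dense or learned models. We derive a closed-form lower bound on the covariance of camera pose estimates by treating a differentiable renderer as a measurement function. Linearizing image formation with respect to a small pose perturbation on the manifold yields a render-aware Cramér-Rao bound. Our approach reduces to classical bundle-adjustment uncertainty, ensuring continuity with vision theory. It also naturally extends to multi-agent settings by fusing Fisher information across cameras. Our statistical formulation has downstream applications for tasks such as cooperative perception and novel view synthesis without requiring explicit keypoint correspondences.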
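
The fusion step can be illustrated with a toy calculation. The sketch below assumes independent Gaussian pixel noise with a known standard deviation (an assumption made here, not stated in the abstract), a 2-DoF pose for brevity, and hand-written Jacobians standing in for renderer derivatives; all names are hypothetical.

```python
def fisher_information(jacobian, sigma):
    """J^T J / sigma^2 for rows of d(pixel)/d(pose), assuming i.i.d.
    Gaussian pixel noise with standard deviation sigma (an assumption)."""
    dof = len(jacobian[0])
    info = [[0.0] * dof for _ in range(dof)]
    for row in jacobian:
        for i in range(dof):
            for j in range(dof):
                info[i][j] += row[i] * row[j] / sigma ** 2
    return info


def add(a, b):
    """Fuse independent cameras by summing their information matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def invert_2x2(m):
    """Inverse of a 2x2 matrix; raises if the pose is unobservable."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular information matrix")
    return [[d / det, -b / det], [-c / det, a / det]]


# Toy Jacobians: camera A only senses pose axis 0, camera B only axis 1.
info_a = fisher_information([[1.0, 0.0], [1.0, 0.0]], sigma=1.0)
info_b = fisher_information([[0.0, 2.0]], sigma=1.0)
bound = invert_2x2(add(info_a, info_b))  # lower bound on pose covariance
print(bound)
```

In this toy case neither camera alone yields a finite bound, because each information matrix is singular, while the fused information is invertible. That is the intuition behind fusing Fisher information across agents.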

[127] EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction

Qile Su,Shoutai Zhu,Shuai Zhang,Baoyu Liang,Chao Tong

Main category: cs.CV

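TL;DR: This paper introduces AVEP (Action-centric Video Event Prediction), a new task that predicts subsequent events from context, and builds a large dataset of about 35K annotated videos and 178K video clips. To handle complex event structure, the authors design EventFormer, a node-graph hierarchical attention model that captures relationships between events and among their arguments. Experiments demonstrate the difficulty of the task and the value of the dataset, and the proposed method outperforms existing video prediction models.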

Details Motivation: Existing script event reasoning work focuses on human events in text form, whereas most human events are recorded as video, and related research in vision is lacking. An event prediction method that handles the complex logic and rich semantics of video is therefore needed. Method: The paper proposes the AVEP task and a large structured dataset, together with EventFormer, a model with node-graph hierarchical attention that treats multimodal event arguments as basic units and captures event-argument relationships and coreference among arguments. Result: Experiments with several SOTA video prediction models and LVLMs on AVEP show that the task is challenging and the dataset valuable; EventFormer outperforms all baseline models. Conclusion: AVEP is a challenging new video event prediction task; the accompanying dataset and the EventFormer model advance the field, and the data and code will be released to support further research. Abstract: Script event induction, which aims to predict the subsequent event based on the context, is a challenging task in NLP, achieving remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, which consists of about 35K annotated videos and more than 178K event video clips, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is represented as a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention based video event prediction model, which can capture both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all these video prediction models.
We will release the dataset and code for replicating the experiments and annotations.

[128] Mismatch reconstruction theory for unknown measurement matrix in imaging through multimode fiber bending

Le Yang

Main category: cs.CV

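TL;DR: This paper proposes a mismatch reconstruction theory for multimode fiber imaging when the measurement matrix is unknown: matching and calibration algorithms construct a new measurement matrix, which is shown to reconstruct images successfully under low noise, with a certain degree of robustness.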

Details Motivation: In practice the measurement matrix is often unavailable, because the system configuration is unknown or because fiber bending prevents real-time alignment, so traditional reconstruction algorithms fail; image reconstruction with an unknown measurement matrix therefore needs to be solved. Method: The paper proposes a mismatch equation and designs matching and calibration solution algorithms to construct a new measurement matrix, with detailed proofs given in the appendix. Result: Experiments show that under low noise the constructed matrix can be used with traditional reconstruction algorithms to recover the original image; further analysis shows some robustness to noise, computational precision, and orthogonality. Conclusion: The mismatch reconstruction theory addresses the unknown measurement matrix problem in multimode fiber imaging and has practical potential; its limitations and future applications are also discussed. Abstract: Multimode fiber imaging requires strict matching between the measured values and the measurement matrix to achieve image reconstruction. However, in practical applications, the measurement matrix often cannot be obtained due to unknown system configuration or difficulty in real-time alignment after arbitrary fiber bending, resulting in the failure of traditional reconstruction algorithms. This paper presents a novel mismatch reconstruction theory for solving the problem of image reconstruction when the measurement matrix is unknown. We first propose a mismatch equation and design matching and calibration solution algorithms to construct a new measurement matrix. In addition, we provide a detailed proof of these equations and algorithms in the appendix. The experimental results show that, under low noise levels, the constructed matrix can serve as the matched pair in traditional reconstruction algorithms and successfully reconstruct the original image. Then, we analyze the impact of noise, computational precision, and orthogonality on reconstruction performance. The results show that the proposed algorithms have a certain degree of robustness. Finally, we discuss the limitations and potential applications of this theory. The code is available at https://github.com/yanglebupt/mismatch-solution.

[129] Exploring the design space of diffusion and flow models for data fusion

Niraj Chaudhari,Manmeet Singh,Naveen Sudharsan,Amit Kumar Srivastava,Harsh Kamath,Dushyant Mahajan,Ayan Paul

Main category: cs.CV

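TL;DR: This study explores diffusion and flow models for fusing satellite nighttime lights data, finds that UNet-based diffusion models excel at preserving spatial detail and generating high-quality fused images, and offers guidance on noise schedulers and quantization.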

Details Motivation: To improve the quality of multi-source satellite data fusion (for example, DMSP-OLS and VIIRS nighttime lights) and address limited spatial and temporal resolution. Method: Several 2D image-to-image generative models (UNet, diffusion, and flow models) are evaluated systematically for data fusion, along with the effect of different noise schedulers and quantization techniques on efficiency and reconstruction quality. Result: UNet-based diffusion models outperform other architectures at preserving fine-grained spatial information and generating high-fidelity fused images; iterative solvers speed up inference, while discrete schedulers give higher-quality reconstructions; quantization reduces compute and memory cost without sacrificing performance. Conclusion: Diffusion models, especially UNet-based ones, are an effective choice for remote sensing data fusion, and suitable noise scheduling and quantization strategies can markedly improve fusion quality and computational efficiency. Abstract: Data fusion is an essential task in various domains, enabling the integration of multi-source information to enhance data quality and insights. One key application is in satellite remote sensing, where fusing multi-sensor observations can improve spatial and temporal resolution. In this study, we explore the design space of diffusion and flow models for data fusion, focusing on the integration of Defense Meteorological Satellite Program's Operational Linescan System (DMSP-OLS) and Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime lights data. Our approach leverages a diverse set of 2D image-to-image generative models, including UNet, diffusion, and flow modeling architectures. We evaluate the effectiveness of these architectures in satellite remote sensing data fusion, identifying diffusion models based on UNet as particularly adept at preserving fine-grained spatial details and generating high-fidelity fused images. We also provide guidance on the selection of noise schedulers in diffusion-based models, highlighting the trade-offs between iterative solvers for faster inference and discrete schedulers for higher-quality reconstructions. Additionally, we explore quantization techniques to optimize memory efficiency and computational cost without compromising performance. Our findings offer practical insights into selecting the most effective diffusion and flow model architectures for data fusion tasks, particularly in remote sensing applications, and provide recommendations for leveraging noise scheduling strategies to enhance fusion quality.

[130] 2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection

Usman Ali,Ali Zia,Abdul Rehman,Umer Ramzan,Zohaib Hassan,Talha Sattar,Jing Wang,Wei Xiang

Main category: cs.CV

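TL;DR: Proposes MAFR, a new unsupervised multimodal fusion framework for industrial anomaly detection that combines RGB images and point clouds, localizes anomalies through attention mechanisms and reconstruction error, and reaches SOTA performance on multiple benchmarks.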

Details Motivation: Fusing 2D and 3D data is promising for industrial anomaly detection, but cross-modal fusion remains challenging, and robustness and accuracy need to improve when labeled data are lacking. Method: A shared fusion encoder builds a unified latent space, attention-guided modality-specific decoders restore features, and anomalies are localized by computing the difference between the input and its reconstruction. Result: The method reaches a mean I-AUROC of 0.972 on MVTec 3D-AD and 0.901 on Eyecandies, showing strong detection performance and few-shot learning ability. Conclusion: MAFR offers an effective and principled way to fuse visual and geometric information, improving the robustness and accuracy of industrial anomaly detection. Abstract: Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at https://github.com/adabrh/MAFR
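
Reconstruction-error localisation of this kind can be sketched in a few lines. The example below is a generic illustration rather than MAFR's implementation: the squared L2 error, the use of the maximum as an image-level score, and the toy feature vectors are assumptions chosen for clarity.

```python
def anomaly_map(features, restored):
    """Per-location squared L2 distance between input and restored features.

    Both arguments are lists of equal-length feature vectors, one per
    spatial location (flattened). The error metric is an assumption.
    """
    return [
        sum((f - r) ** 2 for f, r in zip(f_vec, r_vec))
        for f_vec, r_vec in zip(features, restored)
    ]


def image_score(amap):
    """Image-level score taken as the largest local error (assumed)."""
    return max(amap)


# Toy features: the third location is poorly restored, so it stands out.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
restored = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
amap = anomaly_map(features, restored)
print(amap, image_score(amap))
```

The idea is that a model trained only on normal samples restores normal regions well, so large local errors point to regions it has not learned to reproduce.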

[131] Token-Level Inference-Time Alignment for Vision-Language Models

Kejia Chen,Jiawen Zhang,Jiacong Hu,Kewei Gao,Jian Lou,Zunlei Feng,Mingli Song

Main category: cs.CV

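TL;DR: Proposes TITA, a lightweight inference-time alignment framework that provides dense token-level feedback to reduce hallucination in vision-language models without retraining the backbone.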

Details Motivation: Existing alignment methods rely on expensive fine-tuning or coarse sequence-level feedback, which makes it hard to mitigate VLM hallucination effectively. Method: The base VLM is frozen, a reward model is trained to approximate its distribution, and inference-time log-probability ratios serve as implicit preference signals that give fine-grained token-level feedback. Result: Experiments on LLaVA and other large VLMs show that TITA improves performance significantly, by 8.6% on MMVet and 6.7% on POPE, and reduces hallucination with minimal inference overhead. Conclusion: TITA is an efficient, general inference-time alignment method that improves VLM accuracy and reliability through token-level feedback without fine-tuning the backbone. Abstract: Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination, that is, plausible text misaligned with visual inputs. Existing alignment approaches often rely on expensive fine-tuning with annotated preference data or sequence-level inference strategies that provide only coarse, delayed feedback. To overcome these limitations, we present TITA (Token-level Inference-Time Alignment), a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback. This formulation can be viewed as an inference-time variant of Direct Preference Optimization (DPO), providing token-level corrective signals without retraining the backbone. Extensive evaluations on LLaVA-1.5-7B and 13B show consistent gains across 12 benchmarks, with improvements of 8.6% on MMVet and 6.7% on POPE, indicating stronger general understanding and reduced hallucinations. Additional experiments on Qwen2.5-VL-7B and DeepSeek-VL2-27.5B show comparable gains, especially in hallucination reduction and VQA accuracy, while incurring negligible inference overhead.
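
One way to picture token-level guidance from a log-probability ratio is the toy decoder step below. It is a hedged sketch, not TITA's actual procedure: the additive combination rule, the `beta` weight, the greedy choice, and the three-token vocabulary are all assumptions introduced for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_next_token(base_logits, reward_logits, beta=0.5):
    """Pick the next token from base log-probs shifted by a log-ratio.

    score(t) = log p_base(t) + beta * (log p_reward(t) - log p_base(t)).
    This combination rule and beta are assumptions for this sketch.
    """
    base_lp = log_softmax(base_logits)
    reward_lp = log_softmax(reward_logits)
    scores = [b + beta * (r - b) for b, r in zip(base_lp, reward_lp)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-token vocabulary: the base model prefers token 0, while the
# reward model strongly prefers token 1.
base = [2.0, 1.0, 0.0]
reward = [0.0, 3.0, 0.0]
print(guided_next_token(base, reward, beta=0.0))  # base model alone
print(guided_next_token(base, reward, beta=1.0))  # fully reward-guided
```

Because the signal is computed at every decoding step, feedback arrives per token rather than after a whole sequence has been generated.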

[132] Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention

Yinbo Sun,Yuchen Fang,Zhibo Zhu,Jia Li,Yu Liu,Qiwen Deng,Jun Zhou,Hang Yu,Xingyu Lu,Lintao Ma

Main category: cs.CV

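TL;DR: Proposes HIBA, a new architecture for time series foundation models that captures multi-scale temporal dependencies through hierarchical inter- and intra-block sparse attention and performs strongly in zero-shot transfer. The Xihe models built on it reach state-of-the-art parameter efficiency and generalization.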

Details Motivation: Existing time series foundation models adopt cross-domain architectures (such as language models) directly, which makes it hard to capture multi-scale temporal dependencies and limits zero-shot transfer across datasets. Method: Hierarchical Interleaved Block Attention (HIBA) combines intra-block attention (local information exchange) with inter-block attention (global temporal pattern interaction) to model multi-scale dependencies, and underpins the scalable Xihe model family ranging from 9.5M to 1.5B parameters. Result: On the GIFT-Eval benchmark, the smallest model, Xihe-tiny (9.5M), surpasses most existing TSFMs, and the largest, Xihe-max (1.5B), sets a new zero-shot state of the art by a substantial margin. Conclusion: HIBA shows strong generalization and architectural advantages across a wide parameter range, supporting its effectiveness and scalability for time series modeling. Abstract: The rapid advancement of time series foundation models (TSFMs) has been propelled by migrating architectures from language models. While existing TSFMs demonstrate impressive performance, their direct adoption of cross-domain architectures constrains effective capture of multiscale temporal dependencies inherent to time series data. This limitation becomes particularly pronounced during zero-shot transfer across datasets with divergent underlying patterns and sampling strategies. To address these challenges, we propose Hierarchical Interleaved Block Attention (HIBA), which employs hierarchical inter- and intra-block sparse attention to effectively capture multi-scale dependencies. Intra-block attention facilitates local information exchange, and inter-block attention operates across blocks to capture global temporal pattern interaction and dynamic evolution. Leveraging the HIBA architecture, we introduce Xihe, a scalable TSFM family spanning from an ultra-efficient 9.5M parameter configuration to a high-capacity 1.5B variant. Evaluated on the comprehensive GIFT-Eval benchmark, our most compact Xihe-tiny model (9.5M) surpasses the majority of contemporary TSFMs, demonstrating remarkable parameter efficiency. More impressively, Xihe-max (1.5B) establishes new state-of-the-art zero-shot performance, surpassing previous best results by a substantial margin. This consistent performance excellence across the entire parameter spectrum provides compelling evidence for the exceptional generalization capabilities and architectural superiority of HIBA.

[133] AI-Boosted Video Annotation: Assessing the Process Enhancement

Juan Gutiérrez,Ángel Mora,Pablo Regodón,Silvia Rodriguez,José Luis Blanco

Main category: cs.CV

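TL;DR: This study explores optimizing human-in-the-loop video annotation by integrating AI-driven zero-shot pre-annotation, which markedly improves annotation efficiency and consistency.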

Details Motivation: To reduce the manual annotation burden and improve the quality and efficiency of video annotation, the study examines the practical potential of AI-assisted annotation. Method: Label Studio is combined with AI-driven zero-shot pre-annotation in a single-iteration annotation scheme, tested on the UCF-Crime dataset for distinguishing normal from abnormal activities. Result: With pre-annotations, annotation time fell by 35% for 70% of the annotators at comparable quality; annotations were also more consistent across annotators and better matched the natural clustering of video frames. Conclusion: AI-assisted pre-annotation can effectively optimize the human-in-the-loop video annotation workflow, improving efficiency, consistency, and overall annotation quality. Abstract: We explore the enhancement of Human-in-the-Loop video annotation by integrating automatic capabilities to ease the task for annotators and assess their performance. The research delves into the practical implications of the annotation processes, the integration of AI components, and the evaluation of its outcomes. We analyze their impact on efficiency, accuracy, and overall annotation quality. Focusing on the Human-in-the-Loop for video annotation tasks, we implemented a single-iteration scheme using Label Studio and AI-powered zero-shot pre-annotations. Using this framework, we designed a test based on the annotation of the UCF-Crime dataset to discriminate between normal and abnormal activities in video footage. Our results evidence how automatic AI-based pre-annotation can streamline the video annotation workflow, empowering human annotators and optimizing the overall pipeline. Using the pre-annotated data, we observed a 35% reduction in the annotation time for 70% of the annotators with similar quality annotations, compared to the traditional manual annotation task. Results are consistent with asset duration and complexity. We also observed that while annotators rapidly learned to use the tool, the produced annotations are more coherent among annotators and better match the natural clustering of the video frames.

[134] Morphology-Aware KOA Classification: Integrating Graph Priors with Vision Models

Marouane Tliba,Mohamed Amine Kerkouri,Yassine Nasser,Nour Aburaed,Aladine Chetouani,Ulas Bagci,Rachid Jennane

Main category: cs.CV

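TL;DR: Proposes a multimodal framework that combines anatomical structure with X-ray features by fusing a graph representation with a vision encoder, significantly improving knee osteoarthritis (KOA) classification accuracy.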

Details Motivation: Conventional deep learning models struggle to capture subtle morphological changes in radiographs, which makes KOA diagnosis difficult. Method: SAM segmentations are used to derive anatomical structure and build a morphological graph representation, which is combined with a vision encoder; graph embeddings and image features are aligned by maximizing mutual information. Result: On the Osteoarthritis Initiative dataset, accuracy reaches nearly 80%, up to 10% above single-modality baselines, and the method outperforms existing SOTA methods (+8% accuracy, +11% F1 score). Conclusion: Incorporating anatomical structure priors into radiographic analysis is critical for KOA severity grading. Abstract: Knee osteoarthritis (KOA) diagnosis from radiographs remains challenging due to the subtle morphological details that standard deep learning models struggle to capture effectively. We propose a novel multimodal framework that combines anatomical structure with radiographic features by integrating a morphological graph representation, derived from Segment Anything Model (SAM) segmentations, with a vision encoder. Our approach enforces alignment between geometry-informed graph embeddings and radiographic features through mutual information maximization, significantly improving KOA classification accuracy. By constructing graphs from anatomical features, we introduce explicit morphological priors that mirror clinical assessment criteria, enriching the feature space and enhancing the model's inductive bias. Experiments on the Osteoarthritis Initiative dataset demonstrate that our approach surpasses single-modality baselines by up to 10% in accuracy (reaching nearly 80%), while outperforming existing state-of-the-art methods by 8% in accuracy and 11% in F1 score. These results underscore the critical importance of incorporating anatomical structure into radiographic analysis for accurate KOA severity grading.

[135] It Takes Two to Tango: Two Parallel Samplers Improve Quality in Diffusion Models for Limited Steps

Pedro Cisneros-Velarde

Main category: cs.CV

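TL;DR: Proposes a simple, plug-and-play method that uses two parallel samplers to improve diffusion model sample quality under a limited number of denoising steps, with no additional training; adding more samplers does not necessarily help.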

Details Motivation: How to improve generated image quality when the number of diffusion inference steps is limited, without adding training cost. Method: Two parallel samplers take denoising steps at successive times and their information is integrated in the latent space, giving a simple, model-agnostic ensemble strategy. Result: The method is validated with automated and human evaluations on several diffusion models; how the information is integrated matters, since naive integration lowers quality, and adding more samplers brought no further gain. Conclusion: The two-parallel-sampler method improves sample quality under limited steps and is general and practical, but extending it to more samplers requires a carefully designed integration mechanism. Abstract: We consider the situation where we have a limited number of denoising steps, i.e., of evaluations of a diffusion model. We show that two parallel processors or samplers under such limitation can improve the quality of the sampled image. Particularly, the two samplers make denoising steps at successive times, and their information is appropriately integrated in the latent image. Remarkably, our method is simple both conceptually and to implement: it is plug-and-play, model agnostic, and does not require any additional fine-tuning or external models. We test our method with both automated and human evaluations for different diffusion models. We also show that a naive integration of the information from the two samplers lowers sample quality. Finally, we find that adding more parallel samplers does not necessarily improve sample quality.

[136] Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval

Jiaao Yu,Mingjie Han,Tao Gong,Jian Zhang,Man Lan

Main category: cs.CV

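TL;DR: Proposes FDA-CLIP, a concise CLIP-based training framework for text-video alignment that generates dynamic region masks from frame differences and feeds them to Alpha-CLIP as an additional Alpha channel, strengthening dynamic feature modeling while suppressing static redundancy.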

Details Motivation: Early text-video retrieval methods depend on large amounts of annotated data and suffer from a modality gap, while existing CLIP adaptation methods neither enhance dynamic video features nor suppress static redundancy. Method: The FDA-CLIP framework uses frame differences to generate dynamic region masks and inputs them to Alpha-CLIP as an Alpha channel, guiding the model to focus on key dynamic regions and suppress the static background. Result: Experiments show that the method improves text-video retrieval accuracy while maintaining retrieval efficiency. Conclusion: By introducing a frame-difference-guided mechanism for enhancing dynamic features, FDA-CLIP improves cross-modal alignment and offers a new route to low-cost, high-accuracy video-text retrieval. Abstract: With the rapid growth of video data, text-video retrieval technology has become increasingly important in numerous application scenarios such as recommendation and search. Early text-video retrieval methods suffer from two critical drawbacks: first, they heavily rely on large-scale annotated video-text pairs, leading to high data acquisition costs; second, there is a significant modal gap between video and text features, which limits cross-modal alignment accuracy. With the development of vision-language models, adapting CLIP to video tasks has attracted great attention. However, existing adaptation methods generally lack enhancement for dynamic video features and fail to effectively suppress static redundant features. To address this issue, this paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), which is a concise CLIP-based training framework for text-video alignment. Specifically, the method uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP as an additional Alpha channel. This proactively guides the model to focus on semantically critical dynamic regions while suppressing static background redundancy. Experiments demonstrate that frame difference-guided video semantic encoding can effectively balance retrieval efficiency and accuracy.
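The frame-difference step described above can be illustrated with a minimal sketch. The grayscale conversion, the fixed threshold, and the function names `frame_difference_masks` and `attach_alpha` are assumptions made for illustration; the abstract does not specify how FDA-CLIP computes or post-processes its masks.

```python
import numpy as np

def frame_difference_masks(frames, threshold=0.1):
    """Binary motion masks from consecutive frames.

    frames: array of shape (T, H, W, 3) with values in [0, 1].
    Returns an array of shape (T - 1, H, W) with 1.0 where the
    grayscale intensity changed by more than `threshold` (assumed rule).
    """
    gray = frames.mean(axis=-1)
    diff = np.abs(gray[1:] - gray[:-1])
    return (diff > threshold).astype(np.float32)

def attach_alpha(frames, masks):
    """Append each mask as a fourth (alpha) channel to the later frame of its pair."""
    return np.concatenate([frames[1:], masks[..., None]], axis=-1)
```

The resulting four-channel frames have shape (T - 1, H, W, 4); an encoder that accepts an extra alpha channel, such as Alpha-CLIP, would consume inputs of this kind.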

[137] Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs

Jiaao Yu,Shenwei Li,Mingjie Han,Yifei Yin,Wenzheng Song,Chenghao Jia,Man Lan

Main category: cs.CV

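TL;DR: Proposes a new fine-tuning task, Masked Prediction via Context and Commonsense (MPCC), aimed at improving the generalized reasoning of vision-language models in multimodal scenarios, and validates the proposed Reinforcement Fine-tuning with Prior Sampling method on a newly built evaluation benchmark, MPCC Eval.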

Details Motivation: Existing reasoning models focus mainly on single-modal language tasks and adapt poorly to real-world multimodal scenarios, especially vision-language tasks; existing methods also fail to fully exploit visual context and commonsense knowledge, which limits how well reasoning generalizes. Method: Introduces the new fine-tuning task MPCC (Masked Prediction via Context and Commonsense), which requires the model to use visual context and commonsense reasoning to recover the semantic content of occluded images, and trains with Reinforcement Fine-tuning with Prior Sampling. A dedicated evaluation benchmark, MPCC Eval, is also built. Result: The proposed method improves reasoning performance in OOD (out-of-distribution) and cross-task scenarios on the MPCC Eval benchmark, indicating that it strengthens the generalized reasoning of vision-language models. Conclusion: Combining a masked prediction task grounded in visual context and commonsense reasoning with a new reinforcement fine-tuning strategy advances the general reasoning ability of vision-language models in complex multimodal environments. Abstract: Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet, a significant gap persists in their adaptation to real-world multimodal scenarios, most notably vision-language tasks, due to a heavy focus on single-modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception-centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge, ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate the model performance in generalized reasoning, we developed a specialized evaluation benchmark, MPCC Eval, and employed various fine-tuning strategies to guide reasoning. Among these, we introduced an innovative training method, Reinforcement Fine-tuning with Prior Sampling, which not only enhances model performance but also improves its generalized reasoning capabilities in OOD and cross-task scenarios.

[138] Semantic Relation-Enhanced CLIP Adapter for Domain Adaptive Zero-Shot Learning

Jiaao Yu,Mingjie Han,Jinkun Jiang,Junyu Dong,Tao Gong,Man Lan

Main category: cs.CV

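TL;DR: Proposes SRE-CLIP, a CLIP-based semantic relation-enhanced framework that addresses cross-category knowledge transfer and cross-modal alignment in domain-adaptive zero-shot learning, achieving state-of-the-art performance on the I2AwA and I2WebV benchmarks.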

Details Motivation: In data-limited scenarios, existing methods struggle to balance cross-domain transfer with cross-category generalization, and they do not fully exploit the potential of vision-language models such as CLIP for Domain-Adaptive Zero-Shot Learning (DAZSL). Method: Proposes the SRE-CLIP Adapter framework, which introduces a Semantic Relation Structure Loss to make cross-category knowledge transfer more efficient and a Cross-Modal Alignment Retention Strategy to mitigate the degradation of modal alignment during target-domain fine-tuning. Result: As the first CLIP-based DAZSL method, SRE-CLIP significantly outperforms existing methods on the I2AwA and I2WebV benchmarks and achieves state-of-the-art performance. Conclusion: SRE-CLIP addresses the under-use of semantic relations and the degradation of cross-modal alignment in DAZSL, unlocking CLIP's potential for this task and advancing data-efficient learning. Abstract: The high cost of data annotation has spurred research on training deep learning models in data-limited scenarios. Existing paradigms, however, fail to balance cross-domain transfer and cross-category generalization, giving rise to the demand for Domain-Adaptive Zero-Shot Learning (DAZSL). Although vision-language models (e.g., CLIP) have inherent advantages in the DAZSL field, current studies do not fully exploit their potential. Applying CLIP to DAZSL faces two core challenges: inefficient cross-category knowledge transfer due to the lack of semantic relation guidance, and degraded cross-modal alignment during target domain fine-tuning. To address these issues, we propose a Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter framework, integrating a Semantic Relation Structure Loss and a Cross-Modal Alignment Retention Strategy. As the first CLIP-based DAZSL method, SRE-CLIP achieves state-of-the-art performance on the I2AwA and I2WebV benchmarks, significantly outperforming existing approaches.

[139] Embodied Navigation with Auxiliary Task of Action Description Prediction

Haru Kondoh,Asako Kanezaki

Main category: cs.CV

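TL;DR: Proposes using action description as an auxiliary task in reinforcement learning, distilling knowledge from pre-trained vision-language models to achieve explainable, high-performance multimodal navigation.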

Details Motivation: Existing explainable navigation systems are limited in performance, and the lack of ground-truth data for action descriptions makes it hard to bring an action description task into reinforcement learning. Method: Action description is treated as an auxiliary task in reinforcement learning, with supervision provided through knowledge distillation from pre-trained vision-language models, so training proceeds without ground-truth description data. Result: The approach attains high navigation performance together with action description ability across a variety of navigation tasks, and reaches state-of-the-art performance on semantic audio-visual navigation. Conclusion: The method improves explainability without sacrificing performance, advancing explainable and efficient multimodal robot navigation. Abstract: The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.

[140] Hybrid Deep Learning Framework for Enhanced Diabetic Retinopathy Detection: Integrating Traditional Features with AI-driven Insights

Arpan Maity,Aviroop Pal,MD. Samiul Islam,Tamal Ghosh

Main category: cs.CV

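TL;DR: Proposes a hybrid diagnostic framework combining traditional feature extraction with deep learning to improve the accuracy of early diabetic retinopathy detection.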

Details Motivation: Diabetic retinopathy is asymptomatic in its early stages and can cause irreversible vision loss if not screened in time; efficient screening is urgently needed, especially in regions with large diabetic populations such as India. Method: Handcrafted feature extraction is combined with a deep learning model, fusing clinical markers in fundus images with automatically learned hierarchical features for multimodal analysis. Result: The hybrid model outperforms standalone deep learning approaches in classification, reduces false negatives, and improves detection accuracy and interpretability. Conclusion: The proposed multimodal AI approach enables scalable, accurate diabetic retinopathy screening suited to regions with a heavy diabetes burden. Abstract: Diabetic Retinopathy (DR), a vision-threatening complication of Diabetes Mellitus (DM), is a major global concern, particularly in India, which has one of the highest diabetic populations. Prolonged hyperglycemia damages retinal microvasculature, leading to DR symptoms like microaneurysms, hemorrhages, and fluid leakage, which, if undetected, cause irreversible vision loss. Therefore, early screening is crucial as DR is asymptomatic in its initial stages. Fundus imaging aids precise diagnosis by detecting subtle retinal lesions. This paper introduces a hybrid diagnostic framework combining traditional feature extraction and deep learning (DL) to enhance DR detection. While handcrafted features capture key clinical markers, DL automates hierarchical pattern recognition, improving early diagnosis. The model synergizes interpretable clinical data with learned features, surpassing standalone DL approaches with superior classification and fewer false negatives. This multimodal AI-driven approach enables scalable, accurate DR screening, crucial for diabetes-burdened regions.

[141] Comparative Analysis of Object Detection Algorithms for Surface Defect Detection

Arpan Maity,Tamal Ghosh

Main category: cs.CV

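TL;DR: Compares six mainstream object detection algorithms on the NEU-DET surface defect detection dataset; YOLOv11 is reported to clearly outperform the others in both accuracy and speed, making it well suited to metal surface defect detection in industrial quality control.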

Details Motivation: To improve the accuracy and efficiency of surface defect detection in industrial quality control, the performance of current mainstream object detection algorithms needs to be evaluated and compared in a realistic industrial setting. Method: Six algorithms (YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR) are compared on the NEU-DET dataset in terms of detection accuracy, speed, and robustness across defect types such as scratches, inclusions, and rolled-in scale. Result: YOLOv11 is reported to achieve 70% higher accuracy on average than the other algorithms, which the authors attribute to its enhanced feature extraction, single-forward-pass processing, improved anchor box generation, and deeper convolutional layers, giving more precise defect localization. Conclusion: YOLOv11 shows outstanding detection performance on NEU-DET and is the most effective of the compared models for surface defect detection, well ahead of the competing algorithms. Abstract: This article compares the performance of six prominent object detection algorithms, YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR, on the NEU-DET surface defect detection dataset, comprising images representing various metal surface defects, a crucial application in industrial quality control. Each model's performance was assessed regarding detection accuracy, speed, and robustness across different defect types such as scratches, inclusions, and rolled-in scales. YOLOv11, a state-of-the-art real-time object detection algorithm, demonstrated superior performance compared to the other methods, achieving a remarkable 70% higher accuracy on average. This improvement can be attributed to YOLOv11's enhanced feature extraction capabilities and ability to process the entire image in a single forward pass, making it faster and more efficient in detecting minor surface defects. Additionally, YOLOv11's architecture optimizations, such as improved anchor box generation and deeper convolutional layers, contributed to more precise localization of defects. In conclusion, YOLOv11's outstanding performance in accuracy and speed solidifies its position as the most effective model for surface defect detection on the NEU dataset, surpassing competing algorithms by a substantial margin.

[142] SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling

Samuel J. Barrett,Docko Sow

Main category: cs.CV

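TL;DR: SITS-DECO is a generative foundation model for Earth observation data with a GPT-style decoder-only architecture. Through unified sequence modelling and symbolic prompting it handles multi-task, multi-modal satellite image time series analysis and, without task-specific adaptation, outperforms larger models on tasks such as crop-type classification, showing the potential of a data-centric modelling paradigm.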

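Details Motivation: Existing Earth observation foundation models usually depend on particular data sources or training approaches and need extra adaptation before they can be used downstream, so they lack flexibility and generality. Inspired by unified sequence modelling in large language models, this work explores a more flexible, general EO modelling paradigm that improves generalization through a simple architecture and diverse training data. Method: The SITS-DECO model converts satellite image time series (SITS) into unified token sequences and applies autoregressive generative modelling with a GPT-style decoder-only architecture. Symbolic prompting supports multiple supervised and self-supervised tasks without task- or modality-specific structural changes. Without spatial context, the model relies on temporal modelling alone for pixel-wise, multi-temporal, multi-modal tasks such as crop-type classification. Result: SITS-DECO outperforms much larger Earth observation foundation models on crop-type classification on PASTIS-R. Although the model is simple and uses no spatial information, its modelling of dense time series yields strong performance in multi-task, multi-modal settings, pointing to the key role of temporal modelling in EO tasks. Conclusion: Unified sequence modelling and data-driven design can deliver strong multi-task Earth observation capability even with a simple architecture. SITS-DECO offers a lightweight, practical path toward future generative EO foundation models and advances the data-centric modelling paradigm. Abstract: Earth Observation (EO) Foundation Modelling (FM) holds great promise for simplifying and improving the use of EO data for diverse real-world tasks. However, most existing models require additional adaptation before they can be used and are structured rigidly around particular data sources or training approaches. To address this, we take inspiration from large language models, where diverse tasks, both pre-training and downstream, are implicitly captured through next-token prediction over unified token sequences, leveraging the structure and diversity of the training data. We introduce SITS-DECO (Satellite Image Time Series-DECoder Only), a proof-of-concept generative model that applies this unified-sequence framing to EO data. We use a simple GPT-style decoder-only architecture and demonstrate its ability to perform useful EO tasks (pixel-wise, multi-temporal, multi-modal crop-type classification) in a purely generative framework. Through symbolic prompting, we show that the model can perform multiple supervised and self-supervised tasks within a single unified architecture, without task- or modality-specific adaptation. Despite its simplicity and lack of spatial context, SITS-DECO outperforms much larger EO foundation models on crop-type classification (PASTIS-R), demonstrating that dense temporal sequence modelling is a critical missing ingredient in the current paradigm.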
This work exemplifies a data-centric modelling paradigm in which capability arises from the diversity and structure of the training data rather than from architectural complexity. SITS-DECO provides a lightweight, practical route to multi-modal, multi-task EO modelling, and a conceptual bridge toward future generative EO foundation models.

[143] Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding

Zhuoming Li,Aitong Liu,Mengxi Jia,Tengxiang Zhang,Dell Zhang,Xuelong Li

Main category: cs.CV

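TL;DR: Proposes Gestura, an end-to-end free-form gesture understanding system built on a large vision-language model; a Landmark Processing Module and a Chain-of-Thought reasoning strategy improve recognition accuracy and semantic understanding, and the authors release the first open-source dataset for free-form gesture intention understanding.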

Details Motivation: GestureGPT, the existing solution for free-form gesture understanding, suffers from low recognition accuracy and slow response, so a more efficient and robust solution is needed. Method: Gestura uses a pre-trained Large Vision-Language Model (LVLM) together with a Landmark Processing Module that embeds anatomical hand priors to capture subtle hand movements, and applies a Chain-of-Thought (CoT) reasoning strategy for step-by-step semantic inference. Result: The system markedly improves the interpretation of ambiguous or unconventional gestures and achieves robust, adaptable free-form gesture understanding; the authors also release the first open-source dataset for free-form gesture intention understanding, with over 300,000 annotated QA pairs. Conclusion: By combining domain priors with deep semantic reasoning, Gestura outperforms existing methods on free-form gesture understanding and offers a more natural, flexible technical path for future human-computer interaction. Abstract: Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution GestureGPT suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different styles, we introduce a Landmark Processing Module that compensates for LVLMs' lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic understanding and significantly enhancing the model's ability to interpret ambiguous or unconventional gestures. Together, these components allow Gestura to achieve robust and adaptable free-form gesture comprehension. Additionally, we have developed the first open-source dataset for free-form gesture intention reasoning and understanding with over 300,000 annotated QA pairs.

[144] Prompt fidelity of ChatGPT4o / Dall-E3 text-to-image visualisations

Dirk HR Spennemann

Main category: cs.CV

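TL;DR: Evaluates how faithfully ChatGPT4o / DALL-E3 text-to-image visualisations render the attributes specified in their prompts, finding that 15.6% of attributes were not rendered correctly, with the largest deviations in personal characteristics such as age.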

Details Motivation: To examine the prompt fidelity of DALL-E3 in text-to-image generation and identify how the model's performance differs across attribute categories, along with potential biases. Method: Using two public datasets (430 images in total), the study analyses whether generated images match the prompt on attributes such as age, hair, attire, glasses, name tags, and clipboards. Result: DALL-E3 deviated from the prompt in 15.6% of attributes (n=710); errors were lowest for paraphernalia, moderate for appearance, and highest for characteristics of the person themselves, particularly age. Conclusion: There is a measurable prompt-to-image fidelity gap, with implications for bias detection and model evaluation. Abstract: This study examines the prompt fidelity of ChatGPT4o / DALL-E3 text-to-image visualisations by analysing whether attributes explicitly specified in autogenously generated prompts are correctly rendered in the resulting images. Using two public-domain datasets comprising 200 visualisations of women working in the cultural and creative industries and 230 visualisations of museum curators, the study assessed accuracy across personal attributes (age, hair), appearance (attire, glasses), and paraphernalia (name tags, clipboards). While correctly rendered in most cases, DALL-E3 deviated from prompt specifications in 15.6% of all attributes (n=710). Errors were lowest for paraphernalia, moderate for personal appearance, and highest for depictions of the person themselves, particularly age. These findings demonstrate measurable prompt-to-image fidelity gaps with implications for bias detection and model evaluation.

[145] Wavelet-based GAN Fingerprint Detection using ResNet50

Sai Teja Erukude,Suhasnadh Reddy Veluru,Viswa Chaitanya Marella

Main category: cs.CV

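TL;DR: Proposes a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classifier to distinguish StyleGAN-generated images from real ones; accuracy is 93.8% with Haar wavelets and 95.1% with Daubechies wavelets, well above the spatial-domain model (81.5%).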

Details Motivation: Images produced by Generative Adversarial Networks (GANs) are hard to identify in digital image forensics, and effective detection methods are needed to counter the security threat posed by deepfakes. Method: Images are preprocessed with the discrete wavelet transform (DWT), using Haar and Daubechies wavelet filters to extract multi-resolution representations that are fed to a ResNet50 network for classification; the result is compared with a ResNet50 model trained directly in the spatial domain. Result: The Haar and Daubechies preprocessed models reach 93.8% and 95.1% accuracy respectively, well above the 81.5% of the spatial-domain model; Daubechies performs better, suggesting that frequency-domain features capture the subtle artifacts of GAN-generated images more effectively. Conclusion: GAN-generated images carry distinctive "fingerprints" in the wavelet domain, and wavelet-based analysis improves the detection of generated images, offering a viable direction for future deepfake detection systems. Abstract: Identifying images generated by Generative Adversarial Networks (GANs) has become a significant challenge in digital image forensics. This research presents a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classification layer to differentiate the StyleGAN-generated images from real ones. Haar and Daubechies wavelet filters are applied to convert the input images into multi-resolution representations, which are then fed to a ResNet50 network for classification, capitalizing on subtle artifacts left by the generative process. Moreover, the wavelet-based models are compared to an identical ResNet50 model trained on spatial data. The Haar and Daubechies preprocessed models achieved accuracies of 93.8 percent and 95.1 percent, respectively, much higher than the model developed in the spatial domain (accuracy rate of 81.5 percent). The Daubechies-based model outperforms Haar, showing that adding layers of descriptive frequency patterns can lead to even greater distinguishing power. These results indicate that the GAN-generated images have unique wavelet-domain artifacts or "fingerprints." The method proposed illustrates the effectiveness of wavelet-domain analysis to detect GAN images and emphasizes the potential of further developing the capabilities of future deepfake detection systems.
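To make the DWT preprocessing step concrete, here is a minimal single-level 2D Haar transform written directly in NumPy. It is a sketch of the general technique, not the authors' pipeline: the function name `haar_dwt2`, the sub-band names, and the orthonormal scaling are choices made here, and the abstract does not state how the sub-bands are arranged before being passed to ResNet50.

```python
import numpy as np

def haar_dwt2(image):
    """Single-level 2D Haar DWT of a grayscale image with even height and width.

    Returns four half-resolution sub-bands: the approximation plus
    row-wise, column-wise, and diagonal detail coefficients.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")
    # The four pixels of every non-overlapping 2x2 block.
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    approx = (a + b + c + d) / 2.0        # low-pass in both directions
    row_detail = (a + b - c - d) / 2.0    # change between the two rows
    col_detail = (a - b + c - d) / 2.0    # change between the two columns
    diag_detail = (a - b - c + d) / 2.0   # diagonal change
    return approx, row_detail, col_detail, diag_detail
```

With this scaling the transform is orthonormal, so the total energy of the four sub-bands equals that of the input; on a flat image all detail bands are zero, which is why high-frequency generation artifacts stand out in them.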

[146] Explainable Deep Learning in Medical Imaging: Brain Tumor and Pneumonia Detection

Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru

Main category: cs.CV

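TL;DR: Presents an explainable deep learning framework that uses ResNet50 and DenseNet121 to detect brain tumors in MRI scans and pneumonia in X-ray images, with Grad-CAM for interpretability; DenseNet121 outperforms ResNet50 in both accuracy and focus of attention.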

Details Motivation: Deep learning has great potential in medical imaging diagnostics, but the lack of interpretability limits clinical trust and adoption, so explainable models are needed to increase clinical uptake. Method: Two convolutional neural networks, ResNet50 and DenseNet121, are trained on public Kaggle datasets of 7,023 brain MRI images and 5,863 chest X-ray images, and Grad-CAM is used to generate heatmaps that visualize the regions driving the models' decisions. Result: DenseNet121 reaches 94.3% accuracy for brain tumor detection versus 92.5% for ResNet50, and 89.1% for pneumonia detection versus 84.4% for ResNet50; Grad-CAM shows DenseNet121 focusing more on the core lesion regions, while ResNet50's attention is more scattered. Conclusion: Combining deep learning with explainable AI such as Grad-CAM helps build reliable, interpretable, and clinically useful medical imaging diagnostic tools and supports their use in clinical practice. Abstract: Deep Learning (DL) holds enormous potential for improving medical imaging diagnostics, yet the lack of interpretability in most models hampers clinical trust and adoption. This paper presents an explainable deep learning framework for detecting brain tumors in MRI scans and pneumonia in chest X-ray images using two leading Convolutional Neural Networks, ResNet50 and DenseNet121. These models were trained on publicly available Kaggle datasets comprising 7,023 brain MRI images and 5,863 chest X-ray images, achieving high classification performance. DenseNet121 consistently outperformed ResNet50 with 94.3 percent vs. 92.5 percent accuracy for brain tumors and 89.1 percent vs. 84.4 percent accuracy for pneumonia. For better explainability, Gradient-weighted Class Activation Mapping (Grad-CAM) was integrated to create heatmap visualizations superimposed on the test images, indicating the most influential image regions in the decision-making process. Interestingly, while both models produced accurate results, Grad-CAM showed that DenseNet121 consistently focused on core pathological regions, whereas ResNet50 sometimes scattered attention to peripheral or non-pathological areas. Combining deep learning and explainable AI offers a promising path toward reliable, interpretable, and clinically useful diagnostic tools.

[147] Precise classification of low quality G-banded Chromosome Images by reliability metrics and data pruning classifier

Mojtaba Moattari

Main category: cs.CV

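TL;DR: Proposes a method that raises chromosome classification precision using reliability thresholding metrics and carefully engineered features, suited to low-quality images and low-cost systems.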

Details Motivation: Existing karyotyping systems need large amounts of high-quality training data to reach high classification precision, a condition that some remote pathology laboratories cannot meet, so precision must be improved under low-quality data to reduce false detections. Method: A modified deep Alex-Net neural network, SVM, K Nearest-Neighbors, and their cascade pipelines are combined with the proposed reliability thresholding metrics and engineered features to automatically filter and classify semi-straight chromosomes. Result: Classification precision exceeds 90% for chromosomes with the more common defects and translocations, and the method is validated on a low-quality G-banding database. Conclusion: The proposed thresholding metrics and pruning method suit karyotyping systems in resource-limited countries and low-budget pathology laboratories. Abstract: In the last decade, due to high-resolution cameras and accurate metaphase analyses, the accuracy of chromosome classification has improved substantially. However, current karyotyping systems demand a large amount of high-quality training data to achieve adequate precision for each chromosome. Such high-quality training data and accurate devices are not yet available in some remote pathological laboratories. To prevent false positive detections in low-cost systems and low-quality image settings, this paper improves the classification precision of chromosomes using proposed reliability thresholding metrics and deliberately engineered features. The proposed method has been evaluated using a variation of the deep Alex-Net neural network, SVM, K Nearest-Neighbors, and their cascade pipelines for automated filtering of semi-straight chromosomes. Classification precision improved to over 90% for chromosomes with the more common defects and translocations. Furthermore, a comparative analysis of the proposed thresholding metrics has been conducted, and the best metric is highlighted along with its salient characteristics. The high precision obtained on a very low-quality G-banding database verifies the suitability of the proposed metrics and pruning method for karyotyping facilities in poor countries and low-budget pathological laboratories.

[148] Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images

Yichi Zhang,Zhuo Chen,Lingbing Guo,Lei Liang,Wen Zhang,Huajun Chen

Main category: cs.CV

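TL;DR: Proposes an automatic data engine and a two-stage capability enhancement framework for STructured and Abstractive Reasoning (STAR) on Multi-Modal Relational Knowledge (MMRK), builds the 64K-sample STAR-64K dataset, and validates the approach on several open-source MLLMs, where small models can surpass GPT-4o after training.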

Details Motivation: Existing MLLMs are weak at handling abstract information in the visual modality, particularly STructured and Abstractive Reasoning (STAR) on Multi-Modal Relational Knowledge (MMRK), and both large-scale high-quality data and effective capability enhancement methods are lacking. Method: An automatic STAR data engine synthesizes images containing MMRK and generates multi-modal instruction data with reliable chain-of-thought; a two-stage capability enhancement training framework is proposed, together with evaluation protocols tailored to different STAR tasks. Result: The STAR-64K dataset (64K samples) is built, and experiments on 5 open-source MLLMs show that small 3B/7B models trained with the framework significantly outperform GPT-4o on STAR tasks. Conclusion: The proposed automatic data generation method and two-stage training framework improve MLLM performance on multi-modal structured and abstractive reasoning and provide high-quality data and a scalable technical path for future research. Abstract: Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale high-quality data and capability enhancement methodologies, this paper makes the following key contributions: (i) an automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought thinking for various STAR tasks, and (ii) a comprehensive two-stage capability enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Based upon these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.

[149] A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis

Yi Yin,Yuntao Shou,Zao Dai,Yun Peng,Tao Meng,Wei Ai,Keqin Li

Main category: cs.CV

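TL;DR: Proposes a new framework combining a low-rank Transformer with a flow-based generative model for robust and flexible multimodal survival prediction, performing well in both complete and incomplete modality settings.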

Details Motivation: Real-world multimodal medical data often has missing modalities, and existing methods model the distributional discrepancy across modalities poorly, leading to unreliable reconstruction. Method: The problem is formulated with a multi-instance representation; a class-specific flow performs cross-modal distribution alignment, and a low-rank Transformer models intra-modal dependencies to alleviate overfitting in high-dimensional fusion. Result: Experiments show state-of-the-art performance in both complete and incomplete modality settings, with strong robustness and high accuracy. Conclusion: The proposed framework improves the consistency and reliability of missing-modality reconstruction and markedly improves the robustness of multimodal survival analysis. Abstract: In recent years, multimodal medical data-based survival analysis has attracted much attention. However, real-world datasets often suffer from the problem of incomplete modality, where some patient modality information is missing due to acquisition limitations or system failures. Existing methods typically infer missing modalities directly from observed ones using deep neural networks, but they often ignore the distributional discrepancy across modalities, resulting in inconsistent and unreliable modality reconstruction. To address these challenges, we propose a novel framework that combines a low-rank Transformer with a flow-based generative model for robust and flexible multimodal survival prediction. Specifically, we first formulate the concerned problem as incomplete multimodal survival analysis using the multi-instance representation of whole slide images (WSIs) and genomic profiles. To realize incomplete multimodal survival analysis, we propose a class-specific flow for cross-modal distribution alignment. Under the condition of class labels, we model and transform the cross-modal distribution. By virtue of the reversible structure and accurate density modeling capabilities of the normalizing flow model, the model can effectively construct a distribution-consistent latent space of the missing modality, thereby improving the consistency between the reconstructed data and the true distribution. Finally, we design a lightweight Transformer architecture to model intra-modal dependencies while alleviating the overfitting problem in high-dimensional modality fusion by virtue of the low-rank Transformer.
Extensive experiments have demonstrated that our method not only achieves state-of-the-art performance under complete modality settings, but also maintains robust and superior accuracy under the incomplete modalities scenario.

[150] Towards Accurate and Efficient Waste Image Classification: A Hybrid Deep Learning and Machine Learning Approach

Ngoc-Bao-Quang Nguyen,Tuan-Minh Do,Cong-Tam Phan,Thi-Thu-Hong Phan

Main category: cs.CV

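TL;DR: Compares machine learning, deep learning, and hybrid approaches for image-based waste classification and proposes an efficient hybrid framework that reaches near-perfect accuracy on several public datasets while cutting computational cost substantially through feature selection.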

Details Motivation: Existing research on waste image classification lacks systematic benchmarks, in particular a comprehensive comparison of machine learning, deep learning, and hybrid approaches, and struggles to combine efficiency with accuracy. Method: Three paradigms are compared: (1) machine learning with handcrafted features; (2) deep learning models such as ResNet variants and EfficientNetV2S; (3) a hybrid approach that pairs deep-model feature extraction with classical classifiers such as SVM and Logistic Regression, with feature selection added to reduce dimensionality. Result: The hybrid approach reaches 100% accuracy on TrashNet and the refined Household Garbage Dataset and 99.87% on the Garbage Classification dataset, surpassing existing state-of-the-art methods; feature selection reduces feature dimensionality by over 95%, speeding up training and inference. Conclusion: The proposed hybrid framework keeps accuracy high while lowering inference cost, establishes more reliable benchmarks for waste image classification, and suits large-scale deployment in resource-constrained environments. Abstract: Automated image-based garbage classification is a critical component of global waste management; however, systematic benchmarks that integrate Machine Learning (ML), Deep Learning (DL), and efficient hybrid solutions remain underdeveloped. This study provides a comprehensive comparison of three paradigms: (1) machine learning algorithms using handcrafted features, (2) deep learning architectures, including ResNet variants and EfficientNetV2S, and (3) a hybrid approach that utilizes deep models for feature extraction combined with classical classifiers such as Support Vector Machine and Logistic Regression to identify the most effective strategy. Experiments on three public datasets - TrashNet, Garbage Classification, and a refined Household Garbage Dataset (with 43 corrected mislabels) - demonstrate that the hybrid method consistently outperforms the others, achieving up to 100% accuracy on TrashNet and the refined Household set, and 99.87% on Garbage Classification, thereby surpassing state-of-the-art benchmarks. Furthermore, feature selection reduces feature dimensionality by over 95% without compromising accuracy, resulting in faster training and inference. This work establishes more reliable benchmarks for waste classification and introduces an efficient hybrid framework that achieves high accuracy while reducing inference cost, making it suitable for scalable deployment in resource-constrained environments.

[151] Evaluating ChatGPT's Performance in Classifying Pneumonia from Chest X-Ray Images

Pragna Prahallad,Pranathi Prahallad

Main category: cs.CV

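TL;DR: Evaluates the GPT-4o model's ability to classify pneumonia from chest X-rays in a zero-shot setting; concise, feature-focused prompts work best, reaching 74% accuracy, while reasoning-style prompts perform worse, indicating that clinical use remains limited.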

Details Motivation: To explore GPT-4o's potential for zero-shot classification of medical images without fine-tuning, particularly where resources are limited or labelled data is unavailable. Method: A balanced dataset of 400 X-ray images (200 per class) is used to test four prompt designs, from minimal to detailed reasoning-based, and assess GPT-4o's zero-shot classification performance. Result: Concise, feature-focused prompts reach the highest accuracy of 74%, whereas more complex reasoning-guided prompts perform worse, exposing the model's limitations in visual reasoning and consistent judgment. Conclusion: Although GPT-4o shows some potential for medical image classification, its diagnostic reliability is insufficient for clinical needs, and its visual reasoning and domain adaptation must improve further. Abstract: In this study, we evaluate the ability of OpenAI's gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) was used to assess performance across four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. The results indicate that concise, feature-focused prompts achieved the highest classification accuracy of 74%, whereas reasoning-oriented prompts resulted in lower performance. These findings highlight that while ChatGPT exhibits emerging potential for medical image interpretation, its diagnostic reliability remains limited. Continued advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.

[152] Improving the Physics of Video Generation with VJEPA-2 Reward Signal

Jianhao Yuan,Xiaofeng Zhang,Felix Friedrich,Nicolas Beltran-Velez,Melissa Hall,Reyhane Askari-Hemmat,Xiaochuang Han,Nicolas Ballas,Michal Drozdzal,Adriana Romero-Soriano

Main category: cs.CV

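TL;DR: This report couples the SSL-based video world model VJEPA-2 with the state-of-the-art video generative model MAGI-1 to improve the physical plausibility of generated videos; the entry won the PhysicsIQ Challenge, improving physics plausibility by about 6%.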

Details Motivation: Existing video generative models have limited physical understanding and struggle to produce physically plausible videos, and visual realism does not imply physical correctness. The work explores how the intuitive physics understanding that emerges from self-supervised learning (SSL) pretraining can be used to improve generative models. Method: Building on the MAGI-1 video generative model, VJEPA-2 is introduced as a video world model and its output is used as a reward signal to guide generation, strengthening physical consistency. Result: The method improves the physical plausibility of generated videos on the PhysicsIQ benchmark by about 6% over the original model. Conclusion: Using an SSL-based video world model as a physics prior is an effective way to improve the physical plausibility of video generative models, confirming the potential of combining world models with generative models. Abstract: This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding, and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet, intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build on top of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.

[153] RatioWaveNet: A Learnable RDWT Front-End for Robust and Interpretable EEG Motor-Imagery Classification

Marco Siino,Giuseppe Bonomo,Rosario Sorbello,Ilenia Tinnirello

Main category: cs.CV

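TL;DR: This paper proposes RatioWaveNet, a model built on a trainable rationally-dilated wavelet transform (RDWT) front end, to improve the robustness of motor-imagery brain-computer interfaces when decoding non-invasive EEG signals, with particular gains on the hardest subjects.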

Details Motivation: Decoding EEG-based motor-imagery BCIs remains challenging because of nonstationarity, low signal-to-noise ratio, and inter-subject variability; performance is especially poor on the worst-performing subjects, so model robustness and stability need to improve. Method: RatioWaveNet combines a temporal CNN-Transformer backbone (TCFormer) with a trainable rationally-dilated wavelet transform (RDWT) front end. An undecimated multi-resolution subband decomposition preserves temporal length and shift-invariance; lightweight grouped 1-D convolutions fuse the subbands, which are then passed to a multi-kernel CNN, a grouped-query attention encoder, and a compact TCN head for feature extraction and temporal integration. Result: On BCI-IV-2a and BCI-IV-2b, across five random seeds, RatioWaveNet improves worst-subject accuracy by +0.17/+0.42 percentage points (Sub-Dependent/LOSO) on 2a and by +1.07/+2.54 percentage points on 2b, with consistent average-case gains and modest computational overhead. Conclusion: A trainable wavelet front end strengthens the worst-case reliability of Transformer-based BCI models and is a simple, efficient, modular improvement. Abstract: Brain-computer interfaces (BCIs) based on motor imagery (MI) translate covert movement intentions into actionable commands, yet reliable decoding from non-invasive EEG remains challenging due to nonstationarity, low SNR, and subject variability. We present RatioWaveNet, which augments a strong temporal CNN-Transformer backbone (TCFormer) with a trainable, Rationally-Dilated Wavelet Transform (RDWT) front end. The RDWT performs an undecimated, multi-resolution subband decomposition that preserves temporal length and shift-invariance, enhancing sensorimotor rhythms while mitigating jitter and mild artifacts; subbands are fused via lightweight grouped 1-D convolutions and passed to a multi-kernel CNN for local temporal-spatial feature extraction, a grouped-query attention encoder for long-range context, and a compact TCN head for causal temporal integration. Our goal is to test whether this principled wavelet front end improves robustness precisely where BCIs typically fail - on the hardest subjects - and whether such gains persist on average across seeds under both intra- and inter-subject protocols. On BCI-IV-2a and BCI-IV-2b, across five seeds, RatioWaveNet improves worst-subject accuracy over the Transformer backbone by +0.17 / +0.42 percentage points (Sub-Dependent / LOSO) on 2a and by +1.07 / +2.54 percentage points on 2b, with consistent average-case gains and modest computational overhead. 
These results indicate that a simple, trainable wavelet front end is an effective plug-in to strengthen Transformer-based BCIs, improving worst-case reliability without sacrificing efficiency.
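The property the abstract attributes to the RDWT front end (an undecimated decomposition that keeps every subband at the input length and is shift-invariant) can be illustrated with a much simpler stand-in. The sketch below is an assumption-laden toy: it uses a fixed dyadic Haar filter with circular padding, not the trainable rational-dilation transform of the paper, and the function names `atrous_haar` and `reconstruct` are hypothetical.

```python
def atrous_haar(signal, levels):
    """Undecimated ("a trous") Haar decomposition with circular padding.

    Stand-in for illustration only: fixed dyadic Haar filters, not the
    trainable rationally-dilated transform described in the abstract.
    Returns (details, approx); every list has the same length as `signal`.
    """
    n = len(signal)
    approx = list(signal)
    details = []
    for level in range(levels):
        step = 2 ** level  # dilate the filter instead of downsampling the signal
        shifted = [approx[(i - step) % n] for i in range(n)]
        low = [(a + b) / 2.0 for a, b in zip(approx, shifted)]
        high = [(a - b) / 2.0 for a, b in zip(approx, shifted)]
        details.append(high)
        approx = low
    return details, approx


def reconstruct(details, approx):
    """Invert atrous_haar: at each level, low + high recovers the input."""
    out = list(approx)
    for high in reversed(details):
        out = [low + h for low, h in zip(out, high)]
    return out
```

Because nothing is decimated, a circular shift of the input produces the same circular shift in every subband, which is the shift-invariance the abstract refers to.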

[154] Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?

Michael Aerni,Joshua Swanson,Kristina Nikolić,Florian Tramèr

Main category: cs.CV

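TL;DR: This paper introduces "modal aphasia": current unified multimodal models can memorize visual concepts accurately yet fail systematically when describing them in text, revealing a cross-modal consistency problem and a potential threat to AI safety frameworks.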

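Details Motivation: To expose the systematic inconsistency between the visual and text modalities of current unified multimodal models, in particular the "modal aphasia" phenomenon in how concepts are expressed, and to examine its implications for AI safety. Method: Frontier models are asked to generate iconic movie artwork, and the accuracy of their textual descriptions is compared against it; controlled experiments on synthetic datasets then verify that modal aphasia appears across multiple architectures. Result: Current multimodal models consistently exhibit modal aphasia: even when they can reproduce an image almost perfectly, they confuse key details in textual descriptions. In addition, a model aligned only on text can still generate harmful images, exposing a safety gap. Conclusion: Modal aphasia is a fundamental property of current unified multimodal models rather than a training artifact; it poses a serious challenge for the safety and reliability of multimodal AI and calls for mechanisms that ensure cross-modal consistency. Abstract: We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model aligned solely on text remains capable of generating unsafe images.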

[155] SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

Gyubeum Lim,Yemo Koo,Vijay Krishna Madisetti

Main category: cs.CV

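TL;DR: SCoPE VLM is a new vision-language model framework that uses a Chain of Scroll mechanism and a tailored reinforcement learning method for efficient, low-memory multi-page document question answering; it is the first to explicitly model agentic reading patterns.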

Details Motivation: Existing vision-language models neglect decision-oriented document understanding when processing long-context visual information, and approaches that extend visual embeddings are memory-intensive and hard to deploy locally. Method: SCoPE VLM uses a Chain of Scroll mechanism to navigate documents selectively and recursively, together with a dedicated data generation pipeline and Episodic Group Relative Policy Optimization, a reinforcement learning method designed to narrow the gap between training and inference. Result: The method substantially reduces memory usage, models human-like reading behavior, and performs well on multi-page document question answering. Conclusion: SCoPE VLM is the first framework to explicitly model agentic reading patterns, and it advances the capabilities of multimodal agents on document navigation tasks. Abstract: Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to reduce the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.

[156] Poisson Flow Consistency Training

Anthony Zhang,Mahmut Gokmen,Dennis Hein,Rongjun Ge,Wenjun Xia,Ge Wang,Jin Chen

Main category: cs.CV

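TL;DR: This paper proposes Poisson Flow Consistency Training (PFCT), a method for training the Poisson Flow Consistency Model (PFCM) in isolation, without a pretrained PFGM++; an improved discretization schedule and noise distribution raise sample quality, and the method performs well on low-dose CT image denoising.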

Details Motivation: The PFCM could previously be trained only through distillation, which limits its use across data modalities; a method for training the PFCM in isolation is needed to broaden its applicability. Method: The perturbation kernel is used to remove the dependence on a pretrained PFGM++, and a sinusoidal discretization schedule and a Beta noise distribution are introduced, enabling standalone training of the PFCM (PFCT). Result: On low-dose CT image denoising, PFCT performs well in terms of LPIPS and SSIM, with denoising effectiveness comparable to the Consistency Model. Conclusion: PFCT is a valid method for training the PFCM in isolation, is competitive with other generative models, and makes generative modeling more flexible; further work is needed to optimize it and extend it to other tasks. Abstract: The Poisson Flow Consistency Model (PFCM) is a consistency-style model based on the robust Poisson Flow Generative Model++ (PFGM++) which has achieved success in unconditional image generation and CT image denoising. Yet the PFCM can only be trained in distillation which limits the potential of the PFCM in many data modalities. The objective of this research was to create a method to train the PFCM in isolation called Poisson Flow Consistency Training (PFCT). The perturbation kernel was leveraged to remove the pretrained PFGM++, and the sinusoidal discretization schedule and Beta noise distribution were introduced in order to facilitate adaptability and improve sample quality. The model was applied to the task of low dose computed tomography image denoising and improved the low dose image in terms of LPIPS and SSIM. It also displayed similar denoising effectiveness as models like the Consistency Model. PFCT is established as a valid method of training the PFCM from its effectiveness in denoising CT images, showing potential with competitive results to other generative models. Further study is needed in the precise optimization of PFCT and in its applicability to other generative modeling tasks. The framework of PFCT creates more flexibility for the ways in which a PFCM can be created and can be applied to the field of generative modeling.

[157] A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model

Muhammad Tayyab Khan,Zane Yong,Lequn Chen,Wenhe Feng,Nicholas Yew Jin Tan,Seung Ki Moon

Main category: cs.CV

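TL;DR: Proposes a three-stage hybrid framework that uses modern detection models and vision language models (VLMs) to automatically interpret 2D multi-view engineering drawings.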

Details Motivation: Traditional methods struggle with engineering drawings that have varied layouts, complex orientations, and mixed symbolic and textual content, which makes it hard to interpret design intent, tolerances, and production details accurately. Method: The first stage uses YOLOv11-det for layout segmentation; the second stage uses YOLOv11-obb for orientation-aware, fine-grained annotation detection; the third stage uses two Donut-based, OCR-free vision language models to parse textual and numerical information respectively. Result: The Alphabetical VLM reaches an F1 score of 0.672 and the Numerical VLM reaches 0.963, supporting the approach for textual and quantitative extraction, and the framework produces unified JSON output for integration with CAD systems. Conclusion: The framework improves the accuracy and scalability of automated engineering drawing interpretation and offers an efficient, reliable solution for intelligent manufacturing. Abstract: Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. 
The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.

[158] LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation

Xin Lu,Chuanqing Zhuang,Chenxi Jin,Zhengda Lu,Yiqun Wang,Wu Liu,Jun Xiao

Main category: cs.CV

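TL;DR: Proposes LSF-Animation, a new framework that needs no explicit emotion or identity feature representations: it implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh, improving generalization to unseen speakers and emotional states.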

Details Motivation: Existing methods rely on explicit one-hot encodings of identity and emotion, which limits generalization to unseen speakers, and they ignore the emotional cues inherent in speech, which hurts the naturalness and adaptability of the animation. Method: LSF-Animation implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh. It introduces a Hierarchical Interaction Fusion Block (HIFB) that uses a fusion token to integrate dual transformer features and fuse emotional, motion-related, and identity-related cues. Result: Experiments on the 3DMEAD dataset show that the method outperforms recent state-of-the-art approaches in emotional expressiveness, identity generalization, and animation realism. Conclusion: By removing the dependence on explicit emotion and identity labels and fusing multimodal cues with the HIFB module, LSF-Animation achieves better generalization and naturalness in speech-driven 3D facial animation. Abstract: Speech-driven 3D facial animation has attracted increasing interest owing to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings to represent identity and emotion with given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures the identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate dual transformer features and more effectively integrate emotional, motion-related and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.

[159] Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs

Haicheng Liao,Bonan Wang,Junxian Yang,Chengyue Wang,Zhengbin He,Guohui Zhang,Chengzhong Xu,Zhenning Li

Main category: cs.CV

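TL;DR: This paper proposes WM-MoE, a world model-based motion forecasting framework that unifies perception, temporal memory, and decision making. It targets rare, high-risk corner cases in autonomous driving, combines large language models with a mixture of experts to improve long-horizon reasoning and generalization, and introduces a new benchmark, nuScenes-corner, for evaluation.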

Details Motivation: Existing motion forecasting models perform poorly in rare but safety-critical scenarios (corner cases), mainly because training data is skewed toward common scenes and generalization is limited, which falls short of the safety requirements of autonomous driving. Method: The WM-MoE framework (1) builds a world model with a compact scene representation; (2) introduces a lightweight temporal tokenizer that maps trajectories and context into the feature space of a large language model to strengthen temporal and commonsense priors; and (3) uses a mixture of experts (MoE) to decompose complex scenes, with a router that assigns scenes to specialized experts for intent inference and counterfactual rollouts. Result: Experiments on four datasets (nuScenes, NGSIM, HighD, and MoCAD) show that WM-MoE outperforms existing SOTA methods in both regular and corner-case scenarios and remains robust when data is missing; the new nuScenes-corner benchmark supports evaluation of high-risk scenarios. Conclusion: A world model-based architecture combined with LLM and MoE mechanisms improves the robustness and generalization of motion forecasting in complex, rare driving scenarios, showing its potential for safety-critical applications. Abstract: Accurate and reliable motion forecasting is essential for the safe deployment of autonomous vehicles (AVs), particularly in rare but safety-critical scenarios known as corner cases. Existing models often underperform in these situations due to an over-representation of common scenes in training data and limited generalization capabilities. To address this limitation, we present WM-MoE, the first world model-based motion forecasting framework that unifies perception, temporal memory, and decision making to address the challenges of high-risk corner-case scenarios. The model constructs a compact scene representation that explains current observations, anticipates future dynamics, and evaluates the outcomes of potential actions. To enhance long-horizon reasoning, we leverage large language models (LLMs) and introduce a lightweight temporal tokenizer that maps agent trajectories and contextual cues into the LLM's feature space without additional training, enriching temporal context and commonsense priors. Furthermore, a mixture-of-experts (MoE) is introduced to decompose complex corner cases into subproblems and allocate capacity across scenario types, and a router assigns scenes to specialized experts that infer agent intent and perform counterfactual rollouts. In addition, we introduce nuScenes-corner, a new benchmark that comprises four real-world corner-case scenarios for rigorous evaluation. 
Extensive experiments on four benchmark datasets (nuScenes, NGSIM, HighD, and MoCAD) show that WM-MoE consistently outperforms state-of-the-art (SOTA) baselines and remains robust under corner-case and data-missing conditions, indicating the promise of world model-based architectures for robust and generalizable motion forecasting in fully autonomous vehicles.
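The routing step the abstract describes, where a router assigns each scene to a specialized expert, can be sketched generically. The abstract does not specify the gating function, so the linear gate, the top-1 selection, and the names `route` and `moe_forward` below are all assumptions made for illustration, not WM-MoE's actual design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(scene_features, gate_weights):
    """Pick one expert per scene with a linear gate (assumed design).

    gate_weights holds one weight vector per expert.
    Returns (expert_index, gate_probabilities).
    """
    logits = [sum(w * x for w, x in zip(weights, scene_features))
              for weights in gate_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def moe_forward(scene_features, gate_weights, experts):
    """Run only the selected expert on the scene (top-1 routing)."""
    index, _ = route(scene_features, gate_weights)
    return experts[index](scene_features), index
```

Top-1 routing keeps the per-scene cost at a single expert call; soft mixtures that weight several experts by the gate probabilities are an equally plausible reading of the abstract.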

[160] AI Powered Urban Green Infrastructure Assessment Through Aerial Imagery of an Industrial Township

Anisha Dutta

Main category: cs.CV

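TL;DR: This study proposes an efficient method for estimating urban tree canopy coverage from drone imagery with deep learning, deployed on a cloud computing platform for fast large-scale analysis in support of urban forestry management and sustainable planning.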

Details Motivation: Traditional methods are limited in technical requirements, scalability, data processing, and specialized expertise, which makes it hard to assess urban canopy coverage accurately. Method: A deep learning-based, object-based image analysis method identifies and segments green canopies in high-resolution drone imagery and is deployed on a cloud platform to improve computational efficiency. Result: The method estimates canopy coverage efficiently and accurately at city scale, manages space complexity, and reduces latency, making it suitable for large-scale image analysis. Conclusion: The method is a useful tool for analyzing the spatial distribution and density of urban vegetation; the resulting data can be used to optimize tree plantation planning and assess carbon sequestration potential, supporting sustainable urban development. Abstract: Accurate assessment of urban canopy coverage is crucial for informed urban planning, effective environmental monitoring, and mitigating the impacts of climate change. Traditional practices often face limitations due to inadequate technical requirements, difficulties in scaling and data processing, and the lack of specialized expertise. This study presents an efficient approach for estimating green canopy coverage using artificial intelligence, specifically computer vision techniques, applied to aerial imagery. Our proposed methodology utilizes object-based image analysis, based on deep learning algorithms, to accurately identify and segment green canopies from high-resolution drone images. This approach allows detailed analysis of urban vegetation, capturing variations in canopy density and spatial distribution. To overcome the computational challenges associated with processing large datasets, it was implemented on a cloud platform utilizing high-performance processors. This infrastructure efficiently manages space complexity and ensures affordable latency, enabling the rapid analysis of vast amounts of drone imagery. Our results demonstrate the effectiveness of this approach in accurately estimating canopy coverage at the city scale, providing valuable insights for urban forestry management of an industrial township. The resultant data generated by this method can be used to optimize tree plantation and assess the carbon sequestration potential of urban forests. By integrating these insights into sustainable urban planning, we can foster more resilient urban environments, contributing to a greener and healthier future.

[161] TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge

Shu-Hao Zhang,Wei-Cheng Tang,Chen Wu,Peng Hu,Nan Li,Liang-Jie Zhang,Qi Zhang,Shao-Qun Zhang

Main category: cs.CV

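TL;DR: TernaryCLIP is a lightweight computational framework that converts the weights of CLIP's vision and text encoders to a ternary format, achieving efficient compression and faster inference while maintaining good performance across a range of tasks.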

Details Motivation: Deploying large multimodal models efficiently on resource-constrained devices requires reducing their compute cost, storage, and memory footprint, which motivates exploring the feasibility of extreme quantization such as ternarization. Method: The TernaryCLIP framework uses a ternary weight representation combined with quantization-aware training and knowledge distillation modules to avoid precision loss. Result: Up to 99% of weights are ternarized with a 1.58-bit representation, giving a 16.98x compression ratio, 2.3x inference acceleration, 16x storage reduction, 10x memory optimization, and 60% sparsity, with good performance on zero-shot image classification and image-text retrieval across 41 datasets. Conclusion: TernaryCLIP shows that extreme quantization is feasible for large multimodal models and supports efficient deployment on resource-constrained devices. Abstract: Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts connection weights of both vision and text encoders of CLIP into the ternary format, instead of full-precision or floating ones. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost and high-efficiency computations. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with 1.58-bit representation, 16.98x compression ratio, 2.3x inference acceleration, 16x storage reduction, 10x memory optimization, and 60% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
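The "1.58-bit" figure is the information content of a three-valued weight, log2(3) ≈ 1.585 bits. The sketch below shows that arithmetic together with one common way to ternarize a weight vector. The threshold rule (0.7 times the mean absolute weight, as in the Ternary Weight Networks heuristic) and the function name `ternarize` are assumptions for illustration; the abstract does not say which quantizer TernaryCLIP uses.

```python
import math


def ternarize(weights, threshold_factor=0.7):
    """Map float weights to codes in {-1, 0, +1} plus one shared scale.

    Assumed rule (not taken from the paper): weights whose magnitude is at
    most threshold_factor * mean(|w|) become 0; the rest keep their sign.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_factor * mean_abs
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0  # dequantized weight = scale * code
    return codes, scale


# Information content of one ternary symbol, the source of "1.58-bit".
BITS_PER_TERNARY_WEIGHT = math.log2(3)
```

The zero codes are what produce sparsity: any weight mapped to 0 can be skipped entirely at inference time.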

[162] Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications

Shamim Yazdani,Akansha Singh,Nripsuta Saxena,Zichong Wang,Avash Palikhe,Deng Pan,Umapada Pal,Jie Yang,Wenbin Zhang

Main category: cs.CV

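TL;DR: This survey reviews the development of deep learning-based generative models (GANs, VAEs, DMs), proposes a comprehensive taxonomy, summarizes key technical advances and ethical concerns, and outlines future research directions.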

Details Motivation: Generative models are advancing quickly, the literature is large and scattered, applications are broad, and technical challenges remain, so a systematic review is needed to help researchers keep up with the field. Method: The survey builds a comprehensive taxonomy and systematically organizes the literature on GANs, VAEs, and DMs, including their variants and combined approaches, analyzing technical evolution, key innovations, ethical concerns, and societal impact. Result: It presents a unified classification framework for generative models, summarizes the key techniques that improve generation quality, diversity, and controllability, identifies ethical risks, and lists open challenges. Conclusion: The survey offers a structured body of knowledge and a forward-looking perspective on the fast-moving field of generative AI and can help guide future research. Abstract: In recent years, deep learning-based generative models, particularly Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs), have been instrumental in generating diverse, high-quality content across various domains, such as image and video synthesis. This capability has led to widespread adoption of these models and has captured strong public interest. As they continue to advance at a rapid pace, the growing volume of research, expanding application areas, and unresolved technical challenges make it increasingly difficult to stay current. To address this need, this survey introduces a comprehensive taxonomy that organizes the literature and provides a cohesive framework for understanding the development of GANs, VAEs, and DMs, including their many variants and combined approaches. We highlight key innovations that have improved the quality, diversity, and controllability of generated outputs, reflecting the expanding potential of generative artificial intelligence. In addition to summarizing technical progress, we examine rising ethical concerns, including the risks of misuse and the broader societal impact of synthetic media. Finally, we outline persistent challenges and propose future research directions, offering a structured and forward-looking perspective for researchers in this fast-evolving field.

[163] Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

Dogyun Park,Moayed Haji-Ali,Yanyu Li,Willi Menapace,Sergey Tulyakov,Hyunwoo J. Kim,Aliaksandr Siarohin,Anil Kag

Main category: cs.CV

TL;DR: SPRINT is a sparse-dense residual fusion method for efficient Diffusion Transformers that sharply reduces training compute while preserving generation quality.

Details Motivation: Diffusion Transformers (DiTs) deliver excellent generative performance, but their training cost grows quadratically with sequence length, which limits large-scale pretraining. Existing token dropping methods either degrade at high drop ratios or add too many extra parameters, so a more efficient, lightweight method is needed. Method: SPRINT exploits the different roles of shallow and deep layers: shallow layers process all tokens to preserve local detail, deep layers process only a sparse subset to cut computation, and the two outputs are fused through residual connections. Training uses a two-stage schedule: long masked pre-training for efficiency, followed by short full-token fine-tuning to close the gap between training and inference. Result: On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with FID/FDD comparable to the baseline; at inference, Path-Drop Guidance (PDG) nearly halves FLOPs while improving generation quality. Conclusion: SPRINT is a simple, efficient, and general way to accelerate DiT training; it supports token drop ratios of up to 75% and lowers training and inference cost without sacrificing performance. Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naive strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
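The data flow the abstract describes (early layers see every token, deeper layers see a sparse subset, and the two are fused residually) can be sketched with scalar "tokens". This is a toy under stated assumptions: the per-token functions stand in for transformer blocks and ignore attention, dropped positions are assumed to keep only their shallow features, and `sparse_dense_fusion` is a hypothetical name rather than the paper's code.

```python
import random


def sparse_dense_fusion(tokens, shallow_fn, deep_fn, drop_ratio=0.75, seed=0):
    """Toy sparse-dense residual fusion over a list of scalar tokens.

    shallow_fn and deep_fn are per-token stand-ins for the early and deep
    layer stacks. Assumption: a dropped position receives no deep
    contribution and keeps its shallow output unchanged.
    """
    rng = random.Random(seed)
    dense = [shallow_fn(t) for t in tokens]              # early layers: all tokens
    n_keep = max(1, round(len(tokens) * (1.0 - drop_ratio)))
    keep = sorted(rng.sample(range(len(tokens)), n_keep))
    deep = {i: deep_fn(dense[i]) for i in keep}          # deep layers: sparse subset
    fused = [dense[i] + deep.get(i, 0.0) for i in range(len(tokens))]
    return fused, keep
```

With `drop_ratio=0.75` the deep stack runs on a quarter of the tokens, which is where the training savings in the abstract would come from; the residual path ensures every position still carries the shallow representation.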

[164] LiteDiff

Ruchir Namjoshi,Nagasai Thadishetty,Vignesh Kumar,Hemanth Venkateshwara

Main category: cs.CV

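TL;DR: This paper proposes Lite-Diff, a lightweight fine-tuning method for diffusion models that inserts small adapter modules into a frozen U-Net and adds a latent morphological autoencoder and a pixel-level discriminator, reducing computational overhead and improving adaptation on small medical imaging datasets.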

Details Motivation: Domain-specific data (such as medical images) is limited and fully fine-tuning a diffusion model is computationally expensive, so conventional fine-tuning is difficult and an efficient, low-resource adaptation method is needed. Method: Lite-Diff inserts lightweight adaptation layers into a frozen diffusion U-Net and trains only these small residual modules. It adds a latent morphological autoencoder to improve latent-space consistency and a pixel-level discriminator for adversarial alignment. Ablation studies analyze where in the U-Net blocks the adaptation layers should be placed. Result: Lite-Diff is validated on three chest X-ray datasets (Kaggle, NIH, VinBigData), showing higher adaptation efficiency and better performance than full-model fine-tuning, especially when data is scarce. Conclusion: Lite-Diff is an efficient, practical approach to transfer learning for diffusion models in low-data domains and can help bring them into practical use in specialized fields such as medical imaging. Abstract: In recent years, diffusion models have demonstrated remarkable success in high-fidelity image synthesis. However, fine-tuning these models for specialized domains, such as medical imaging, remains challenging due to limited domain-specific data and the high computational cost of full model adaptation. In this paper, we introduce Lite-Diff (Lightweight Diffusion Model Adaptation), a novel fine-tuning approach that integrates lightweight adaptation layers into a frozen diffusion U-Net while enhancing training with a latent morphological autoencoder (for domain-specific latent consistency) and a pixel-level discriminator (for adversarial alignment). By freezing weights of the base model and optimizing only small residual adapter modules, LiteDiff significantly reduces the computational overhead and mitigates overfitting, even in minimal-data settings. Additionally, we conduct ablation studies to analyze the effects of selectively integrating adaptation layers in different U-Net blocks, revealing an optimal balance between efficiency and performance. Experiments on three chest X-ray datasets - (1) Kaggle Chest X-Ray Pneumonia, (2) NIH Chest X-ray14 and (3) VinBigData Chest X-ray - demonstrate that LiteDiff achieves superior adaptation efficiency compared to naive full fine-tuning. Our framework provides a promising direction for transfer learning in diffusion models, facilitating their deployment in diverse low-data domains.
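The idea of freezing the base weights and training only small residual adapter modules can be illustrated on a single linear layer. The sketch below is generic: the bottleneck shape, the ReLU, the zero-initialized up-projection, and the names `ResidualAdapter` and `adapted_layer` are assumptions for illustration, not details taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class ResidualAdapter:
    """Small bottleneck module: down-project, ReLU, up-project.

    Assumption: the up-projection starts at zero, so the adapted layer
    initially reproduces the frozen layer exactly.
    """

    def __init__(self, dim, bottleneck):
        self.down = [[0.1] * dim for _ in range(bottleneck)]
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return matvec(self.up, hidden)

    def num_parameters(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def adapted_layer(frozen_weight, adapter, x):
    """Frozen layer output plus the adapter's residual correction."""
    base = matvec(frozen_weight, x)  # frozen_weight is never updated
    return [b + a for b, a in zip(base, adapter(x))]
```

Only `adapter.down` and `adapter.up` would receive gradient updates during fine-tuning; with a bottleneck much narrower than `dim`, that is a small fraction of the layer's parameters.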

[165] FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing

Or Ronai,Vladimir Kulikov,Tomer Michaeli

Main category: cs.CV

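TL;DR: Proposes FlowOpt, a gradient-free zero-order optimization framework for efficient, controllable generation with diffusion and flow-matching models; it optimizes through the whole sampling process, comes with a convergence guarantee, and achieves state-of-the-art results on tasks such as image editing.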

Details Motivation: Because sampling in diffusion and flow-matching models is iterative, using gradient-based optimization to directly control the final generated image is computationally impractical; existing methods usually operate one timestep at a time and lack effective control over the whole generation path. Method: FlowOpt treats the entire flow process as a black box and applies zero-order (gradient-free) optimization to the full sampling path without backpropagation. It provides a sufficient condition on the step size for convergence and estimates this upper bound empirically to choose a suitable step size. Result: On image editing tasks, FlowOpt achieves state-of-the-art results with the same number of neural function evaluations; it supports inversion of the initial noise and text-guided image editing, and allows intermediate results to be monitored and optimization to be stopped early. Conclusion: FlowOpt gives diffusion and flow-matching models an efficient, flexible test-time optimization method with a theoretical convergence guarantee, broadening their use in controlled generation tasks such as image editing and restoration. Abstract: The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. Here we introduce FlowOpt - a zero-order (gradient-free) optimization framework that treats the entire flow process as a black box, enabling optimization through the whole sampling path without backpropagation through the model. Our method is both highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt's step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate how FlowOpt can be used for image editing, showcasing two options: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to a target text prompt. In both cases, FlowOpt achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods. Code and examples are available on the project's webpage.
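What "zero-order optimization through a black box" means in practice can be shown with a generic two-point random-direction scheme. This is not FlowOpt's update rule, which the abstract does not spell out; the estimator, the hand-picked step size `lr`, and the toy `toy_sampler` and `toy_loss` functions are all assumptions chosen so that the example runs on its own.

```python
import math
import random


def zero_order_minimize(loss_of_output, black_box, z0,
                        steps=200, lr=0.1, mu=1e-3, seed=0):
    """Minimize loss_of_output(black_box(z)) using function values only.

    Generic scheme (assumed, not the paper's): probe the black box at
    z + mu*u and z - mu*u along a random unit direction u, estimate the
    directional slope, and step against it. No backpropagation is needed.
    """
    rng = random.Random(seed)
    z = list(z0)
    history = [loss_of_output(black_box(z))]
    for _ in range(steps):
        u = [rng.gauss(0.0, 1.0) for _ in z]
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
        plus = loss_of_output(black_box([a + mu * b for a, b in zip(z, u)]))
        minus = loss_of_output(black_box([a - mu * b for a, b in zip(z, u)]))
        slope = (plus - minus) / (2.0 * mu)
        z = [a - lr * slope * b for a, b in zip(z, u)]
        history.append(loss_of_output(black_box(z)))  # lets a caller stop early
    return z, history


def toy_sampler(z):
    """Hypothetical stand-in for a full sampling process: noise -> output."""
    return [2.0 * z[0], z[1] + 1.0]


def toy_loss(output):
    """Squared distance between the output and a fixed target."""
    return (output[0] - 4.0) ** 2 + (output[1] - 3.0) ** 2
```

Calling `zero_order_minimize(toy_loss, toy_sampler, [0.0, 0.0])` drives the latent toward the point whose output matches the target. The step size matters: too large a value of `lr` makes the iteration diverge, which is the kind of condition the abstract's step-size bound addresses.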

[166] Reconnaissance Automatique des Langues des Signes : Une Approche Hybridée CNN-LSTM Basée sur Mediapipe

Fraisse Sacré Takouchouang,Ho Tuong Vinh

Main category: cs.CV

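TL;DR: Proposes an automatic sign language recognition system based on a hybrid CNN-LSTM architecture, using Mediapipe for gesture keypoint extraction, and reaches an average accuracy of 92%.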

Details Motivation: Sign languages are essential to communication in deaf communities but are often marginalized, which limits access to essential services such as healthcare and education; automatic sign language recognition is needed to improve communication and accessibility. Method: A hybrid CNN-LSTM architecture is combined with Mediapipe gesture keypoint extraction, and a real-time gesture translation system is built with Python, TensorFlow, and Streamlit. Result: The system reaches an average accuracy of 92% and performs well on clearly distinct gestures such as "Hello" and "Thank you", but still confuses visually similar gestures such as "Call" and "Yes". Conclusion: The system opens up useful prospects for applications in healthcare, education, and public services, but its ability to separate similar gestures needs to improve. Abstract: Sign languages play a crucial role in the communication of deaf communities, but they are often marginalized, limiting access to essential services such as healthcare and education. This study proposes an automatic sign language recognition system based on a hybrid CNN-LSTM architecture, using Mediapipe for gesture keypoint extraction. Developed with Python, TensorFlow and Streamlit, the system provides real-time gesture translation. The results show an average accuracy of 92%, with very good performance for distinct gestures such as "Hello" and "Thank you". However, some confusions remain for visually similar gestures, such as "Call" and "Yes". This work opens up interesting perspectives for applications in various fields such as healthcare, education and public services.

[167] Caption-Driven Explainability: Probing CNNs for Bias via CLIP

Patrick Koller,Amil V. Dravid,Guido M. Schuster,Aggelos K. Katsaggelos

Main category: cs.CV

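TL;DR: Proposes a caption-based explainable AI method that integrates a standalone model into CLIP to identify the dominant concept contributing most to its prediction, with the aim of improving the robustness of machine learning models.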

Details Motivation: Existing saliency-map-based XAI methods can mislead when salient but spurious features are present, and they struggle with covariate shift, which undermines model robustness. Method: A novel network surgery approach integrates the standalone model to be explained into the contrastive language-image pre-training (CLIP) model, producing a caption-based XAI model that identifies the dominant concept. Result: The method identifies the key semantic concepts that drive the model's predictions and reduces the model's reliance on spurious features. Conclusion: The method lowers the risk of a model being affected by covariate shift and helps build more robust machine learning models. Abstract: Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the model's prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models.

[168] VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT

Hyeonsu Kang,Emily Bao,Anjan Goswami

Main category: cs.CV

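TL;DR: Proposes VLM-SlideEval, a framework for evaluating vision-language models on slide understanding; current models show limitations in pixel-accurate extraction and in understanding narrative structure across slides.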

Details Motivation: Vision-language models (VLMs) are increasingly used to evaluate multimodal content, yet their slide-specific understanding has not been studied in depth. Method: An evaluation framework with three axes is built: element-level extraction, robustness to perturbations in geometry, style, and text, and higher-level comprehension (such as recovering narrative order from shuffled slides); it is tested on publicly available decks from Zenodo. Result: VLMs perform poorly on pixel-accurate extraction and on stability under controlled perturbations; they understand single-slide content reasonably well but do not reliably capture narrative structure across slides. Conclusion: Current VLMs have clear limits as slide evaluators, and more precise evaluators with feedback loops are needed to support iterative refinement in agentic pipelines. Abstract: Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored despite their growing role as critics in agentic, model-forward pipelines. We introduce VLM-SlideEval, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck's narrative order from shuffled slides. Using publicly available decks from Zenodo (https://huggingface.co/datasets/Forceless/Zenodo10K/viewer/default/pptx), we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show non-trivial agreement, fidelity, and consistency under controlled perturbations, while performing better on single-slide content understanding; however, they do not reliably capture narrative structure across slides. These results highlight the limits of current VLMs for slide evaluation and motivate calibrated, critic-in-the-loop evaluators that drive iterative refinement and selection in agentic pipelines.

[169] Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning

Mohammad Ali Etemadi Naeen,Hoda Mohammadzade,Saeed Bagheri Shouraki

Main category: cs.CV

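TL;DR: Proposes a deep learning framework that combines human-centric preprocessing with spatio-temporal modeling for multi-class anomaly detection in surveillance videos, reaching a mean accuracy of 92.41% on a UCF-Crime subset.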

Details Motivation: Address the detection difficulties caused by the diversity of abnormal events, class imbalance, and scene-dependent visual clutter in surveillance videos. Method: YOLO-World detects human instances, ByteTrack performs identity-aware tracking, Gaussian blurring suppresses the background, InceptionV3 extracts spatial features, and a BiLSTM models temporal dynamics for classification. Result: On a five-class UCF-Crime subset, the mean test accuracy reaches 92.41% and every per-class F1-score exceeds 0.85, showing strong generalization and robustness to class imbalance. Conclusion: Foreground-focused preprocessing significantly improves anomaly discrimination in real-world surveillance scenarios. Abstract: Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatio-temporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World - an open-vocabulary vision-language detector - to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are captured using a bidirectional LSTM (BiLSTM) for sequence-level classification. Evaluated on a five-class subset of the UCF-Crime dataset (Normal, Burglary, Fighting, Arson, Explosion), our method achieves a mean test accuracy of 92.41% across three independent trials, with per-class F1-scores consistently exceeding 0.85. Comprehensive evaluation metrics - including confusion matrices, ROC curves, and macro/weighted averages - demonstrate strong generalization and resilience to class imbalance. The results confirm that foreground-focused preprocessing significantly enhances anomaly discrimination in real-world surveillance scenarios.

[170] Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Zheng Qi,Chao Shang,Evangelia Spiliopoulou,Nikolaos Pappas

Main category: cs.CV

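TL;DR: Proposes Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT), a simple yet effective method that pre-computes a holistic visual saliency map by tracking positive changes in visual attention ("gaze shifts") and uses the map at each decoding step to amplify attention to both salient visual information and the user query, reducing the impact of visual attention sink and keeping cross-modal fusion balanced.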

Details Motivation: Existing methods for mitigating hallucination in vision-language models (VLMs) overlook the visual attention sink problem and do not balance cross-modal fusion: they amplify visual attention only, without adjusting attention to the user query, so incorrect regions get amplified and the user's intent is not interpreted correctly. Method: GIFT tracks positive changes in visual attention during user query comprehension ("gaze shifts"), pre-computes a holistic visual saliency map, and uses the map at each decoding step to amplify attention to both salient visual regions and the user query, giving more balanced cross-modal fusion. Result: Experiments show that GIFT effectively mitigates hallucination in VLMs on both generative and classification tasks, with up to 20.7% improvement over greedy decoding, while keeping computational overhead low and general vision-language performance intact. Conclusion: By steering attention toward truly relevant visual regions and balancing cross-modal attention, GIFT reduces hallucination and improves the reliability and performance of VLMs. Abstract: Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
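
The "gaze shift" idea in the abstract above can be illustrated with a minimal sketch. It assumes attention over visual tokens is available as one vector per query-comprehension step; the function name `gaze_shift_saliency`, the normalization, and the toy numbers are hypothetical and are not taken from the paper.

```python
def gaze_shift_saliency(attention_steps):
    """Accumulate only the positive step-to-step attention changes per visual token.

    attention_steps: list of attention vectors (one per step) over the same tokens.
    Returns a saliency vector normalized to sum to 1 (assumed normalization).
    """
    num_tokens = len(attention_steps[0])
    saliency = [0.0] * num_tokens
    for prev, curr in zip(attention_steps, attention_steps[1:]):
        for i in range(num_tokens):
            # Negative changes are ignored; only gains in attention count.
            saliency[i] += max(0.0, curr[i] - prev[i])
    total = sum(saliency)
    return [s / total for s in saliency] if total > 0 else saliency


# Token 0 behaves like an attention sink: high attention, but no positive shift.
steps = [
    [0.70, 0.20, 0.10],
    [0.65, 0.30, 0.05],
    [0.60, 0.38, 0.02],
]
saliency = gaze_shift_saliency(steps)
```

In this toy case all saliency mass lands on token 1, even though token 0 holds the largest raw attention at every step, which is the behavior the abstract attributes to tracking shifts instead of raw scores.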

[171] Scanner-Agnostic MRI Harmonization via SSIM-Guided Disentanglement

Luca Caldera,Lara Cavinato,Francesca Ieva

Main category: cs.CV

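TL;DR: Proposes an image-based harmonization framework for 3D T1-weighted brain MRI that uses a differentiable SSIM loss to separate anatomical content from scanner- and site-specific variation, substantially improving cross-site image consistency, preserving anatomical fidelity, and improving downstream task performance.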

Details Motivation: Variability across MRI scanners, acquisition protocols, and imaging sites hinders consistent analysis and generalization in multicenter neuroimaging studies, so an effective image harmonization method is needed. Method: An image-based harmonization framework for 3D T1-weighted brain MRI uses a differentiable Structural Similarity Index (SSIM) loss that treats the luminance, contrast, and structure components separately, disentangling anatomical information from scanner- and site-related variation; training uses multiple style targets. Result: Validated on multiple public datasets, structural SSIM reaches 0.97, luminance SSIM is 0.98 to 0.99, and the Wasserstein distance between voxel intensity distributions drops substantially; brain age prediction MAE falls from 5.36 to 3.30 years, and Alzheimer's disease classification AUC rises from 0.78 to 0.85. Conclusion: The framework improves the consistency and comparability of multicenter MRI data, preserves anatomical features while improving downstream model performance, and offers a robust and generalizable solution for large-scale multicenter neuroimaging studies. Abstract: The variability introduced by differences in MRI scanner models, acquisition protocols, and imaging sites hinders consistent analysis and generalizability across multicenter studies. We present a novel image-based harmonization framework for 3D T1-weighted brain MRI, which disentangles anatomical content from scanner- and site-specific variations. The model incorporates a differentiable loss based on the Structural Similarity Index (SSIM) to preserve biologically meaningful features while reducing inter-site variability. This loss enables separate evaluation of image luminance, contrast, and structural components. Training and validation were performed on multiple publicly available datasets spanning diverse scanners and sites, with testing on both healthy and clinical populations. Harmonization using multiple style targets, including style-agnostic references, produced consistent and high-quality outputs. Visual comparisons, voxel intensity distributions, and SSIM-based metrics demonstrated that harmonized images achieved strong alignment across acquisition settings while maintaining anatomical fidelity. Following harmonization, structural SSIM reached 0.97, luminance SSIM ranged from 0.98 to 0.99, and Wasserstein distances between mean voxel intensity distributions decreased substantially. Downstream tasks showed substantial improvements: mean absolute error for brain age prediction decreased from 5.36 to 3.30 years, and Alzheimer's disease classification AUC increased from 0.78 to 0.85. 
Overall, our framework enhances cross-site image consistency, preserves anatomical fidelity, and improves downstream model performance, providing a robust and generalizable solution for large-scale multicenter neuroimaging studies.
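
Since the abstract relies on evaluating luminance, contrast, and structure separately, a small sketch of the standard SSIM decomposition may help. It computes the three terms from global statistics of two flattened intensity lists; the usual windowed, differentiable version used for training is not reproduced here, and the constants `k1`, `k2` and `c3 = c2 / 2` follow the common SSIM convention, not values confirmed by the paper.

```python
import math


def ssim_components(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Return (luminance, contrast, structure) for two equal-length intensity lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    # Small stabilizing constants, as in the usual SSIM definition.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    c3 = c2 / 2

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance, contrast, structure


reference = [0.1, 0.4, 0.7, 0.9]
brighter = [v + 0.1 for v in reference]  # same anatomy, shifted intensity
lum, con, struct = ssim_components(reference, brighter)
```

A pure intensity offset, a crude stand-in for a scanner effect, lowers only the luminance term while contrast and structure stay at 1, which is why separating the components is useful for checking that anatomy is preserved.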

[172] Mitigating Coordinate Prediction Bias from Positional Encoding Failures

Xingjian Tao,Yiwei Wang,Yujun Cai,Yihong Luo,Jing Tang

Main category: cs.CV

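TL;DR: Studies why multimodal large language models (MLLMs) struggle with coordinate prediction on high-resolution inputs and finds that degraded visual positional encodings (VPEs) cause predictable directional biases. The authors propose VPSG, a training-free test-time correction method that runs auxiliary decoding with shuffled VPEs to correct the bias, and validate it on ScreenSpot-Pro.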

Details Motivation: MLLMs perform well on vision-language tasks but still struggle with precise coordinate prediction, especially on high-resolution inputs, where weakened positional encodings and directional biases become prominent. The paper aims to find the root cause and propose a remedy. Method: Visual positional encodings (VPEs) are deliberately shuffled to analyze how MLLM behavior changes, revealing non-random, predictable directional errors. Based on this, VPSG runs auxiliary decoding with shuffled VPEs at test time to extract position-unconditioned tendencies and uses them as negative evidence to guide digit prediction, while a lightweight finite-state machine preserves the coordinate format. Result: Experiments show that VPSG improves MLLM coordinate prediction accuracy in high-resolution settings without training, with reliable gains on ScreenSpot-Pro; natural high-resolution data exhibits error patterns similar to those under artificial perturbation, indicating that positional encoding failure is a key bottleneck. Conclusion: Positional encoding robustness is critical to the spatial reasoning ability of MLLMs; VPSG corrects coordinates by exploiting the directional nature of the errors and offers a new approach to improving spatial grounding in multimodal models. Abstract: Multimodal large language models (MLLMs) excel at vision-language tasks such as VQA and document understanding, yet precise coordinate prediction remains challenging. High-resolution inputs exacerbate this difficulty by producing long token sequences that weaken positional encodings and introduce directional biases in coordinate outputs. We investigate this phenomenon by analyzing how MLLMs behave when visual positional encodings (VPEs) are deliberately perturbed through shuffling. Our analysis reveals that such perturbations induce predictable, non-random coordinate biases rather than random errors, suggesting that models rely on internal positional priors when spatial grounding signals are degraded. Crucially, we observe similar directional error patterns in natural high-resolution datasets, indicating that positional encoding failures are a key bottleneck for accurate coordinate prediction at scale. To address this issue, we propose Vision-PE Shuffle Guidance (VPSG), a training-free test-time method that leverages the directional nature of these biases for correction. VPSG runs auxiliary decoding with shuffled VPEs to isolate position-unconditioned tendencies, then uses this as negative evidence to guide digit prediction while preserving coordinate format through a lightweight finite-state machine. Experiments on ScreenSpot-Pro demonstrate reliable improvements, highlighting positional encoding robustness as a critical factor for spatial reasoning in MLLMs.
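
The "negative evidence" step described above can be sketched as a contrastive adjustment of digit logits. The abstract does not give the exact combination rule, so the subtraction `main - alpha * shuffled`, the `alpha` parameter, and the function name `guided_digit` are assumptions for illustration; restricting the choice to digit tokens stands in for the finite-state machine's digit state.

```python
DIGITS = "0123456789"


def guided_digit(main_logits, shuffled_logits, alpha=1.0):
    """Pick the next coordinate digit after discounting the position-unconditioned bias.

    main_logits:     digit -> logit from the normal forward pass
    shuffled_logits: digit -> logit from the auxiliary pass with shuffled VPEs
    alpha:           hypothetical guidance strength
    """
    # Only digits are scored, mimicking an FSM state that expects a digit.
    scores = {d: main_logits[d] - alpha * shuffled_logits[d] for d in DIGITS}
    return max(scores, key=scores.get)


main = {d: 0.0 for d in DIGITS}
main["3"], main["5"] = 2.0, 2.2        # model slightly prefers "5"
shuffled = {d: 0.0 for d in DIGITS}
shuffled["5"] = 2.0                    # "5" is also favored with no position signal

plain_choice = max(main, key=main.get)
guided_choice = guided_digit(main, shuffled)
```

Here the unguided choice is "5", but because "5" is favored even when positional information is destroyed, the guided choice becomes "3".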

[173] Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation

Bailey Trang,Parham Saremi,Alan Q. Wang,Fangrui Huang,Zahra TehraniNasab,Amar Kumar,Tal Arbel,Li Fei-Fei,Ehsan Adeli

Main category: cs.CV

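TL;DR: Proposes Rainbow, a framework that decomposes the input condition into diverse latent representations and uses Generative Flow Networks (GFlowNets) to sample multiple trajectories over a graph, achieving higher diversity and fidelity in conditional image generation.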

Details Motivation: Traditional methods struggle to capture the uncertainty in conditions or prompts, producing images with too little diversity or with differences that are hard to interpret; a method is needed that generates diverse outputs reflecting condition uncertainty systematically. Method: The input condition is decomposed into multiple latent representations; a latent graph parameterized by GFlowNets is introduced, and its graph sampling ability is used to produce multiple trajectories, each representing a different aspect of the uncertainty and each generating a distinct image. Result: Experiments on natural image and medical image datasets show that Rainbow improves both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks. Conclusion: Rainbow is a general and effective conditional image generation framework that explicitly models condition uncertainty and generates diverse, high-quality images via GFlowNets. Abstract: Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often modify random seeds, making it difficult to discern meaningful differences between samples, or diversify the input prompt, which is limited in verbally interpretable diversity. We propose Rainbow, a novel conditional image generation framework, applicable to any pretrained conditional generative model, that addresses inherent condition/prompt uncertainty and generates diverse plausible images. Rainbow is based on a simple yet effective idea: decomposing the input condition into diverse latent representations, each capturing an aspect of the uncertainty and generating a distinct image. First, we integrate a latent graph, parameterized by Generative Flow Networks (GFlowNets), into the prompt representation computation. Second, leveraging GFlowNets' advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, we produce multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images. Evaluations on natural image and medical image datasets demonstrate Rainbow's improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.

[174] GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

Karim Elmaaroufi,Liheng Lai,Justin Svegliato,Yutong Bai,Sanjit A. Seshia,Matei Zaharia

Main category: cs.CV

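TL;DR: Proposes GRAID, a framework that extracts qualitative spatial relations from 2D bounding boxes alone, avoiding 3D reconstruction errors and generative hallucinations and markedly improving the quality of training data for vision-language models. It generates over 8.5 million high-quality VQA samples from the BDD100k, NuImages, and Waymo datasets and demonstrates improved performance on spatial reasoning tasks.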

Details Motivation: Existing vision-language models perform poorly on spatial reasoning, largely because current data generation methods suffer from 3D reconstruction errors, generative hallucinations, and dependence on hyper-detailed annotations, leading to low data quality (a human validation rate of only 57.6%). Method: GRAID derives qualitative spatial relations directly from the 2D bounding boxes output by standard object detectors, without 3D reconstruction or caption generation, thereby avoiding modeling errors and hallucinations; it is applied to BDD100k, NuImages, and Waymo to generate high-quality VQA pairs. Result: A GRAID-generated dataset reaches 91.16% human-validated accuracy, well above the 57.6% of an existing method; on BDD and NuImages, models gain 47.5% and 37.9% accuracy respectively on held-out question types, and improve on several benchmarks such as BLINK. Conclusion: GRAID efficiently generates high-quality spatial reasoning data, improves model generalization, and offers a reliable, scalable data generation approach for spatial understanding in vision-language models. Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of higher quality than existing tools that produce similar datasets as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs creating questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy, compared to 57.6% on a dataset generated by recent work. 
Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2 11B, and when trained on all question types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found on our project page (https://ke7.github.io/graid/).
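
To make the "2D geometric primitives alone" idea concrete, here is a minimal sketch of turning detector boxes into a spatial-relation question and a size comparison. The `Box` class, the strict non-overlap rule for "left of", and the question template are hypothetical choices for illustration and are not GRAID's actual predicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2D box in image coordinates (x grows to the right)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def is_left_of(a: Box, b: Box) -> bool:
    # Conservative rule: a lies entirely to the left of b, with no horizontal overlap.
    return a.x_max < b.x_min


def left_of_question(a: Box, b: Box):
    """Build one (question, answer) pair from two boxes."""
    question = f"Is the {a.label} to the left of the {b.label}?"
    return question, "yes" if is_left_of(a, b) else "no"


def larger_box(a: Box, b: Box) -> Box:
    return a if a.area >= b.area else b


car = Box("car", 10, 50, 60, 90)
pedestrian = Box("pedestrian", 100, 40, 120, 95)
qa_pair = left_of_question(car, pedestrian)
```

Because the answer is computed from box coordinates, it needs neither a depth estimate nor a generated caption, which is the property the abstract credits for the higher validation rate.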

[175] CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding

Lihuang Fang,Xiao Hu,Yuchen Zou,Hong Zhang

Main category: cs.CV

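TL;DR: Proposes CogStereo, a framework that uses monocular depth features as priors to improve stereo matching in occluded and weakly textured regions without relying on dataset-specific priors, achieving cross-domain generalization.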

Details Motivation: Existing deep stereo matching methods depend on fine-tuning, fall short on zero-shot generalization, and struggle with challenging regions such as occlusions and weak textures. Method: CogStereo embeds spatial cognition through monocular depth features, combines pixel-wise uncertainty with cognition-guided features, and applies a dual-conditional refinement mechanism for global correction of mismatches. Result: It performs strongly on Scene Flow, KITTI, Middlebury, and other datasets as well as real-world scenes, reaching state-of-the-art results and showing excellent cross-domain generalization. Conclusion: By incorporating implicit spatial cognition, CogStereo moves stereo matching toward a cognition-driven paradigm resembling foundation models. Abstract: Deep stereo matching has advanced significantly on benchmark datasets through fine-tuning but falls short of the zero-shot generalization seen in foundation models in other vision tasks. We introduce CogStereo, a novel framework that addresses challenging regions, such as occlusions or weak textures, without relying on dataset-specific priors. CogStereo embeds implicit spatial cognition into the refinement process by using monocular depth features as priors, capturing holistic scene understanding beyond local correspondences. This approach ensures structurally coherent disparity estimation, even in areas where geometry alone is inadequate. CogStereo employs a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for consistent global correction of mismatches. Extensive experiments on Scene Flow, KITTI, Middlebury, ETH3D, EuRoc, and real-world scenes demonstrate that CogStereo not only achieves state-of-the-art results but also excels in cross-domain generalization, shifting stereo vision towards a cognition-driven approach.

[176] Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

Wenxuan Bao,Ruxi Deng,Jingrui He

Main category: cs.CV

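TL;DR: Studies how the image embeddings of the pretrained vision-language model CLIP behave under input corruptions and finds that both intra-class and inter-class variances collapse as corruption severity increases ("embedding variance collapse"), a phenomenon closely tied to performance degradation. The authors explain it theoretically: the visual encoder tends to encode corruption-related signals, which weakens class-discriminative features. Based on this, they propose Mint, which improves embedding quality at test time by maximizing pseudo-label-based inter-class variance, works with small batches, and improves CLIP's robustness on multiple benchmarks.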

Details Motivation: Vision-language models such as CLIP have strong zero-shot generalization but remain fragile under distribution shifts caused by input corruptions; their failure mechanism needs to be understood and their robustness improved. Method: The variance of CLIP image embeddings is analyzed across corruption severities, identifying "embedding variance collapse"; theoretical analysis shows how corruption signals compress the representation space; Mint is proposed, which uses a mean accumulator and a gradient accumulator at test time to maximize pseudo-label-based inter-class variance on the fly. Result: Embedding variance collapse is strongly correlated with classification accuracy; theory indicates that corruption signals dilute class features; Mint delivers consistent gains across multiple corruption benchmarks and CLIP architectures and works with small batch sizes. Conclusion: Embedding variance collapse is a key mechanism behind CLIP's performance drop under corruption; maximizing inter-class variance at test time (as in Mint) effectively improves robustness and offers a new direction for adaptation without retraining. Abstract: Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate how corruptions affect CLIP's image embeddings and uncover a consistent phenomenon we term embedding variance collapse, where both intra-class and inter-class variances shrink as corruption severity increases. We find that this collapse is closely tied to performance degradation, with inter-class variance strongly correlated with classification accuracy. To explain this phenomenon, we analyze how corruptions alter the structure of the embedding space. Our theoretical results suggest that the visual encoder tends to encode corruption-related signals, which dilute class-discriminative features and compress the representation geometry. We further show that maximizing inter-class variance, even when estimated from pseudo-labels, can provably enhance embedding quality. Based on this insight, we propose Mint, a simple test-time adaptation method that maximizes pseudo-label-based inter-class variance on the fly using a mean accumulator and a gradient accumulator. Mint operates effectively with small batch sizes and consistently improves performance across multiple corruption benchmarks and CLIP architectures. Our code is available at https://github.com/baowenxuan/Mint .
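
The quantity Mint maximizes can be sketched as follows. This only computes inter-class and intra-class variance from embeddings and pseudo-labels; the mean and gradient accumulators and the optimization step are not shown, and the function names and toy embeddings are hypothetical.

```python
def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance_split(embeddings, pseudo_labels):
    """Return (inter_class_variance, intra_class_variance) under the given pseudo-labels."""
    global_mean = mean_vector(embeddings)
    groups = {}
    for emb, label in zip(embeddings, pseudo_labels):
        groups.setdefault(label, []).append(emb)

    inter = 0.0
    intra = 0.0
    for members in groups.values():
        class_mean = mean_vector(members)
        # Spread of class means around the global mean, weighted by class size.
        inter += len(members) * squared_distance(class_mean, global_mean)
        # Spread of samples around their own class mean.
        intra += sum(squared_distance(e, class_mean) for e in members)
    n = len(embeddings)
    return inter / n, intra / n


labels = [0, 0, 1, 1]
clean = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
# Shrinking every embedding toward the origin mimics the collapse under corruption.
collapsed = [[0.5 * x for x in e] for e in clean]

inter_clean, intra_clean = variance_split(clean, labels)
inter_collapsed, intra_collapsed = variance_split(collapsed, labels)
```

In the toy data both variances fall by a factor of four when the embeddings shrink, matching the collapse pattern the abstract describes; a test-time method in this spirit would push the inter-class term back up.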

[177] egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks

Matthias Jammot,Bjöern Braun,Paul Streli,Rafael Wampfler,Christian Holz

Main category: cs.CV

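TL;DR: Presents egoEMOTION, the first dataset that couples egocentric vision with physiological signals and includes dense self-reports of emotion and personality, aimed at advancing the understanding of internal drivers of human behavior such as emotion and personality.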

Details Motivation: Existing egocentric vision benchmarks mostly ignore the emotional states that shape human decisions and behavior and are limited to physical activities and attention modeling, which restricts the ability of vision systems to capture internal drivers of behavior. Method: Over 50 hours of data were collected from 43 participants in controlled and real-world settings using Meta's Project Aria glasses, recording synchronized eye-tracking video, photoplethysmography, inertial motion data, and more, together with self-reports based on the Circumplex Model, Mikels' Wheel, and the Big Five personality model. Three benchmark tasks are defined: continuous affect classification, discrete emotion classification, and personality trait inference. Result: Experiments show that a simple learning-based baseline using egocentric vision signals outperforms methods that use only physiological signals for real-world affect prediction. Conclusion: egoEMOTION establishes emotion and personality as core dimensions of egocentric perception and opens new directions for affect-driven modeling of behavior, intent, and interaction. Abstract: Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling - assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta's Project Aria glasses. Each session provides synchronized eye-tracking video, head-mounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels' Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. 
Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.

[178] STG-Avatar: Animatable Human Avatars via Spacetime Gaussian

Guangan Jiang,Tianzi Zhang,Dong Li,Zhenjun Zhao,Haoang Li,Mingrui Li,Hongyu Wang

Main category: cs.CV

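TL;DR: Proposes STG-Avatar, a 3DGS-based framework for high-fidelity animatable human avatar reconstruction that combines Spacetime Gaussians with linear blend skinning, improving modeling accuracy in non-rigid and dynamic regions while achieving high-quality real-time rendering.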

Details Motivation: Existing 3DGS-based human avatar methods handle non-rigid objects (such as clothing deformations) and highly dynamic regions (such as rapidly moving limbs) poorly, making high-fidelity animatable reconstruction difficult. Method: A rigid-nonrigid coupled deformation framework combines Spacetime Gaussians (STG) with linear blend skinning (LBS): LBS drives global pose transformations for real-time control, while STG optimizes the 3D Gaussians through spacetime adaptive optimization; optical flow identifies highly dynamic regions and guides adaptive densification of 3D Gaussians there. Result: Experiments show that the method outperforms state-of-the-art methods in both reconstruction quality and runtime efficiency, with better quantitative metrics while retaining real-time rendering. Conclusion: By combining the strengths of LBS and STG, STG-Avatar reconstructs high-quality animatable human avatars from monocular video and markedly improves the representation of complex deformations and dynamic regions. Abstract: Realistic animatable human avatars from monocular videos are crucial for advancing human-robot interaction and enhancing immersive virtual experiences. While recent research on 3DGS-based human avatars has made progress, it still struggles with accurately representing detailed features of non-rigid objects (e.g., clothing deformations) and dynamic regions (e.g., rapidly moving limbs). To address these challenges, we present STG-Avatar, a 3DGS-based framework for high-fidelity animatable human avatar reconstruction. Specifically, we introduce a rigid-nonrigid coupled deformation framework that synergistically integrates Spacetime Gaussians (STG) with linear blend skinning (LBS). In this hybrid design, LBS enables real-time skeletal control by driving global pose transformations, while STG complements it through spacetime adaptive optimization of 3D Gaussians. Furthermore, we employ optical flow to identify high-dynamic regions and guide the adaptive densification of 3D Gaussians in these regions. Experimental results demonstrate that our method consistently outperforms state-of-the-art baselines in both reconstruction quality and operational efficiency, achieving superior quantitative metrics while retaining real-time rendering capabilities. Our code is available at https://github.com/jiangguangan/STG-Avatar
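
Linear blend skinning, the rigid half of the hybrid design above, is a standard technique: each posed point is the weight-averaged result of applying every bone's rigid transform to the rest-pose point. The sketch below uses 2D transforms for brevity; the helper names are hypothetical and nothing here reflects how STG-Avatar applies LBS to 3D Gaussians.

```python
import math


def rotation_2d(angle_rad, translation=(0.0, 0.0)):
    """Build a rigid 2D bone transform as (rotation matrix rows, translation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s), (s, c)), translation


def apply_transform(transform, point):
    (row0, row1), (tx, ty) = transform
    x, y = point
    return (row0[0] * x + row0[1] * y + tx, row1[0] * x + row1[1] * y + ty)


def linear_blend_skinning(point, bone_transforms, weights):
    """Blend the per-bone transformed positions using skinning weights that sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skinning weights must sum to 1")
    x = y = 0.0
    for transform, weight in zip(bone_transforms, weights):
        px, py = apply_transform(transform, point)
        x += weight * px
        y += weight * py
    return (x, y)


rest_vertex = (1.0, 0.0)
bones = [rotation_2d(0.0), rotation_2d(math.pi / 2)]  # one fixed bone, one rotated 90 degrees
posed = linear_blend_skinning(rest_vertex, bones, [0.5, 0.5])
```

With equal weights the vertex lands near (0.5, 0.5), halfway between the two bone results. The blend is purely linear, which makes it cheap but unable to express deformations such as cloth wrinkles, consistent with the abstract pairing LBS with a non-rigid component.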

[179] LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction

Yuhang Gao,Xiang Xiang,Sheng Zhong,Guoyou Wang

Main category: cs.CV

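TL;DR: Proposes LOC, a general language-guided framework that adapts to various occupancy networks and supports both supervised and self-supervised learning. It fuses multi-frame LiDAR points, applies Poisson reconstruction, and assigns semantics via KNN for 3D scene understanding, and introduces Densely Contrastive Learning (DCL) to strengthen open-set recognition.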

Details Motivation: Because 3D datasets are limited, applying existing vision-language models to 3D scene understanding is constrained, so a general framework is needed that can perform open-set recognition with little or no annotated data. Method: LOC fuses multi-frame LiDAR point clouds, fills voids with Poisson reconstruction, and assigns semantics to voxels via KNN. It introduces Densely Contrastive Learning (DCL), which uses voxel-level semantic information and textual prompts to avoid the feature over-homogenization caused by direct high-dimensional feature distillation. Voxel features are embedded in the CLIP space, and classification combines text and image information. Result: Experiments on nuScenes show high-precision predictions for known classes while distinguishing unknown classes without additional training data, markedly improving open-set recognition. Conclusion: LOC addresses data scarcity in 3D scene understanding and, by supporting both self-supervised and supervised learning, achieves strong open-set recognition without dense pixel-level annotation. Abstract: Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene understanding. We propose LOC, a general language-guided framework adaptable to various occupancy networks, supporting both supervised and self-supervised learning paradigms. For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain comprehensive voxel representations. To mitigate feature over-homogenization caused by direct high-dimensional feature distillation, we introduce Densely Contrastive Learning (DCL). DCL leverages dense voxel semantic information and predefined textual prompts. This efficiently enhances open-set recognition without dense pixel-level supervision, and our framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate the method's superior performance, achieving high-precision predictions for known classes and distinguishing unknown classes without additional training data.

[180] Attention Residual Fusion Network with Contrast for Source-free Domain Adaptation

Renrong Shao,Wei Zhang,Jun Wang

Main category: cs.CV

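TL;DR: Proposes the Attention Residual Fusion Network (ARFNet), based on contrastive learning, to address negative transfer and domain shift in source-free domain adaptation (SFDA), achieving strong performance on multiple benchmarks.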

Details Motivation: Most existing SFDA methods focus on domain shift and overlook the effect of negative transfer, which limits performance gains. Method: The ARFNet framework comprises attention residual fusion, global-local attention contrast, and a dynamic centroid evaluation strategy; it decomposes attention into spatial and channel components for cross-layer fusion and self-distillation, and uses contrastive learning to strengthen class discrimination. Result: Experiments on five benchmarks of varying scales show that the method outperforms existing techniques on SFDA tasks. Conclusion: ARFNet effectively alleviates negative transfer and domain shift, improves SFDA performance, and shows good generalization ability and application potential. Abstract: Source-free domain adaptation (SFDA) involves training a model on a source domain and then applying it to a related target domain without access to the source data and labels during adaptation. The complexity of scene information and the lack of the source domain make SFDA a difficult task. Recent studies have shown promising results, but many approaches to domain adaptation concentrate on domain shift and neglect the effects of negative transfer, which may impede enhancements of model performance during adaptation. In this paper, addressing this issue, we propose a novel framework, the Attention Residual Fusion Network (ARFNet), based on contrastive learning for SFDA to alleviate negative transfer and domain shift during the process of adaptation, in which attention residual fusion, global-local attention contrast, and dynamic centroid evaluation are exploited. Concretely, the attention mechanism is first exploited to capture the discriminative region of the target object. Then, in each block, attention features are decomposed into spatial-wise and channel-wise attentions to progressively achieve cross-layer attention residual fusion and self-distillation. During the adaptation process, we contrast global and local representations to improve the perceptual capabilities of different categories, which enables the model to discriminate inter-class from intra-class variations. Finally, a dynamic centroid evaluation strategy is exploited to evaluate the trustworthy centroids and labels for self-supervised self-distillation, which aims to accurately approximate the center of the source domain and pseudo-labels to mitigate domain shift. 
To validate the efficacy, we execute comprehensive experiments on five benchmarks of varying scales. Experimental outcomes indicate that our method surpasses other techniques, attaining superior performance across SFDA benchmarks.

[181] I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions

Shuhong Liu,Lin Gu,Ziteng Cui,Xuangeng Chu,Tatsuya Harada

Main category: cs.CV

TL;DR: Proposes I2-NeRF, a new neural radiance field framework that enhances isometric and isotropic metric perception under media degradation.

Details Motivation: Endow generative AI with perception of the 3D physical world, in particular improving the realism and physical plausibility of reconstruction in complex media such as underwater, haze, and low-light environments. Method: A reverse-stratified upsampling strategy achieves near-uniform sampling across 3D space to preserve isometry, and a radiative formulation unifies emission, absorption, and scattering, modeling media effects with the Beer-Lambert attenuation law. Result: Experiments on real-world datasets show that the method significantly improves reconstruction fidelity and physical plausibility and can estimate medium properties such as water depth. Conclusion: With its improved sampling strategy and radiative modeling, I2-NeRF achieves better 3D perception and reconstruction in complex media environments. Abstract: Participating in efforts to endow generative AI with 3D physical-world perception, we propose I2-NeRF, a novel neural radiance field framework that enhances isometric and isotropic metric perception under media degradation. While existing NeRF models predominantly rely on object-centric sampling, I2-NeRF introduces a reverse-stratified upsampling strategy to achieve near-uniform sampling across 3D space, thereby preserving isometry. We further present a general radiative formulation for media degradation that unifies emission, absorption, and scattering into a particle model governed by the Beer-Lambert attenuation law. By composing the direct and media-induced in-scatter radiance, this formulation extends naturally to complex media environments such as underwater, haze, and even low-light scenes. By treating light propagation uniformly in both vertical and horizontal directions, I2-NeRF enables isotropic metric perception and can even estimate medium properties such as water depth. Experiments on real-world datasets demonstrate that our method significantly improves both reconstruction fidelity and physical plausibility compared to existing approaches.
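
The Beer-Lambert law mentioned above gives the fraction of light that survives a path through an attenuating medium as exp(-sigma * d). The sketch below composes direct and in-scattered radiance using the common homogeneous-medium simplification, in which the medium contributes the complementary fraction (1 - T); that composition, the parameter names, and the sample colors are assumptions for illustration and are not the paper's full formulation.

```python
import math


def transmittance(sigma, distance):
    """Beer-Lambert attenuation: fraction of light surviving `distance` through the medium."""
    return math.exp(-sigma * distance)


def observed_radiance(direct, medium, sigma, distance):
    """Blend object radiance with medium radiance per color channel.

    direct: RGB radiance leaving the object
    medium: RGB radiance of the medium itself (assumed homogeneous)
    """
    t = transmittance(sigma, distance)
    return tuple(t * d + (1.0 - t) * m for d, m in zip(direct, medium))


red_object = (1.0, 0.0, 0.0)
blue_water = (0.0, 0.0, 1.0)
near = observed_radiance(red_object, blue_water, sigma=0.5, distance=0.0)
far = observed_radiance(red_object, blue_water, sigma=0.5, distance=20.0)
```

At zero distance the object color is returned unchanged, while at a distance of 20 the result is almost entirely the medium color. Inverting this relation is what makes it possible in principle to recover medium properties from observed color, as the abstract claims for water depth.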

[182] HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models

Erum Mushtaq,Zalan Fabian,Yavuz Faruk Bakman,Anil Ramakrishna,Mahdi Soltanolkotabi,Salman Avestimehr

Main category: cs.CV

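TL;DR: Proposes HARMONY, a new uncertainty estimation framework that jointly uses the hidden representations and output distribution of vision-language models (VLMs) to assess the reliability of generated responses, outperforming existing methods on multiple VQA benchmarks.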

Details Motivation: Existing uncertainty estimation methods based on probabilities or hidden representations struggle to capture multimodal semantic relationships and are easily influenced by language priors, leading to inaccurate assessment of VLM output reliability. Method: HARMONY jointly uses the fused multimodal activations of the VLM during inference and the probability distribution of output tokens, combining the model's internal confidence in its visual understanding with the output probabilities to estimate uncertainty. Result: On A-OKVQA, VizWiz, and PathVQA, with the VLMs LLaVA-7b, LLaVA-13b, and InstructBLIP, HARMONY improves AUROC by up to 4% and PRR by up to 6% over existing methods. Conclusion: Using both internal representations and the output distribution captures reliability signals more effectively; HARMONY offers a new and effective paradigm for uncertainty estimation in VLMs. Abstract: The growing deployment of Vision-Language Models (VLMs) in high-stakes applications such as autonomous driving and assistive technologies for visually impaired individuals necessitates reliable mechanisms to assess the trustworthiness of their generation. Uncertainty Estimation (UE) plays a central role in quantifying the reliability of model outputs and reducing unsafe generations via selective prediction. In this regard, most existing probability-based UE approaches rely on output probability distributions, aggregating token probabilities into a single uncertainty score using predefined functions such as length-normalization. Another line of research leverages model hidden representations and trains MLP-based models to predict uncertainty. However, these methods often fail to capture the complex multimodal relationships between semantic and textual tokens and struggle to identify biased probabilities often influenced by language priors. Motivated by these observations, we propose a novel UE framework, HARMONY, that jointly leverages fused multimodal information in model activations and the output distribution of the VLM to determine the reliability of responses. The key hypothesis of our work is that both the model's internal belief in its visual understanding, captured by its hidden representations, and the produced token probabilities carry valuable reliability signals that can be jointly leveraged to improve UE performance, surpassing approaches that rely on only one of these components. 
Experimental results on three open-ended VQA benchmarks, A-OKVQA, VizWiz, and PathVQA, and three state-of-the-art VLMs, LLaVA-7b, LLaVA-13b and InstructBLIP, demonstrate that our method consistently performs on par with or better than existing approaches, achieving up to 4% improvement in AUROC, and 6% in PRR, establishing new state of the art in uncertainty estimation for VLMs.
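
For context, the probability-based baseline that the abstract contrasts HARMONY with (aggregating token probabilities with length normalization) can be sketched in a few lines. This is the baseline family only, not HARMONY itself; the function name and the example probabilities are made up for illustration.

```python
import math


def length_normalized_uncertainty(token_probs):
    """Average negative log-probability of the generated tokens.

    Dividing by the number of tokens keeps long answers from looking
    uncertain merely because they contain more tokens.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


confident_answer = [0.90, 0.95, 0.90]
hesitant_answer = [0.50, 0.40, 0.60, 0.50]
u_confident = length_normalized_uncertainty(confident_answer)
u_hesitant = length_normalized_uncertainty(hesitant_answer)
```

A score like this sees only the output distribution, so a fluent answer driven by language priors can still look confident; that gap is the motivation the abstract gives for also reading the model's hidden representations.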

[183] Scaling Non-Parametric Sampling with Representation

Vincent Lu,Aaron Truong,Zeyu Yun,Yubei Chen

Main category: cs.CV

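TL;DR: Proposes a simple, non-parametric generative model grounded in three principles of natural images that produces high-quality images without training and reveals a compositional mechanism of "part-whole generalization".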

Details Motivation: Although existing image generative models are highly photorealistic, their mechanisms remain opaque; the paper aims to strip away complicated engineering tricks and explore a minimal theory of natural-image structure. Method: Based on three properties of natural images (spatial non-stationarity, low-level regularities, and high-level semantics), each pixel's distribution is defined from its local context window, yielding a non-parametric generative model that needs no training. Result: The model produces high-fidelity samples on MNIST and visually compelling images on CIFAR-10; its white-box nature allows the generation mechanism to be analyzed by tracing each pixel back to its source. Conclusion: The model combines a simple design with strong performance and reveals a compositional "part-whole generalization" procedure, suggesting a hypothesis for how large models generalize. Abstract: Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms still remain opaque. Rather than advancing scaling, our goal is to strip away complicated engineering tricks and propose a simple, non-parametric generative model. Our design is grounded in three principles of natural images, namely (i) spatial non-stationarity, (ii) low-level regularities, and (iii) high-level semantics, and defines each pixel's distribution from its local context window. Despite its minimal architecture and no training, the model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images. This combination of simplicity and strong empirical performance points toward a minimal theory of natural-image structure. The model's white-box nature also allows us to have a mechanistic understanding of how the model generalizes and generates diverse images. We study it by tracing each generated pixel back to its source images. These analyses reveal a simple, compositional procedure for "part-whole generalization", suggesting a hypothesis for how large neural network generative models learn to generalize.

[184] MOGRAS: Human Motion with Grasping in 3D Scenes

Kunal Bhosikar,Siddharth Katageri,Vivek Madhavaram,Kai Han,Charu Sharma

Main category: cs.CV

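TL;DR: This paper introduces MOGRAS, a dataset for generating physically plausible full-body grasping motions in 3D scenes. It bridges the gap between scene-aware motion generation and fine-grained grasping, and the paper also proposes an effective method for improving how well existing models adapt to 3D scenes.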

Details Motivation: Existing full-body motion generators lack precise modeling of fine-grained grasping, while precise grasping methods ignore the 3D scene context, so neither produces full-body grasping motions that are both realistic and physically plausible. Method: Builds MOGRAS, a large-scale dataset with richly annotated 3D indoor scenes, pre-grasp walking motions, and final grasping poses, and uses it to propose a simple yet effective method that adapts existing approaches to 3D scenes. Result: Quantitative and qualitative experiments validate MOGRAS, expose the limitations of existing methods, and show significant improvements from the proposed method in scene-aware full-body grasping generation. Conclusion: MOGRAS provides an important foundation for more realistic full-body human-object-scene interaction, with applications in robotics and virtual reality. Abstract: Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.

[185] LongCat-Video Technical Report

Meituan LongCat Team,Xunliang Cai,Qilong Huang,Zhuoliang Kang,Hongyu Li,Shijun Liang,Liya Ma,Siyu Ren,Xiaoming Wei,Rixu Xie,Tong Zhang

Main category: cs.CV

TL;DR: LongCat-Video is a 13.6B-parameter foundational video generation model that supports multiple tasks and efficiently generates high-quality long videos.

Details Motivation: To advance world models by enabling efficient long-video generation. Method: Built on the Diffusion Transformer framework with a single unified architecture for multiple tasks; Video-Continuation pretraining and Block Sparse Attention improve the quality and efficiency of long-video generation. Result: Performs strongly on text-to-video, image-to-video, and video-continuation tasks, generating 720p, 30fps long videos within minutes while maintaining quality and temporal coherence. Conclusion: LongCat-Video is an important step toward world models, combining efficient inference with strong multi-task performance. Abstract: Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.

[186] TrajGATFormer: A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments

Mohammed Alduais,Xinming Li,Qipei Mei

Main category: cs.CV

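TL;DR: This paper proposes a framework combining YOLOv10n and DeepSORT, together with two new trajectory prediction models, TrajGATFormer and TrajGATFormer-Obstacle, to predict worker and obstacle trajectories on construction sites more accurately and thereby improve collision-avoidance systems.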

Details Motivation: Close interaction among workers, machinery, and moving obstacles in construction environments introduces new safety risks. Traditional methods adapt poorly to dynamic environments, and existing data-driven methods fall short in capturing long-term behavior and spatial-social context, so more accurate trajectory prediction models are needed. Method: Uses YOLOv10n for object detection and DeepSORT for multi-object tracking, and builds TrajGATFormer and TrajGATFormer-Obstacle, which combine a transformer encoder-decoder with Graph Attention Networks (GAT), to predict the future trajectories of workers and obstacles. Result: TrajGATFormer reaches an ADE of 1.25 m and an FDE of 2.3 m over a 4.8 s prediction horizon; TrajGATFormer-Obstacle lowers these to an ADE of 1.15 m and an FDE of 2.2 m. Compared with traditional methods, ADE and FDE drop by up to 35% and 38%, respectively. Conclusion: The proposed framework and models predict trajectories in construction scenes more accurately than traditional methods by capturing spatio-temporal and social interactions, which can help improve site safety. Abstract: As the demand grows within the construction industry for processes that are not only faster but also safer and more efficient, offsite construction has emerged as a solution, though it brings new safety risks due to the close interaction between workers, machinery, and moving obstacles. Predicting the future trajectories of workers and taking into account social and environmental factors is a crucial step for developing collision-avoidance systems to mitigate such risks. Traditional methods often struggle to adapt to the dynamic and unpredictable nature of construction environments. Many rely on simplified assumptions or require hand-crafted features, limiting their ability to respond to complex, real-time interactions between workers and moving obstacles. While recent data-driven methods have improved the modeling of temporal patterns, they still face challenges in capturing long-term behavior and accounting for the spatial and social context crucial to collision risk assessment. To address these limitations, this paper proposes a framework integrating YOLOv10n and DeepSORT for precise detection and tracking, along with two novel trajectory prediction models: TrajGATFormer and TrajGATFormer-Obstacle. YOLOv10n serves as the backbone for object detection, accurately identifying workers and obstacles in diverse scenes, while DeepSORT efficiently tracks them over time with unique IDs for continuity.
Both models employ a transformer encoder-decoder with Graph Attention Networks (GAT) to capture temporal and spatial interactions. TrajGATFormer predicts worker trajectories with an ADE of 1.25 m and FDE of 2.3 m over a 4.8 s horizon, while TrajGATFormer-Obstacle extends prediction to both workers and obstacles, achieving higher accuracy (ADE 1.15 m, FDE 2.2 m). Comparative analysis shows both models outperform traditional methods, reducing ADE and FDE by up to 35% and 38%, respectively.
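
The ADE and FDE figures quoted above are standard displacement metrics. As a minimal sketch (not the authors' evaluation code), assuming each trajectory is a list of 2D positions in metres sampled at the same future time steps, they can be computed as follows; the function name `ade_fde` and the toy trajectories are illustrative only.

```python
import math

def ade_fde(pred, truth):
    """Return (ADE, FDE) for two equally long lists of (x, y) positions in metres."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean displacement error at every predicted time step.
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(errors) / len(errors)  # Average Displacement Error over the horizon
    fde = errors[-1]                 # Final Displacement Error at the last step
    return ade, fde

# Toy example: the prediction drifts sideways by 1 m per step.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, truth)
```

ADE averages the error over the whole horizon, while FDE looks only at the last step, which is why FDE is usually the larger of the two numbers reported for a model.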

[187] DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum

Yaokun Li,Lihe Ding,Xiao Chen,Guang Tan,Tianfan Xue

Main category: cs.CV

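TL;DR: This paper proposes DynamicTree, the first framework that generates long-term, interactive animation of 3D Gaussian Splatting trees. It represents tree motion with a compact sparse voxel spectrum, enabling fast feed-forward generation and real-time interactive response.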

Details Motivation: Existing methods still struggle to generate 4D dynamics for complex real trees and find it hard to balance realism with computational efficiency. Method: Proposes the DynamicTree framework, which generates mesh motion from a sparse voxel spectrum and binds Gaussians to the deformed mesh; also introduces 4DTree, a large-scale synthetic dataset, for training. Result: Experiments show the method significantly outperforms existing approaches in both visual quality and computational efficiency and supports long-term, interactive, realistic tree animation. Conclusion: DynamicTree offers an efficient and realistic solution for dynamic modeling of 3D trees, suited to virtual reality, games, and world simulation. Abstract: Generating dynamic and interactive 3D objects, such as trees, has wide applications in virtual reality, games, and world simulation. Nevertheless, existing methods still face various challenges in generating realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive animation of 3D Gaussian Splatting trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key success of our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under external forces, allowing real-time interactive responses. To train our model, we also introduce 4DTree, the first large-scale synthetic 4D tree dataset containing 8,786 animated tree meshes with semantic labels and 100-frame motion sequences. Extensive experiments demonstrate that our method achieves realistic and responsive tree animations, significantly outperforming existing approaches in both visual quality and computational efficiency.

[188] GALA: A GlobAl-LocAl Approach for Multi-Source Active Domain Adaptation

Juepeng Zheng,Peifeng Zhang,Yibin Wen,Qingmei Li,Yang Zhang,Haohuan Fu

Main category: cs.CV

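TL;DR: This paper proposes GALA, a multi-source active domain adaptation (MS-ADA) strategy that combines global clustering with a local selection criterion and approaches the fully supervised upper bound using only 1% of target-domain annotations.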

Details Motivation: A large performance gap remains between existing domain adaptation methods and fully supervised learning, and multi-source domain adaptation lacks an effective mechanism for exploiting a small number of target-domain annotations. Method: GALA first runs global k-means clustering on target-domain samples and then applies a local selection criterion within each cluster, jointly addressing inter-class diversity and multi-source domain variation; it is plug-and-play and adds no trainable parameters. Result: Experiments on three standard domain adaptation benchmarks show that GALA consistently outperforms existing active learning and active domain adaptation methods. Conclusion: GALA is a simple yet effective MS-ADA method that substantially improves performance with very few target annotations, narrowing the gap to fully supervised learning. Abstract: Domain Adaptation (DA) provides an effective way to tackle target-domain tasks by leveraging knowledge learned from source domains. Recent studies have extended this paradigm to Multi-Source Domain Adaptation (MSDA), which exploits multiple source domains carrying richer and more diverse transferable information. However, a substantial performance gap still remains between adaptation-based methods and fully supervised learning. In this paper, we explore a more practical and challenging setting, named Multi-Source Active Domain Adaptation (MS-ADA), to further enhance target-domain performance by selectively acquiring annotations from the target domain. The key difficulty of MS-ADA lies in designing selection criteria that can jointly handle inter-class diversity and multi-source domain variation. To address these challenges, we propose a simple yet effective strategy, GALA, which combines a global k-means clustering step for target-domain samples with a cluster-wise local selection criterion, effectively tackling the above two issues in a complementary manner. Our proposed GALA is plug-and-play and can be seamlessly integrated into existing DA frameworks without introducing any additional trainable parameters. Extensive experiments on three standard DA benchmarks demonstrate that GALA consistently outperforms prior active learning and active DA methods, achieving performance comparable to the fully supervised upper bound while using only 1% of the target annotations.
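
The global-then-local selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the GALA implementation: the abstract does not specify the local criterion, so predictive entropy is used here as a stand-in, and the deterministic k-means initialisation, the function names, and the toy data are all hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; returns a cluster index for every point."""
    # Assumption: deterministic init from the first k points, for reproducibility.
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(features, probs, k):
    """Global step: cluster target features. Local step: pick one sample per cluster."""
    assign = kmeans(features, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            # Stand-in local criterion: the most uncertain sample in the cluster.
            chosen.append(max(members, key=lambda i: entropy(probs[i])))
    return sorted(chosen)

features = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]
probs = [(0.9, 0.1), (0.5, 0.5), (0.6, 0.4), (0.99, 0.01)]
chosen = select_for_annotation(features, probs, k=2)
```

Clustering first spreads the annotation budget across the target feature space, and the per-cluster criterion then decides which sample inside each region is worth labeling.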

[189] Enpowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need

Yongchuan Cui,Peng Liu,Hui Zhang

Main category: cs.CV

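TL;DR: This paper proposes UniPAN, a unified distribution strategy that uses a distribution transformation function to make remote sensing data from different sources follow the same distribution. This improves the generalization of deep pansharpening models to unseen satellite data, with the goal of "train once, deploy forever".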

Details Motivation: Existing deep learning models perform well on their training data but degrade substantially on unseen data from different sensors and imaging conditions; the main cause is the distribution discrepancy between training and test data. Method: Constructs a distribution transformation function that normalizes pixels from different sources to an identical distribution; the model is trained on the transformed data, and new data are transformed the same way at test time so that training and test distributions match. Result: Experiments show that UniPAN significantly improves the performance of several deep pansharpening models across different satellite sensors. Conclusion: By unifying training and test distributions, UniPAN addresses performance degradation on cross-domain remote sensing data and offers a practical route to more general and robust remote sensing image processing. Abstract: Existing deep learning-based models for remote sensing pansharpening exhibit exceptional performance on training datasets. However, due to sensor-specific characteristics and varying imaging conditions, these models suffer from substantial performance degradation when applied to unseen satellite data, lacking generalizability and thus limiting their applicability. We argue that the performance drops stem primarily from distributional discrepancies from different sources and the key to addressing this challenge lies in bridging the gap between training and testing distributions. To validate the idea and further achieve a "train once, deploy forever" capability, this paper introduces a novel and intuitive approach to empower any pansharpening models with generalizability by employing a unified distribution strategy (UniPAN). Specifically, we construct a distribution transformation function that normalizes the pixels sampled from different sources to conform to an identical distribution. The deep models are trained on the transformed domain, and during testing on new datasets, the new data are also transformed to match the training distribution. UniPAN aims to train and test the model on a unified and consistent distribution, thereby enhancing its generalizability. Extensive experiments validate the efficacy of UniPAN, demonstrating its potential to significantly enhance the performance of deep pansharpening models across diverse satellite sensors. Codes: https://github.com/yc-cui/UniPAN.
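
To make the idea of a distribution transformation concrete, here is a minimal sketch that maps pixel values to a common distribution via their empirical CDF rank. The abstract does not say which transformation UniPAN uses, so this rank-based mapping to a uniform distribution is an assumption chosen for illustration; `to_uniform` and the two toy sensors are hypothetical.

```python
def to_uniform(values):
    """Map values into (0, 1) by empirical CDF rank; ties share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the whole group of values tied with values[order[i]].
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in range(i, j + 1):
            out[order[t]] = avg_rank / (n + 1)
        i = j + 1
    return out

# Two hypothetical sensors observing the same scene at different radiometric scales.
sensor_a = [10, 40, 20, 30]
sensor_b = [1000, 4000, 2000, 3000]
uniform_a = to_uniform(sensor_a)
uniform_b = to_uniform(sensor_b)
```

After the mapping, both sensors yield identical values, so a model trained on one transformed source sees inputs with the same distribution when the other source is transformed the same way at test time.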

[190] Audio Frequency-Time Dual Domain Evaluation on Depression Diagnosis

Yu Luo,Nan Huang,Sophie Yu,Hendry Xu,Jerry Wang,Colin Wang,Zhichao Liu,Chen Zeng

Main category: cs.CV

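TL;DR: This study combines frequency-time dual-domain multimodal features of speech signals with deep learning models to build an algorithm for intelligent depression assessment and diagnosis; experiments show strong performance on the depression classification task.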

Details Motivation: Depression diagnosis involves complex procedures, ambiguous criteria, and low consultation rates, which hinder timely assessment and intervention and call for more effective intelligent diagnostic methods. Method: Treats voice as a physiological signal, extracts its frequency-time dual-domain multimodal features, and combines them with deep learning models to build an intelligent assessment and diagnostic algorithm for depression. Result: The proposed method performs well on the depression classification task, showing high diagnostic accuracy and application potential. Conclusion: Intelligent diagnosis based on voice signals and deep learning offers a new and effective route for depression screening and diagnosis. Abstract: Depression, as a typical mental disorder, has become a prevalent issue significantly impacting public health. However, the prevention and treatment of depression still face multiple challenges, including complex diagnostic procedures, ambiguous criteria, and low consultation rates, which severely hinder timely assessment and intervention. To address these issues, this study adopts voice as a physiological signal and leverages its frequency-time dual domain multimodal characteristics along with deep learning models to develop an intelligent assessment and diagnostic algorithm for depression. Experimental results demonstrate that the proposed method achieves excellent performance in the classification task for depression diagnosis, offering new insights and approaches for the assessment, screening, and diagnosis of depression.

[191] Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation

Jeongin Kim,Wonho Bae,YouLee Han,Giyeong Oh,Youngjae Yu,Danica J. Sutherland,Junhyug Noh

Main category: cs.CV

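TL;DR: This paper proposes a low-budget active learning method for semantic segmentation. It uses a two-stage selection pipeline with multi-scale features from a diffusion model and decouples diversity from uncertainty to pick the most informative pixels for annotation.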

Details Motivation: Semantic segmentation requires dense pixel-level annotation, which is costly, especially under very low labeling budgets; existing methods struggle to stay accurate in that regime, so a more efficient active learning strategy is needed. Method: A two-stage selection pipeline. Stage one extracts multi-scale features with a pre-trained diffusion model and uses MaxHerding for hierarchical, representation-based selection of representative pixels. Stage two computes an entropy-augmented disagreement score (eDALD) over noisy multi-scale diffusion features, combining epistemic uncertainty and prediction confidence to pick the pixels to annotate. Result: On four benchmarks (CamVid, ADE-Bed, Cityscapes, Pascal-Context), the method significantly outperforms existing baselines under extremely low pixel budgets. Conclusion: The two-stage active learning framework reaches high segmentation accuracy with very few labeled pixels, addressing annotation efficiency for low-budget semantic segmentation. Abstract: Semantic segmentation demands dense pixel-level annotations, which can be prohibitively expensive - especially under extremely constrained labeling budgets. In this paper, we address the problem of low-budget active learning for semantic segmentation by proposing a novel two-stage selection pipeline. Our approach leverages a pre-trained diffusion model to extract rich multi-scale features that capture both global structure and fine details. In the first stage, we perform a hierarchical, representation-based candidate selection by first choosing a small subset of representative pixels per image using MaxHerding, and then refining these into a diverse global pool. In the second stage, we compute an entropy-augmented disagreement score (eDALD) over noisy multi-scale diffusion features to capture both epistemic uncertainty and prediction confidence, selecting the most informative pixels for annotation. This decoupling of diversity and uncertainty lets us achieve high segmentation accuracy with only a tiny fraction of labeled pixels. Extensive experiments on four benchmarks (CamVid, ADE-Bed, Cityscapes, and Pascal-Context) demonstrate that our method significantly outperforms existing baselines under extreme pixel-budget regimes. Our code is available at https://github.com/jn-kim/two-stage-edald.

[192] DiffusionLane: Diffusion Model for Lane Detection

Kunyang Zhou,Yeqin Shao

Main category: cs.CV

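TL;DR: This paper proposes DiffusionLane, a diffusion-based lane detection method that treats lane detection as a denoising diffusion process in the lane parameter space and improves feature representation with a hybrid decoding strategy and an auxiliary head; experiments show strong performance and generalization on several benchmarks.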

Details Motivation: To improve the accuracy and generalization of lane detection, particularly robustness to noise and complex scenes, the authors explore a diffusion-based paradigm that progressively refines lane parameters. Method: Gaussian noise is added to the starting point and angle of ground-truth lanes to produce noisy anchors, and the model learns to denoise them step by step to recover the target lanes. A hybrid diffusion decoder fuses global and local features, and an auxiliary head used during training strengthens supervision of the encoder. Result: Outperforms existing methods on four datasets (Carlane, Tusimple, CULane, LLAMAS): with ResNet18 it exceeds prior methods by at least 1% accuracy on Carlane, and it reaches 81.32% F1 on CULane with MobileNetV4, 96.89% accuracy on Tusimple with ResNet34, and 97.59% F1 on LLAMAS with ResNet101. Conclusion: By modeling lane parameter refinement as a diffusion process and combining it with a hybrid decoder and auxiliary supervision, DiffusionLane improves lane detection performance and generalization and offers a new framework for the task. Abstract: In this paper, we present a novel diffusion-based model for lane detection, called DiffusionLane, which treats the lane detection task as a denoising diffusion process in the parameter space of the lane. Firstly, we add the Gaussian noise to the parameters (the starting point and the angle) of ground truth lanes to obtain noisy lane anchors, and the model learns to refine the noisy lane anchors in a progressive way to obtain the target lanes. Secondly, we propose a hybrid decoding strategy to address the poor feature representation of the encoder, resulting from the noisy lane anchors. Specifically, we design a hybrid diffusion decoder to combine global-level and local-level decoders for high-quality lane anchors. Then, to improve the feature representation of the encoder, we employ an auxiliary head in the training stage to adopt the learnable lane anchors for enriching the supervision on the encoder. Experimental results on four benchmarks, Carlane, Tusimple, CULane, and LLAMAS, show that DiffusionLane possesses a strong generalization ability and promising detection performance compared to the previous state-of-the-art methods. For example, DiffusionLane with ResNet18 surpasses the existing methods by at least 1% accuracy on the domain adaptation dataset Carlane. Besides, DiffusionLane with MobileNetV4 gets 81.32% F1 score on CULane, 96.89% accuracy on Tusimple with ResNet34, and 97.59% F1 score on LLAMAS with ResNet101. Code will be available at https://github.com/zkyntu/UnLanedet.
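
The noising step described above can be illustrated with the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. This is a sketch under assumptions: the abstract only says that Gaussian noise is added to the lane's starting point and angle, so the linear beta schedule, the step count, the normalized (start_x, start_y, angle) encoding, and the function names are all hypothetical.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s < t under a linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def noisy_anchor(lane, t, rng):
    """Sample a noisy lane anchor at step t from clean lane parameters."""
    a = alpha_bar(t)
    return tuple(math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
                 for v in lane)

rng = random.Random(0)
gt_lane = (0.5, 0.9, 0.25)  # hypothetical normalized (start_x, start_y, angle)
clean = noisy_anchor(gt_lane, 0, rng)    # t = 0 leaves the parameters unchanged
noisy = noisy_anchor(gt_lane, 500, rng)  # larger t mixes in more Gaussian noise
```

A detector trained in this setting would receive `noisy` together with image features and be asked to recover `gt_lane`; at inference time it would start from pure noise and refine the anchors over several steps.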

[193] Real-Time Semantic Segmentation on FPGA for Autonomous Vehicles Using LMIINet with the CGRA4ML Framework

Amir Mohammad Khadem Hosseini,Sattar Mirzakuchaki

Main category: cs.CV

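TL;DR: This paper presents an FPGA-based implementation of real-time semantic segmentation using the lightweight LMIINet architecture and the CGRA4ML hardware framework. With 8-bit quantization-aware training it reaches 20 FPS, about 90% pixel accuracy, and 45% mIoU on the ZCU104 board, showing potential for better power efficiency than traditional GPU solutions.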

Details Motivation: Real-time semantic segmentation is essential for applications such as autonomous driving, but it must deliver high accuracy within computational and hardware limits, and existing methods struggle to balance efficiency and performance. Method: Uses the lightweight LMIINet network with the CGRA4ML reconfigurable hardware framework, compresses the model with 8-bit quantization-aware training, and adapts skip connections, depthwise-separable and 1×1 convolutions, and the Flatten Transformer module to the hardware. Result: Achieves about 90% pixel accuracy and 45% mIoU on Cityscapes, running at 20 FPS with 50.1 ms latency on the ZCU104 FPGA, with a fourfold reduction in memory footprint. Conclusion: Combined with quantization and hardware-oriented optimization, the CGRA4ML framework can run advanced semantic segmentation models efficiently on FPGA, keeping accuracy competitive while improving power efficiency, and offers an alternative to traditional GPUs for real-time applications. Abstract: Semantic segmentation has emerged as a fundamental problem in computer vision, gaining particular importance in real-time applications such as autonomous driving. The main challenge is achieving high accuracy while operating under computational and hardware constraints. In this research, we present an FPGA-based implementation of real-time semantic segmentation leveraging the lightweight LMIINet architecture and the Coarse-Grained Reconfigurable Array for Machine Learning (CGRA4ML) hardware framework. The model was trained using Quantization-Aware Training (QAT) with 8-bit precision on the Cityscapes dataset, reducing memory footprint by a factor of four while enabling efficient fixed-point computations. Necessary modifications were applied to adapt the model to CGRA4ML constraints, including simplifying skip connections, employing hardware-friendly operations such as depthwise-separable and 1×1 convolutions, and redesigning parts of the Flatten Transformer. Our implementation achieves approximately 90% pixel accuracy and 45% mean Intersection-over-Union (mIoU), operating in real-time at 20 frames per second (FPS) with 50.1 ms latency on the ZCU104 FPGA board. The results demonstrate that CGRA4ML, with its flexibility in mapping modern layers and its use of off-chip memory for skip connections, provides a path for implementing advanced semantic segmentation networks on FPGA for real-time applications that outperform traditional GPU solutions in power efficiency while maintaining competitive accuracy.
The code for this project is publicly available at https://github.com/STAmirr/cgra4ml_semantic_segmentation
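
The fourfold memory reduction quoted above is what one would expect when 32-bit floating-point weights (4 bytes each) are stored as 8-bit integers (1 byte each). The following is a minimal sketch of symmetric per-tensor int8 quantization, the kind of rounding that quantization-aware training simulates in the forward pass. It is not the CGRA4ML or LMIINet code; the scheme, the function names, and the example weights are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-128, 127]."""
    # The scale maps the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`); quantization-aware training exposes the network to this rounding error during training so that accuracy holds up after deployment in fixed-point arithmetic.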

[194] Accident Anticipation via Temporal Occurrence Prediction

Tianhao Zhao,Yiyang Zou,Zihao Mao,Peilun Xiao,Yulin Huang,Hongda Yang,Yuxuan Li,Qun Li,Guobin Wu,Yutian Lin

Main category: cs.CV

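TL;DR: Proposes a new accident anticipation paradigm that, supervised by precisely annotated accident timestamps, directly estimates accident scores at multiple future time steps, and is more accurate and reliable than conventional frame-level risk scoring.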

Details Motivation: Existing methods rely on ambiguous binary supervision (labeling every frame of an accident video as positive), which leads to unreliable learning and frequent false alarms and fails to reflect that risk varies continuously over time. Method: A snippet-level encoder jointly models spatial and temporal dynamics, and a Transformer-based temporal decoder uses dedicated temporal queries to predict accident scores for multiple future time steps simultaneously. Result: Under realistic conditions, the method improves both recall and Time-to-Accident (TTA) across different false alarm rate constraints. Conclusion: With finer-grained supervision and multi-step prediction, the method makes accident anticipation more accurate and practical. Abstract: Accident anticipation aims to predict potential collisions in an online manner, enabling timely alerts to enhance road safety. Existing methods typically predict frame-level risk scores as indicators of hazard. However, these approaches rely on ambiguous binary supervision (labeling all frames in accident videos as positive) despite the fact that risk varies continuously over time, leading to unreliable learning and false alarms. To address this, we propose a novel paradigm that shifts the prediction target from current-frame risk scoring to directly estimating accident scores at multiple future time steps (e.g., 0.1s-2.0s ahead), leveraging precisely annotated accident timestamps as supervision. Our method employs a snippet-level encoder to jointly model spatial and temporal dynamics, and a Transformer-based temporal decoder that predicts accident scores for all future horizons simultaneously using dedicated temporal queries. Furthermore, we introduce a refined evaluation protocol that reports Time-to-Accident (TTA) and recall (evaluated at multiple pre-accident intervals (0.5s, 1.0s, and 1.5s)) only when the false alarm rate (FAR) remains within an acceptable range, ensuring practical relevance. Experiments show that our method achieves superior performance in both recall and TTA under realistic FAR constraints.
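
The shift from frame-level labels to multi-horizon targets can be sketched as follows. The abstract does not define the exact target, so this example assumes that the target for horizon h at time t is 1 when the annotated accident has occurred by t + h and 0 otherwise; `horizon_targets` and the sample times are hypothetical.

```python
def horizon_targets(current_time, accident_time, horizons):
    """Binary targets, one per future horizon (all times in seconds).

    accident_time is None for videos that contain no accident.
    """
    if accident_time is None:
        return [0] * len(horizons)
    # Assumed rule: positive once the look-ahead window reaches the accident.
    return [1 if current_time + h >= accident_time else 0 for h in horizons]

horizons = [0.5, 1.0, 1.5, 2.0]
# Accident annotated at 4.2 s, model currently observing t = 3.0 s.
targets = horizon_targets(3.0, 4.2, horizons)
# A normal driving video yields all-zero targets.
negative = horizon_targets(3.0, None, horizons)
```

Unlike marking every frame of an accident video as positive, these targets change as the accident approaches, so early frames far from the collision are not forced to predict imminent danger.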

[195] GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification

Qiao Li,Jie Li,Yukang Zhang,Lei Tan,Jing Chen,Jiayi Ji

Main category: cs.CV

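TL;DR: Proposes GSAlign, a geometric and semantic alignment network for aerial-ground person re-identification (AG-ReID). A learnable thin plate spline module and a dynamic alignment module address extreme viewpoint differences, occlusion, and spatial misalignment, and the method significantly outperforms existing approaches on the CARGO dataset.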

Details Motivation: Existing AG-ReID methods cope poorly with severe pose variation and spatial misalignment and are strongly affected by viewpoint differences and occlusion, so a more effective alignment mechanism is needed. Method: GSAlign contains a Learnable Thin Plate Spline (LTPS) module that adaptively corrects geometric deformation and a Dynamic Alignment Module (DAM) that produces visibility-aware semantic masks to reduce the effect of occlusion, achieving joint geometric and semantic alignment. Result: Evaluated on CARGO under four protocols, it improves mAP by 18.8% and Rank-1 accuracy by 16.8% over the previous best method in the aerial-ground setting. Conclusion: GSAlign effectively addresses geometric distortion and semantic misalignment in AG-ReID and significantly improves cross-view person matching. Abstract: Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unmanned aerial vehicles (UAVs) and ground-based surveillance cameras. The task poses significant challenges due to extreme viewpoint discrepancies, occlusions, and domain gaps between aerial and ground imagery. While prior works have made progress by learning cross-view representations, they remain limited in handling severe pose variations and spatial misalignment. To address these issues, we propose a Geometric and Semantic Alignment Network (GSAlign) tailored for AG-ReID. GSAlign introduces two key components to jointly tackle geometric distortion and semantic misalignment in aerial-ground matching: a Learnable Thin Plate Spline (LTPS) Module and a Dynamic Alignment Module (DAM). The LTPS module adaptively warps pedestrian features based on a set of learned keypoints, effectively compensating for geometric variations caused by extreme viewpoint changes. In parallel, the DAM estimates visibility-aware representation masks that highlight visible body regions at the semantic level, thereby alleviating the negative impact of occlusions and partial observations in cross-view correspondence. A comprehensive evaluation on CARGO with four matching protocols demonstrates the effectiveness of GSAlign, achieving significant improvements of +18.8% in mAP and +16.8% in Rank-1 accuracy over previous state-of-the-art methods on the aerial-ground setting. The code is available at: https://github.com/stone96123/GSAlign.

[196] WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models

Issa Sugiura,Shuhei Kurita,Yusuke Oda,Daisuke Kawahara,Yasuo Okabe,Naoaki Okazaki

Main category: cs.CV

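TL;DR: This paper introduces WAON, a large-scale, high-quality Japanese image-text pair dataset with about 155 million examples, and WAON-Bench, a benchmark for Japanese cultural image classification. Experiments show that SigLIP2 fine-tuned on WAON outperforms the same model fine-tuned on existing data and reaches state-of-the-art results on several Japanese cultural benchmarks.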

Details Motivation: Building high-performing vision-language models requires large-scale, high-quality multilingual image-text datasets, especially for lower-resource languages such as Japanese. Existing datasets fall short in quality and suitability, so the authors build WAON specifically for Japanese. Method: Collects about 155 million Japanese image-text pairs from Common Crawl and builds WAON through filtering and deduplication; constructs WAON-Bench, a manually curated Japanese cultural image classification benchmark with 374 classes; and compares SigLIP2 fine-tuned on WAON against SigLIP2 fine-tuned on the Japanese subset of ReLAION. Result: The model fine-tuned on WAON outperforms the one trained on the ReLAION Japanese subset on WAON-Bench and the other Japanese cultural benchmarks, with higher accuracy and training efficiency, and reaches state-of-the-art performance on several tasks. Conclusion: WAON is a large-scale, high-quality Japanese image-text dataset that improves vision-language models on Japanese cultural understanding tasks. Abstract: Large-scale and high-quality image-text pair datasets play an important role in developing high-performing Vision-Language Models (VLMs). In this work, we introduce WAON, a large-scale and high-quality Japanese image-text pair dataset containing approximately 155 million examples, collected from Common Crawl. Our dataset construction pipeline employs various techniques, including filtering and deduplication, which have been shown to be effective in previous studies. To evaluate its effectiveness, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification, consisting of 374 classes. To assess the effectiveness of our dataset, we conduct experiments using both WAON and the Japanese subset of ReLAION, one of the most widely used vision-language datasets. We fine-tune SigLIP2, a strong multilingual model, on both datasets. The results demonstrate that WAON enhances model performance on WAON-Bench more efficiently than ReLAION and achieves higher accuracy across all evaluated benchmarks. Furthermore, the model fine-tuned on WAON achieves state-of-the-art performance on several Japanese cultural benchmarks. We release our dataset, model, and code at https://speed1313.github.io/WAON.

[197] CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

Tianhui Liu,Hetian Pang,Xin Zhang,Jie Feng,Yong Li,Pan Hui

Main category: cs.CV

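TL;DR: This paper proposes CityRiSE, a reinforcement learning framework that improves the accuracy and interpretability of large vision-language models for urban socio-economic sensing.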

Details Motivation: Existing large vision-language models are neither accurate nor interpretable enough when predicting socio-economic status from visual data. Method: Proposes the CityRiSE framework, which uses pure reinforcement learning with carefully curated multi-modal data and a verifiable reward design to steer the model toward semantically meaningful visual cues, enabling structured, goal-oriented reasoning. Result: Experiments show that CityRiSE significantly outperforms existing baselines in prediction accuracy and in generalization across cities and indicators, particularly on unseen cities and unseen indicators. Conclusion: Combining reinforcement learning with large vision-language models is promising for interpretable, generalist urban socio-economic sensing. Abstract: Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.

[198] GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Jing Wang,Jiajun Liang,Jie Liu,Henglin Liu,Gongye Liu,Jun Zheng,Wanyuan Pang,Ao Ma,Zhenyu Xie,Xintao Wang,Meng Wang,Pengfei Wan,Xiaodan Liang

Main category: cs.CV

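TL;DR: This paper proposes GRPO-Guard, which addresses the shift in the importance-ratio distribution in GRPO-based reinforcement learning. Ratio normalization and gradient reweighting mitigate implicit over-optimization and improve generation quality and task alignment.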

Details Motivation: In GRPO frameworks the importance-ratio distribution has a mean below 1 and inconsistent variance across timesteps, so positive-advantage samples never reach the clipped region; PPO clipping then fails and the policy model becomes implicitly over-optimized. Method: GRPO-Guard introduces ratio normalization to restore a balanced, step-consistent importance ratio, and a gradient reweighting strategy that equalizes policy gradients across noise conditions to prevent excessive updates from particular timestep regions. Result: Experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks show that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality, without relying on heavy KL regularization. Conclusion: By normalizing importance ratios and regulating gradient updates, GRPO-Guard provides a simple and effective mechanism for stabilizing GRPO training and mitigating implicit over-optimization, making the learned policy more practical. Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions.
Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
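
Why a left-shifted ratio distribution disables the upper clip can be seen in a few lines. This is a minimal sketch, not the GRPO-Guard algorithm: the abstract does not give the normalization formula, so centring the log-ratios at zero is an assumption, and the function names, the clip range, and the sample values are hypothetical.

```python
import math

def centre_log_ratios(log_ratios):
    """Shift log importance ratios so their mean is 0, i.e. ratios centre on 1."""
    mean = sum(log_ratios) / len(log_ratios)
    return [x - mean for x in log_ratios]

def clipped_surrogate(ratio, advantage, eps=0.1):
    """Standard PPO clipped objective for a single sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def upper_clip_active(log_ratios, eps=0.1):
    """Count samples whose ratio exceeds 1 + eps, where positive updates get clipped."""
    return sum(1 for x in log_ratios if math.exp(x) > 1.0 + eps)

# Left-shifted log-ratios: every ratio is below 1, so the upper clip never engages.
raw = [-0.30, -0.25, -0.20, -0.05]
centred = centre_log_ratios(raw)
before = upper_clip_active(raw)
after = upper_clip_active(centred)
```

With the raw ratios, no positive-advantage sample is ever constrained; after centring, the sample that moved furthest from the old policy exceeds `1 + eps` and its update is capped by `clipped_surrogate`.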

[199] Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning

Ali Javidani,Babak Nadjar Araabi,Mohammad Amin Sadeghi

Main category: cs.CV

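TL;DR: This paper proposes a new method that integrates graph theory into self-supervised representation learning. It builds KNN graphs to capture inter-instance relationships and uses graph neural networks for multi-hop message passing to integrate broader context, outperforming existing methods on several benchmark datasets.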

Details Motivation: Traditional self-supervised learning methods focus on intra-instance variation and overlook relationships between instances; this work introduces graph structure to model those cross-instance associations. Method: Builds k-nearest neighbor (KNN) graphs in both the teacher and student streams, with nodes representing samples and their latent representations and edges encoding similarity between instances; after pretraining, graph neural networks propagate messages over multiple hops to refine the representations. Result: Accuracy improves by 7.3%, 3.2%, and 1.0% on CIFAR-10, ImageNet-100, and ImageNet-1K, respectively, supporting the effectiveness of the proposed graph mechanism. Conclusion: Using KNN graphs and graph neural networks for representation learning and refinement captures inter-instance relationships and improves self-supervised learning performance. Abstract: This paper introduces a novel approach that integrates graph theory into self-supervised representation learning. Traditional methods focus on intra-instance variations generated by applying augmentations. However, they often overlook important inter-instance relationships. While our method retains the intra-instance property, it further captures inter-instance relationships by constructing k-nearest neighbor (KNN) graphs for both teacher and student streams during pretraining. In these graphs, nodes represent samples along with their latent representations. Edges encode the similarity between instances. Following pretraining, a representation refinement phase is performed. In this phase, Graph Neural Networks (GNNs) propagate messages not only among immediate neighbors but also across multiple hops, thereby enabling broader contextual integration. Experimental results on CIFAR-10, ImageNet-100, and ImageNet-1K demonstrate accuracy improvements of 7.3%, 3.2%, and 1.0%, respectively, over state-of-the-art methods. These results highlight the effectiveness of the proposed graph-based mechanism. The code is publicly available at https://github.com/alijavidani/SSL-GraphNNCLR.
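
The graph construction and multi-hop refinement described above can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: cosine similarity is assumed as the edge criterion, and a parameter-free mean aggregation stands in for the learned GNN layers; all names and the toy embeddings are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn_graph(embeddings, k):
    """Map each node index to the indices of its k most similar other nodes."""
    graph = {}
    for i, e in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: cosine(e, embeddings[j]), reverse=True)
        graph[i] = others[:k]
    return graph

def propagate(embeddings, graph, hops=2):
    """Average each node with its neighbours; repeating this reaches multi-hop context."""
    h = [list(e) for e in embeddings]
    for _ in range(hops):
        h = [[(h[i][d] + sum(h[j][d] for j in graph[i])) / (1 + len(graph[i]))
              for d in range(len(h[i]))]
             for i in range(len(h))]
    return h

embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
graph = knn_graph(embeddings, k=1)
refined = propagate(embeddings, graph, hops=1)
```

Each extra hop lets a node absorb information from neighbours of neighbours, which is the broader contextual integration the abstract attributes to the refinement phase.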

[200] Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

Xu Zhang,Ruijie Quan,Wenguan Wang,Yi Yang

Main category: cs.CV

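TL;DR: Proposes MindHier, an fMRI-to-image reconstruction framework based on scale-wise autoregression. Through hierarchical neural encoding and a hierarchy-to-hierarchy alignment mechanism, it outperforms diffusion models in semantic fidelity, inference speed, and determinism.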

Details Motivation: Existing diffusion methods use a fixed high-level fMRI embedding as generation guidance, ignoring the hierarchical structure of neural information and the differing demands of each stage of image reconstruction, which causes semantic loss and inefficiency. Method: The MindHier framework has three core components: a hierarchical fMRI encoder that extracts multi-level neural representations, a hierarchy-to-hierarchy alignment strategy that enforces layer-wise correspondence with CLIP features, and a scale-aware coarse-to-fine neural guidance strategy that injects the embeddings at matching scales. Result: Experiments on the NSD dataset show that, compared with diffusion baselines, MindHier achieves higher semantic fidelity, 4.67x faster inference, and more deterministic results. Conclusion: By mimicking the coarse-to-fine process of human visual perception, MindHier offers an efficient and cognitively aligned paradigm for fMRI-to-image reconstruction. Abstract: Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high-level embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively-aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67x faster inference, and more deterministic results than the diffusion-based baselines.

[201] GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation

Phillip Mueller,Talip Uenlue,Sebastian Schmidt,Marcel Kollovieh,Jiajie Fan,Stephan Guennemann,Lars Mikelsons

Main category: cs.CV

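TL;DR: Proposes GeoDiffusion, a training-free framework for precise geometric control of 3D features in image generation, achieving efficient and accurate editing through 3D geometric priors and style transfer.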

Details Motivation: Traditional 3D editing is time-consuming and requires specialized skills, while existing image-based generative methods lack accuracy in geometric conditioning. Method: A class-specific 3D object serves as a geometric prior that defines 3D keypoints and parametric correlations; a rendered image of a reference 3D object ensures viewpoint consistency, and style transfer meets appearance requirements. At the core is GeoDrag, which improves the accuracy and speed of drag-based image editing on geometry guidance tasks. Result: Experiments show that GeoDiffusion enables precise geometric modifications across various iterative design workflows and adapts well to general instructions on DragBench. Conclusion: GeoDiffusion is an efficient, training-free framework for geometric control in image generation that supports fast, user-controllable, 3D-aware image editing while maintaining high accuracy. Abstract: Precise geometric control in image generation is essential for engineering & product design and creative industries to control 3D object features accurately in image space. Traditional 3D editing approaches are time-consuming and demand specialized skills, while current image-based generative methods lack accuracy in geometric conditioning. To address these challenges, we propose GeoDiffusion, a training-free framework for accurate and efficient geometric conditioning of 3D features in image generation. GeoDiffusion employs a class-specific 3D object as a geometric prior to define keypoints and parametric correlations in 3D space. We ensure viewpoint consistency through a rendered image of a reference 3D object, followed by style transfer to meet user-defined appearance specifications. At the core of our framework is GeoDrag, improving accuracy and speed of drag-based image editing on geometry guidance tasks and general instructions on DragBench. Our results demonstrate that GeoDiffusion enables precise geometric modifications across various iterative design workflows.

[202] EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model

Changhao Zhang,Matthew J. Clarkson,Mobarak I. Hoque

Main category: cs.CV

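TL;DR: This paper proposes a self-supervised monocular depth estimation method that incorporates intrinsic parameter estimation to improve 3D reconstruction of endoscopic surgery scenes. It adapts the Depth Anything V2 model to jointly predict depth, pose, and intrinsics, and outperforms existing methods on public datasets.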

Details Motivation: In real surgical settings, sterility requirements and the use of zooming and rotating endoscopes make conventional intrinsic calibration difficult, which limits the accuracy of existing 3D reconstruction methods; an adaptive method that estimates intrinsics online is therefore needed. Method: Intrinsic parameter estimation is integrated into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 model, with an attention-based pose network and a Weight-Decomposed Low-Rank Adaptation (DoRA) strategy, enabling joint prediction of depth, pose, and intrinsics with efficient fine-tuning. Result: The method is validated on the SCARED and C3VD datasets, where depth estimation and 3D reconstruction performance exceed recent self-supervised methods, with more accurate intrinsic estimation and reconstruction. Conclusion: The method addresses the difficulty of calibrating intrinsics in endoscopic surgery, improves the accuracy and practicality of self-supervised 3D reconstruction, and shows potential for clinical use. Abstract: 3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surgery. A critical yet challenging step in this process is the accurate estimation of the endoscope's intrinsic parameters. In real surgical settings, intrinsic calibration is hindered by sterility constraints and the use of specialized endoscopes with continuous zoom and telescope rotation. Most existing methods for endoscopic 3D reconstruction do not estimate intrinsic parameters, limiting their effectiveness for accurate and reliable reconstruction. In this paper, we integrate intrinsic parameter estimation into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 (DA2) model for joint depth, pose, and intrinsics prediction. We introduce an attention-based pose network and a Weight-Decomposed Low-Rank Adaptation (DoRA) strategy for efficient fine-tuning of DA2. Our method is validated on the SCARED and C3VD public datasets, demonstrating superior performance compared to recent state-of-the-art approaches in self-supervised monocular depth estimation and 3D reconstruction. Code and model weights can be found in the project repository: https://github.com/MOYF-beta/EndoSfM3D.

[203] T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models

Jindong Yang,Han Fang,Weiming Zhang,Nenghai Yu,Kejiang Chen

Main category: cs.CV

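TL;DR: Proposes T2SMark, a two-stage watermarking method based on tail-truncated sampling that achieves a good balance between watermark robustness and generation diversity in diffusion models.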

Details Motivation: Existing noise-as-watermark methods struggle to achieve both robustness and generation diversity, which limits practical use. Method: Tail-Truncated Sampling (TTS) embeds watermark information in the reliable tail regions of the noise distribution and samples the central region randomly to preserve the distribution; a two-stage framework that introduces a session key enhances diversity. Result: The method is validated on diffusion models with U-Net and DiT architectures; experiments show that T2SMark performs well in both robustness and diversity. Conclusion: T2SMark effectively addresses the trade-off between watermark robustness and generation diversity in diffusion models and has good practical prospects. Abstract: Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking for diffusion models, particularly Noise-as-Watermark (NaW) methods, encodes a watermark as a specific standard Gaussian noise vector for image generation, embedding the information seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity. Our code is available at https://github.com/0xD009/T2SMark.
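The tail-versus-center idea can be made concrete with a small sketch. The scheme below is one possible reading of "embed bits only in the tails, sample the center randomly", not the paper's algorithm: the threshold `tau`, the one-bit-per-noise-element layout, and the names `tts_sample` and `tts_decode` are assumptions for illustration, and the two-stage session-key encryption is left out entirely.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian N(0, 1)

def tts_sample(bits, tau=1.0, rng=None):
    """Draw one noise value per bit (hypothetical scheme, not the paper's).

    With probability P(|x| <= tau) the value falls in the central zone and
    carries no information; otherwise it falls in a tail whose sign is the bit.
    """
    rng = rng or random.Random()
    lo, hi = STD.cdf(-tau), STD.cdf(tau)
    noise = []
    for bit in bits:
        if rng.random() < hi - lo:
            # Central zone: N(0, 1) truncated to (-tau, tau), independent of the bit.
            u = lo + (hi - lo) * rng.random()
            noise.append(STD.inv_cdf(u))
        else:
            # Tail zone: magnitude from the upper tail, sign chosen by the bit.
            u = min(hi + (1.0 - hi) * rng.random(), 1.0 - 1e-12)
            magnitude = STD.inv_cdf(u)
            noise.append(magnitude if bit else -magnitude)
    return noise

def tts_decode(noise, tau=1.0):
    """Read bits from tail values only; central values are erasures (None)."""
    return [None if abs(x) <= tau else int(x > 0) for x in noise]

if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    noise = tts_sample(bits, tau=1.0, rng=random.Random(0))
    print(tts_decode(noise, tau=1.0))
```

If the embedded bits look uniformly random (for example after encryption), each value drawn this way is still marginally standard Gaussian, which is the sense in which the latent distribution is preserved. A real detector would work on noise recovered by inverting the generation process, so it would also have to tolerate perturbations that this sketch ignores.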

[204] Efficient Large-Deformation Medical Image Registration via Recurrent Dynamic Correlation

Tianran Li,Marius Staring,Yuchuan Qiao

Main category: cs.CV

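TL;DR: Proposes a recurrent correlation-based framework for deformable image registration that handles large deformations efficiently by dynamically adjusting the matching region, achieving a good trade-off between accuracy and computational cost.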

Details Motivation: Existing deep learning-based registration methods are inefficient on large deformations, and local matching struggles to capture long-range correspondences, so a method is needed that models large deformations effectively at low computational cost. Method: A Recurrent Correlation-based framework with a lightweight recurrent update module dynamically relocates the voxel-to-region matching position; it iteratively searches for the best correspondence through local correlation matching and offset estimation, and decouples motion-related and texture features to reduce semantic redundancy. Result: Validated on brain MRI and abdominal CT datasets, the method matches or exceeds existing methods both with and without affine pre-registration, while using only 9.5% of the FLOPs of RDP and running 96% faster. Conclusion: Through its dynamic search strategy and feature decoupling, the method markedly improves the efficiency and accuracy of large-deformation registration and shows good potential for clinical use. Abstract: Deformable image registration estimates voxel-wise correspondences between images through spatial transformations, and plays a key role in medical imaging. While deep learning methods have significantly reduced runtime, efficiently handling large deformations remains a challenging task. Convolutional networks aggregate local features but lack direct modeling of voxel correspondences, prompting recent works to explore explicit feature matching. Among them, voxel-to-region matching is more efficient for direct correspondence modeling by computing local correlation features within neighbourhoods, while region-to-region matching incurs higher redundancy due to excessive correlation pairs across large regions. However, the inherent locality of voxel-to-region matching hinders the capture of long-range correspondences required for large deformations. To address this, we propose a Recurrent Correlation-based framework that dynamically relocates the matching region toward more promising positions. At each step, local matching is performed with low cost, and the estimated offset guides the next search region, supporting efficient convergence toward large deformations. In addition, we use a lightweight recurrent update module with memory capacity and decouple motion-related and texture features to suppress semantic redundancy. We conduct extensive experiments on brain MRI and abdominal CT datasets under two settings: with and without affine pre-registration. Results show that our method exhibits a strong accuracy-computation trade-off, surpassing or matching the state-of-the-art performance. For example, it achieves comparable performance on the non-affine OASIS dataset, while using only 9.5% of the FLOPs and running 96% faster than RDP, a representative high-performing method.

[205] A Fully Interpretable Statistical Approach for Roadside LiDAR Background Subtraction

Aitor Iglesias,Nerea Aranjuelo,Patricia Javierre,Ainhoa Menendez,Ignacio Arganda-Carreras,Marcos Nieto

Main category: cs.CV

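TL;DR: Proposes an interpretable and flexible statistical method for background subtraction in roadside LiDAR data, improving infrastructure-based perception for automated driving.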

Details Motivation: To improve infrastructure-based perception in automated driving, foreground and background must be separated effectively in LiDAR point clouds, especially with limited resources and heterogeneous sensors. Method: A Gaussian distribution grid (GDG) models the spatial statistics of background-only scans, and a filtering algorithm uses this representation to classify LiDAR points as foreground or background. Result: On the public RCooper dataset the method outperforms existing state-of-the-art methods in accuracy and flexibility, works well even with very little background data, and runs efficiently on low-resource hardware. Conclusion: The method is interpretable, adaptable, and efficient, supports multiple LiDAR types and configurations, and suits large-scale real-world deployment. Abstract: We present a fully interpretable and flexible statistical method for background subtraction in roadside LiDAR data, aimed at enhancing infrastructure-based perception in automated driving. Our approach introduces both a Gaussian distribution grid (GDG), which models the spatial statistics of the background using background-only scans, and a filtering algorithm that uses this representation to classify LiDAR points as foreground or background. The method supports diverse LiDAR types, including multiline 360 degree and micro-electro-mechanical systems (MEMS) sensors, and adapts to various configurations. Evaluated on the publicly available RCooper dataset, it outperforms state-of-the-art techniques in accuracy and flexibility, even with minimal background data. Its efficient implementation ensures reliable performance on low-resource hardware, enabling scalable real-world deployment.
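To give a feel for a per-cell Gaussian background model, the sketch below fits a mean and standard deviation of range for each angular cell from background-only scans, then flags points that deviate by more than `k` standard deviations. The abstract does not specify the grid layout or the filter rule, so the azimuth/elevation binning, the `k`-sigma threshold, the `min_sigma` floor, and all function names are assumptions for this example.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

ORIGIN = (0.0, 0.0, 0.0)  # sensor position (assumed at the origin)

def cell_of(point, az_res=1.0, el_res=1.0):
    """Map an (x, y, z) point to an (azimuth bin, elevation bin) cell, in degrees."""
    x, y, z = point
    az = math.degrees(math.atan2(y, x)) % 360.0
    el = math.degrees(math.atan2(z, math.hypot(x, y)))
    az_bins = round(360.0 / az_res)
    return (int(az // az_res) % az_bins, int((el + 90.0) // el_res))

def fit_grid(background_scans):
    """Fit a per-cell Gaussian (mean, std) over range from background-only scans."""
    ranges = defaultdict(list)
    for scan in background_scans:
        for p in scan:
            ranges[cell_of(p)].append(math.dist(ORIGIN, p))
    return {cell: (mean(r), pstdev(r)) for cell, r in ranges.items()}

def is_foreground(point, grid, k=3.0, min_sigma=0.05):
    """Hypothetical filter: foreground if the range is > k sigma from the cell mean."""
    cell = cell_of(point)
    if cell not in grid:
        return True  # assumption: cells never seen in the background count as foreground
    mu, sigma = grid[cell]
    return abs(math.dist(ORIGIN, point) - mu) > k * max(sigma, min_sigma)

if __name__ == "__main__":
    scans = [[(10.0, 0.0, 0.0)], [(10.02, 0.0, 0.0)], [(9.98, 0.0, 0.0)]]
    grid = fit_grid(scans)
    print(is_foreground((5.0, 0.0, 0.0), grid))    # a return well in front of the wall
    print(is_foreground((10.01, 0.0, 0.0), grid))  # a return on the wall itself
```

Every decision traces back to two numbers per cell, which is one way to read the "fully interpretable" claim; the floor on sigma keeps cells with very few background samples from becoming oversensitive.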

[206] Top-Down Semantic Refinement for Image Captioning

Jusheng Zhang,Kaitong Cai,Jing Yang,Jian Wang,Chengpei Tang,Keze Wang

Main category: cs.CV

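TL;DR: Proposes Top-Down Semantic Refinement (TDSR), a framework that models image captioning as a goal-oriented hierarchical refinement planning problem and pairs it with an efficient Monte Carlo Tree Search algorithm, reducing the number of calls to the large vision-language model while significantly improving fine-grained description, compositional generalization, and hallucination suppression.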

Details Motivation: Large vision-language models make myopic decisions in single-step generation and struggle to balance global coherence with rich detail when describing complex scenes. Method: Image captioning is recast as a hierarchical planning problem based on a Markov Decision Process (MDP), with an efficient Monte Carlo Tree Search (MCTS) algorithm designed for VLMs that introduces visual-guided parallel expansion, a lightweight value network, and adaptive early stopping. Result: The method is validated on several benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE; as a plug-and-play module it significantly improves existing VLMs such as LLaVA-1.5 and Qwen2.5-VL, reaching state-of-the-art or near state-of-the-art results. Conclusion: Through efficient hierarchical planning, TDSR addresses the myopia of VLMs in image captioning, substantially reducing computational overhead while maintaining generation quality, and is general and practical. Abstract: Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.

[207] 3D Roadway Scene Object Detection with LIDARs in Snowfall Conditions

Ghazal Farhani,Taufiq Rahman,Syed Mostaquim Ali,Andrew Liu,Mohamed Zaki,Dominique Charlebois,Benoit Anctil

Main category: cs.CV

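TL;DR: This study addresses the performance degradation of automotive-grade LiDAR in snowy conditions, building a physics-based model to analyze signal attenuation and using synthetic data to evaluate the impact on object detection models.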

Details Motivation: LiDAR performance drops significantly in adverse weather, but the extent of signal degradation under different weather conditions has not been adequately quantified, which limits the reliability of autonomous driving systems in real environments. Method: A physics-based model of LiDAR signal attenuation in snow is proposed; the effect of different snowfall rates on the signal is studied, and the model is used to transform clear-weather data into simulated snowy data, which is compared with measured data to validate the model. Result: The study shows how snow particles near the LiDAR source act as efficient reflectors, generates synthetic data reflecting different snowfall intensities, and uses it to evaluate the performance degradation of a pre-trained object detection model in snow. Conclusion: The proposed physics-based model can simulate LiDAR signal attenuation in snow, and the synthetic data can be used for robustness evaluation and training of perception models in adverse weather. Abstract: Because the 3D structure of a roadway environment can be characterized directly by Light Detection and Ranging (LiDAR) sensors, they can be used to obtain exceptional situational awareness for assistive and autonomous driving systems. Although LiDARs demonstrate good performance in clean and clear weather conditions, their performance significantly deteriorates in adverse weather conditions such as those involving atmospheric precipitation. This may render perception capabilities of autonomous systems that use LiDAR data in learning based models to perform object detection and ranging ineffective. While efforts have been made to enhance the accuracy of these models, the extent of signal degradation under various weather conditions remains largely unquantified. In this study, we focus on the performance of an automotive grade LiDAR in snowy conditions in order to develop a physics-based model that examines failure modes of a LiDAR sensor. Specifically, we investigated how the LiDAR signal attenuates with different snowfall rates and how snow particles near the source serve as small but efficient reflectors. Utilizing our model, we transform data from clear conditions to simulate snowy scenarios, enabling a comparison of our synthetic data with actual snowy conditions. Furthermore, we employ this synthetic data, representative of different snowfall rates, to explore the impact on a pre-trained object detection model, assessing its performance under varying levels of snowfall.

[208] Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Vijay Veerabadran,Fanyi Xiao,Nitin Kamra,Pedro Matias,Joy Chen,Caley Drooff,Brett D Roads,Riley Williams,Ethan Henderson,Xuanyi Zhao,Kevin Carlberg,Joseph Tighe,Karl Ridgeway

Main category: cs.CV

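TL;DR: This paper presents WAGIBench, a benchmark for evaluating how well vision-language models infer user goals in multimodal contexts. The authors collected a large dataset of 29 hours, 348 participants, and 3,477 recordings, and find that the best current model still lags well behind humans, producing relevant goals only 55% of the time, which indicates the problem is far from solved.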

Details Motivation: Reducing the interaction burden between users and wearable assistive agents requires inferring the user's underlying goal automatically. Existing work lacks a suitable benchmark for this goal inference ability, so a high-quality benchmark is needed. Method: A new benchmark, WAGIBench, is built with a large multimodal dataset (visual, audio, digital, and longitudinal context); several modern vision-language models are evaluated generatively, and modality ablations analyze how each input modality affects performance. Result: Humans reach 93% accuracy on the multiple-choice task versus 84% for the best vision-language model; only 55% of generated goals are relevant; larger models do better but a clear gap remains; models benefit from relevant modalities and are little affected by irrelevant ones. Conclusion: Current vision-language models are still far from practical usefulness on goal inference; larger models perform better but leave substantial room for improvement, and WAGIBench provides an evaluation platform for future research. Abstract: There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.

[209] SemiETPicker: Fast and Label-Efficient Particle Picking for CryoET Tomography Using Semi-Supervised Learning

Linhan Wang,Jianwen Dou,Wang Li,Shengkun Wang,Zhiwu Xie,Chang-Tien Lu,Yinlin Chen

Main category: cs.CV

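TL;DR: Proposes a fast, label-efficient semi-supervised framework for particle picking in cryogenic electron tomography (CryoET) that significantly improves performance under sparse annotation.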

Details Motivation: Particle picking is the main bottleneck in CryoET; reliance on time-consuming manual annotation leaves large amounts of unlabeled data underused. Method: An end-to-end heatmap-supervised detection model is combined with a teacher-student co-training mechanism, together with multi-view pseudo-labeling and a CryoET-specific DropBlock augmentation strategy. Result: On the large-scale CZII dataset, F1 improves by 10% over fully supervised baselines. Conclusion: The proposed semi-supervised framework makes effective use of unlabeled CryoET data and significantly improves particle picking performance. Abstract: Cryogenic Electron Tomography (CryoET) combined with sub-volume averaging (SVA) is the only imaging modality capable of resolving protein structures inside cells at molecular resolution. Particle picking, the task of localizing and classifying target proteins in 3D CryoET volumes, remains the main bottleneck. Due to the reliance on time-consuming manual labels, the vast reserve of unlabeled tomograms remains underutilized. In this work, we present a fast, label-efficient semi-supervised framework that exploits this untapped data. Our framework consists of two components: (i) an end-to-end heatmap-supervised detection model inspired by keypoint detection, and (ii) a teacher-student co-training mechanism that enhances performance under sparse labeling conditions. Furthermore, we introduce multi-view pseudo-labeling and a CryoET-specific DropBlock augmentation strategy to further boost performance. Extensive evaluations on the large-scale CZII dataset show that our approach improves F1 by 10% over supervised baselines, underscoring the promise of semi-supervised learning for leveraging unlabeled CryoET data.

[210] DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss

Jing Yang,Yufeng Yang

Main category: cs.CV

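TL;DR: This paper proposes DynaPose4D, a method that combines 4D Gaussian Splatting with category-agnostic pose estimation to generate high-quality 4D dynamic content from a single static image, performing well in motion coherence, consistency, and fluidity.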

Details Motivation: Generating high-quality 4D dynamic content from a single static image remains challenging; traditional methods are limited in modeling temporal dependencies and dynamic geometry changes, and perform poorly when the camera viewpoint changes. Method: 4D Gaussian Splatting (4DGS) is combined with Category-Agnostic Pose Estimation (CAPE): 3D Gaussian Splatting builds a 3D model, multi-view pose keypoints are predicted from one-shot support in a chosen view, and supervisory signals improve motion consistency. Result: Experiments show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation. Conclusion: The DynaPose4D framework addresses 4D dynamic content generation from a single image and shows potential for applications in computer vision and animation production. Abstract: Recent advancements in 2D and 3D generative models have expanded the capabilities of computer vision. However, generating high-quality 4D dynamic content from a single static image remains a significant challenge. Traditional methods have limitations in modeling temporal dependencies and accurately capturing dynamic geometry changes, especially when considering variations in camera perspective. To address this issue, we propose DynaPose4D, an innovative solution that integrates 4D Gaussian Splatting (4DGS) techniques with Category-Agnostic Pose Estimation (CAPE) technology. This framework uses 3D Gaussian Splatting to construct a 3D model from single images, then predicts multi-view pose keypoints based on one-shot support from a chosen view, leveraging supervisory signals to enhance motion consistency. Experimental results show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation. These findings not only validate the efficacy of the DynaPose4D framework but also indicate its potential applications in the domains of computer vision and animation production.

[211] Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity

Seonghoon Yu,Dongjun Nam,Dina Katabi,Jeany Son

Main category: cs.CV

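TL;DR: Proposes an efficient knowledge distillation method that attaches multiple branches to a single teacher model to generate diverse multi-view knowledge, and introduces two angular diversity objectives to strengthen semantic variation, improving student performance.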

Details Motivation: Multi-teacher knowledge distillation improves performance but is computationally expensive; the aim is to obtain diverse knowledge from a single teacher at lower cost. Method: Multiple branches are added to a single teacher to produce multi-view outputs; a constrained inter-angle diversity loss and an intra-angle diversity loss ensure semantic variation among the outputs, and this diversified knowledge is distilled into the student together with the original teacher knowledge. Result: Experiments show the method outperforms an existing knowledge augmentation method, is compatible with multiple distillation frameworks, and consistently improves generalization across configurations; theoretical analysis shows the objectives increase diversity among ensemble members and lower the upper bound of the ensemble loss. Conclusion: The proposed angular-diversity knowledge augmentation improves distillation performance at low computational cost and is general and widely applicable. Abstract: Knowledge Distillation (KD) aims to train a lightweight student model by transferring knowledge from a large, high-capacity teacher. Recent studies have shown that leveraging diverse teacher perspectives can significantly improve distillation performance; however, achieving such diversity typically requires multiple teacher networks, leading to high computational costs. In this work, we propose a novel cost-efficient knowledge augmentation method for KD that generates diverse multi-views by attaching multiple branches to a single teacher. To ensure meaningful semantic variation across multi-views, we introduce two angular diversity objectives: 1) constrained inter-angle diversify loss, which maximizes angles between augmented views while preserving proximity to the original teacher output, and 2) intra-angle diversify loss, which encourages an even distribution of views around the original output. The ensembled knowledge from these angularly diverse views, along with the original teacher, is distilled into the student. We further theoretically demonstrate that our objectives increase the diversity among ensemble members and thereby reduce the upper bound of the ensemble's expected loss, leading to more effective distillation. Experimental results show that our method surpasses an existing knowledge augmentation method across diverse configurations. Moreover, the proposed method is compatible with other KD frameworks in a plug-and-play fashion, providing consistent improvements in generalization performance.
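The two angular objectives can be approximated with pairwise cosine similarities. The sketch below is an illustrative stand-in, not the paper's loss definitions: it measures the mean pairwise cosine between views (lower means wider angles) and the same quantity on residuals around the teacher output (lower means views spread more evenly around it). The function names are hypothetical, and the proximity constraint mentioned in the abstract is not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def inter_angle_loss(views):
    """Mean pairwise cosine between views; minimizing it widens the angles.

    Requires at least two views.
    """
    pairs = [(i, j) for i in range(len(views)) for j in range(i + 1, len(views))]
    return sum(cosine(views[i], views[j]) for i, j in pairs) / len(pairs)

def intra_angle_loss(views, teacher):
    """Same measure on residuals (view - teacher output).

    Minimizing it pushes the views to spread out around the teacher output
    instead of clustering on one side of it.
    """
    residuals = [[a - t for a, t in zip(v, teacher)] for v in views]
    return inter_angle_loss(residuals)

if __name__ == "__main__":
    teacher = [1.0, 1.0]
    views = [[2.0, 1.0], [0.0, 1.0]]
    print(inter_angle_loss(views), intra_angle_loss(views, teacher))
```

In a training setup these terms would be computed on branch outputs with an autodiff framework and combined with the distillation loss; the plain-Python version only shows what each term measures.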

[212] GateFuseNet: An Adaptive 3D Multimodal Neuroimaging Fusion Network for Parkinson's Disease Diagnosis

Rui Jin,Chen Chen,Yin Liu,Hongfu Sun,Min Zeng,Min Li,Yang Gao

Main category: cs.CV

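TL;DR: Proposes GateFuseNet, an adaptive 3D multimodal fusion network that combines QSM and T1w MRI images for Parkinson's disease diagnosis; a gated fusion module provides modality-specific attention and channel-wise gating, significantly improving diagnostic performance.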

Details Motivation: Conventional MRI modalities are relatively insensitive to Parkinson's disease pathology, whereas Quantitative Susceptibility Mapping (QSM) more sensitively reflects iron deposition in deep brain nuclei, so a method that effectively fuses QSM and T1w images is needed to improve diagnostic accuracy. Method: GateFuseNet uses a gated fusion module that learns modality-specific attention weights and channel-wise gating vectors for hierarchical feature modulation, combined with an ROI-guided strategy that enhances features in key regions. Result: The method outperforms three existing state-of-the-art approaches, reaching 85.00% accuracy and 92.06% AUC; ablations confirm the importance of ROI guidance, multimodal fusion, and fusion position; Grad-CAM visualizations show the model focuses on clinically relevant pathological regions. Conclusion: Through an effective multimodal fusion strategy, GateFuseNet improves MRI-based diagnosis of Parkinson's disease and shows potential for clinical use. Abstract: Accurate diagnosis of Parkinson's disease (PD) from MRI remains challenging due to symptom variability and pathological heterogeneity. Most existing methods rely on conventional magnitude-based MRI modalities, such as T1-weighted images (T1w), which are less sensitive to PD pathology than Quantitative Susceptibility Mapping (QSM), a phase-based MRI technique that quantifies iron deposition in deep gray matter nuclei. In this study, we propose GateFuseNet, an adaptive 3D multimodal fusion network that integrates QSM and T1w images for PD diagnosis. The core innovation lies in a gated fusion module that learns modality-specific attention weights and channel-wise gating vectors for selective feature modulation. This hierarchical gating mechanism enhances ROI-aware features while suppressing irrelevant signals. Experimental results show that our method outperforms three existing state-of-the-art approaches, achieving 85.00% accuracy and 92.06% AUC. Ablation studies further validate the contributions of ROI guidance, multimodal integration, and fusion positioning. Grad-CAM visualizations confirm the model's focus on clinically relevant pathological regions. The source codes and pretrained models can be found at https://github.com/YangGaoUQ/GateFuseNet

[213] Open Multimodal Retrieval-Augmented Factual Image Generation

Yang Tian,Fan Liu,Jingyuan Zhang,Wei Bi,Yupeng Hu,Liqiang Nie

Main category: cs.CV

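TL;DR: This paper proposes ORIG, an open multimodal retrieval-augmented framework for Factual Image Generation (FIG) that iteratively retrieves and filters multimodal evidence from the web to improve the factual consistency and visual quality of generated images.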

Details Motivation: Large Multimodal Models (LMMs) generate realistic images well, but often produce results that contradict facts when fine-grained attributes or time-sensitive content are involved. Conventional retrieval-augmented methods rely on static sources and integrate evidence shallowly, so they cannot ensure factual accuracy. Method: The ORIG framework uses an agentic open multimodal retrieval mechanism that iteratively retrieves and filters image and text evidence from the web and incrementally integrates the refined knowledge into prompts to guide image generation. The FIG-Eval benchmark, covering ten categories across perceptual, compositional, and temporal dimensions, supports systematic evaluation. Result: Experiments show that ORIG substantially outperforms strong baselines, improving the factual consistency and overall quality of generated images. Conclusion: ORIG demonstrates the potential of open multimodal retrieval for factual image generation and offers a new route to addressing factual errors in LMMs. Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.

[214] AesCrop: Aesthetic-driven Cropping Guided by Composition

Yen-Hong Wong,Lai-Kuan Wong

Main category: cs.CV

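TL;DR: This paper proposes AesCrop, a new aesthetic-driven image cropping model that combines a VMamba encoder, a novel Mamba Composition Attention Bias (MCAB), and a Transformer decoder to perform end-to-end rank-based image cropping, producing multiple crops with quality scores.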

Details Motivation: Existing hybrid methods do not incorporate photographic composition guidance, which limits visual appeal, so a method that explicitly encodes compositional cues is needed to improve cropping quality. Method: A VMamba image encoder with the MCAB module strengthens attention to compositionally salient regions, and a Transformer decoder performs end-to-end rank-based image cropping. Result: Experiments show that AesCrop outperforms current state-of-the-art methods in both quantitative metrics and visual quality, producing more diverse and more aesthetically pleasing crops. Conclusion: By explicitly integrating compositional priors, AesCrop achieves a better balance between diversity and globality and significantly improves aesthetic-driven image cropping. Abstract: Aesthetic-driven image cropping is crucial for applications like view recommendation and thumbnail generation, where visual appeal significantly impacts user engagement. A key factor in visual appeal is composition--the deliberate arrangement of elements within an image. Some methods have successfully incorporated compositional knowledge through evaluation-based and regression-based paradigms. However, evaluation-based methods lack globality while regression-based methods lack diversity. Recently, hybrid approaches that integrate both paradigms have emerged, bridging the gap between these two to achieve better diversity and globality. Notably, existing hybrid methods do not incorporate photographic composition guidance, a key attribute that defines photographic aesthetics. In this work, we introduce AesCrop, a composition-aware hybrid image-cropping model that integrates a VMamba image encoder, augmented with a novel Mamba Composition Attention Bias (MCAB) and a transformer decoder to perform end-to-end rank-based image cropping, generating multiple crops along with the corresponding quality scores. By explicitly encoding compositional cues into the attention mechanism, MCAB directs AesCrop to focus on the most compositionally salient regions. Extensive experiments demonstrate that AesCrop outperforms current state-of-the-art methods, delivering superior quantitative metrics and qualitatively more pleasing crops.

[215] Bag-of-Word-Groups (BoWG): A Robust and Efficient Loop Closure Detection Method Under Perceptual Aliasing

Xiang Fei,Tina Tian,Howie Choset,Lu Li

Main category: cs.CV

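TL;DR: This paper proposes Bag-of-Word-Groups (BoWG), a new loop closure detection method that introduces word groups and temporal consistency modeling, outperforming existing methods in precision, recall, and computational efficiency, and is particularly suited to perceptually aliased environments such as narrow pipes.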

Details Motivation: Conventional loop closure detection degrades in perceptually aliased environments with sparse features and repetitive textures, and existing solutions often incur high computational cost, so an efficient and robust method is needed. Method: BoWG uses word groups to capture the spatial co-occurrence and proximity of visual words and build an online dictionary; a temporal consistency mechanism inspired by probabilistic transition models is incorporated into the similarity computation; a feature distribution analysis module and post-verification mechanisms further strengthen detection. Result: Experiments on public datasets and a self-built pipe dataset show that BoWG outperforms traditional and learning-based state-of-the-art methods in precision-recall and computational efficiency, with an average processing time of 16 ms per image on the Bicocca25b dataset. Conclusion: BoWG achieves accurate and efficient loop closure detection in challenging environments, with good scalability and practical potential. Abstract: Loop closure is critical in Simultaneous Localization and Mapping (SLAM) systems to reduce accumulative drift and ensure global mapping consistency. However, conventional methods struggle in perceptually aliased environments, such as narrow pipes, due to vector quantization, feature sparsity, and repetitive textures, while existing solutions often incur high computational costs. This paper presents Bag-of-Word-Groups (BoWG), a novel loop closure detection method that achieves superior precision-recall, robustness, and computational efficiency. The core innovation lies in the introduction of word groups, which capture the spatial co-occurrence and proximity of visual words to construct an online dictionary. Additionally, drawing inspiration from probabilistic transition models, we incorporate temporal consistency directly into similarity computation with an adaptive scheme, substantially improving precision-recall performance. The method is further strengthened by a feature distribution analysis module and dedicated post-verification mechanisms. To evaluate the effectiveness of our method, we conduct experiments on both public datasets and a confined-pipe dataset we constructed. Results demonstrate that BoWG surpasses state-of-the-art methods, including both traditional and learning-based approaches, in terms of precision-recall and computational efficiency. Our approach also exhibits excellent scalability, achieving an average processing time of 16 ms per image across 17,565 images in the Bicocca25b dataset.

[216] SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning

Chen Chen,Majid Abdolshah,Violetta Shevchenko,Hongdong Li,Chang Xu,Pulak Purkait

Main category: cs.CV

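TL;DR: Proposes a new plug-and-play Spatially Re-focused Super-Resolution (SRSR) framework that uses Spatially Re-focused Cross-Attention (SRCA) and Spatially Targeted Classifier-Free Guidance (STCFG) to improve the accuracy of text guidance and the quality of generated images.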

Details Motivation: Existing diffusion-based super-resolution methods suffer from semantic ambiguity and distorted details because text conditioning is inaccurate and incomplete and cross-attention tends to attend to irrelevant pixels. Method: Two new mechanisms are introduced: 1) Spatially Re-focused Cross-Attention (SRCA), which uses visually grounded segmentation masks to refine text conditioning at inference time; 2) Spatially Targeted Classifier-Free Guidance (STCFG), which selectively blocks text influence on ungrounded pixels to prevent hallucinations. Result: Across multiple synthetic and real-world datasets, SRSR outperforms seven state-of-the-art baselines on fidelity metrics (PSNR, SSIM) on all datasets, and on perceptual quality measures (LPIPS, DISTS) on two real-world benchmarks. Conclusion: SRSR improves the semantic consistency and perceptual quality of super-resolution results and is general and practical. Abstract: Existing diffusion-based super-resolution approaches often exhibit semantic ambiguities due to inaccuracies and incompleteness in their text conditioning, coupled with the inherent tendency for cross-attention to divert towards irrelevant pixels. These limitations can lead to semantic misalignment and hallucinated details in the generated high-resolution outputs. To address these, we propose a novel, plug-and-play spatially re-focused super-resolution (SRSR) framework that consists of two core components: first, we introduce Spatially Re-focused Cross-Attention (SRCA), which refines text conditioning at inference time by applying visually-grounded segmentation masks to guide cross-attention. Second, we introduce a Spatially Targeted Classifier-Free Guidance (STCFG) mechanism that selectively bypasses text influences on ungrounded pixels to prevent hallucinations. Extensive experiments on both synthetic and real-world datasets demonstrate that SRSR consistently outperforms seven state-of-the-art baselines in standard fidelity metrics (PSNR and SSIM) across all datasets, and in perceptual quality measures (LPIPS and DISTS) on two real-world benchmarks, underscoring its effectiveness in achieving both high semantic fidelity and perceptual quality in super-resolution.
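One way to picture spatially targeted guidance is to start from standard classifier-free guidance, which combines the unconditional and text-conditioned noise predictions as `eps_u + s * (eps_c - eps_u)`, and gate the text-dependent term with a per-pixel mask. The sketch below assumes exactly that form; the abstract does not give the formula, so the masking rule, the binary `mask` convention (1 for grounded pixels, 0 for ungrounded), and the function name are assumptions.

```python
def masked_cfg(eps_uncond, eps_cond, mask, scale):
    """Per-pixel classifier-free guidance gated by a grounding mask.

    eps_uncond, eps_cond: 2D lists of noise predictions (same shape).
    mask: 2D list, 1 where a text tag is visually grounded, 0 elsewhere.
    Ungrounded pixels fall back to the unconditional prediction, so the
    text prompt cannot influence them (assumed reading of STCFG).
    """
    return [
        [u + scale * m * (c - u) for u, c, m in zip(row_u, row_c, row_m)]
        for row_u, row_c, row_m in zip(eps_uncond, eps_cond, mask)
    ]

if __name__ == "__main__":
    eps_u = [[0.0, 1.0]]
    eps_c = [[1.0, 3.0]]
    mask = [[1, 0]]
    print(masked_cfg(eps_u, eps_c, mask, scale=2.0))
```

With an all-ones mask this reduces to ordinary classifier-free guidance, which is a useful sanity check when trying the idea inside an existing sampler.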

[217] STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models

Mahiro Ukai,Shuhei Kurita,Nakamasa Inoue

Main category: cs.CV

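TL;DR: Introduces STATUS Bench, the first benchmark for evaluating how well vision-language models (VLMs) understand subtle changes in object state, covering three tasks (object state identification, image retrieval, and state change identification), together with STATUS Train, a large-scale training dataset. Experiments show that existing VLMs perform poorly on these tasks, underscoring the value of the new benchmark and dataset.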

Details Motivation: It remains unclear how well existing VLMs recognize object states (e.g., open/closed, on/off), and no dedicated benchmark rigorously evaluates their understanding of subtle state changes. Method: Proposes the STATUS Bench benchmark, built on a hand-crafted dataset of image pairs, state descriptions, and state change descriptions, with a three-in-one evaluation scheme (OSI, IR, SCI); also constructs STATUS Train, a semi-automatically created training dataset of 13 million descriptions, and runs fine-tuning experiments on VLMs. Result: Current state-of-the-art VLMs perform poorly on STATUS Bench, and most open-weight models show chance-level zero-shot performance; after fine-tuning on STATUS Train, Qwen2.5-VL reaches performance comparable to Gemini 2.0 Flash. Conclusion: STATUS Bench and STATUS Train are necessary and effective resources for advancing object state recognition in VLMs; they expose the limitations of current models and point to directions for improvement. Abstract: Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their corresponding object state descriptions and state change descriptions. Furthermore, we introduce a large-scale training dataset, namely STATUS Train, which consists of 13 million semi-automatically created descriptions. This dataset serves as the largest resource to facilitate further research in this area. In our experiments, we demonstrate that STATUS Bench enables rigorous consistency evaluation and reveal that current state-of-the-art VLMs still significantly struggle to capture subtle object state distinctions. Surprisingly, under the proposed rigorous evaluation scheme, most open-weight VLMs exhibited chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.
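
To make the "three tasks simultaneously" idea in the STATUS Bench entry concrete, here is a minimal sketch of a joint consistency score. It assumes that a sample counts as correct only when the OSI, IR, and SCI answers are all correct; that scoring rule, the record layout, and the toy data are assumptions, since the abstract does not spell out the exact metric.

```python
def per_task_accuracy(records, task):
    """Fraction of samples answered correctly for one task."""
    return sum(1 for r in records if r[task]) / len(records)


def joint_accuracy(records):
    """Fraction of samples where all three tasks are correct at once."""
    passed = sum(1 for r in records if r["osi"] and r["ir"] and r["sci"])
    return passed / len(records)


# Hypothetical results for four samples (True = task answered correctly).
records = [
    {"osi": True, "ir": True, "sci": True},
    {"osi": False, "ir": True, "sci": True},
    {"osi": True, "ir": False, "sci": True},
    {"osi": True, "ir": True, "sci": False},
]
osi_acc = per_task_accuracy(records, "osi")
joint = joint_accuracy(records)
```

Here each task taken alone looks fairly strong (3 of 4 correct), while only 1 of 4 samples is consistent across all three, which shows why a joint criterion is stricter than per-task accuracy.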

[218] MELDAE: A Framework for Micro-Expression Spotting, Detection, and Automatic Evaluation in In-the-Wild Conversational Scenes

Yigui Feng,Qinglin Wang,Yang Liu,Ke Liu,Haotian Mo,Enhao Huang,Gencheng Liu,Mingzhe Liu,Jie Liu

Main category: cs.CV

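TL;DR: Presents the first micro-expression dataset for natural conversational scenes and an end-to-end detection framework, MELDAE, with a boundary-aware loss function that substantially improves temporal localization accuracy.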

Details Motivation: Most micro-expression research relies on datasets collected in laboratory settings, and performance drops sharply in complex real-world scenes such as natural conversation, making spontaneous, unconscious micro-expressions hard to detect accurately. Method: Makes three contributions: the first micro-expression dataset for natural conversation; MELDAE, an end-to-end micro-expression localization and detection framework; and a new boundary-aware loss function that reduces prediction errors at onset and offset times. Result: On the WDMD dataset, the method improves the F1_{DR} localization metric by 17.72% over the strongest baseline and generalizes well to several existing benchmarks. Conclusion: The work improves micro-expression detection in real-world scenes and offers a more reliable basis for emotion recognition in practical applications. Abstract: Accurately analyzing spontaneous, unconscious micro-expressions is crucial for revealing true human emotions, but this task remains challenging in wild scenarios, such as natural conversation. Existing research largely relies on datasets from controlled laboratory environments, and their performance degrades dramatically in the real world. To address this issue, we propose three contributions: the first micro-expression dataset focused on conversational-in-the-wild scenarios; an end-to-end localization and detection framework, MELDAE; and a novel boundary-aware loss function that improves temporal accuracy by penalizing onset and offset errors. Extensive experiments demonstrate that our framework achieves state-of-the-art results on the WDMD dataset, improving the key F1_{DR} localization metric by 17.72% over the strongest baseline, while also demonstrating excellent generalization capabilities on existing benchmarks.

[219] From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy

Feng He,Guodong Tan,Qiankun Li,Jun Yu,Quan Wen

Main category: cs.CV

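TL;DR: Presents three contributions for 3D reconstruction in XLFM light field microscopy: the large-scale XLFM-Zebrafish benchmark dataset; Masked View Modeling for Light Fields (MVN-LF), a self-supervised method that learns angular priors; and the Optical Rendering Consistency Loss (ORC Loss), which strengthens physical consistency. Experiments show a 7.7% PSNR improvement over existing state-of-the-art methods.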

Details Motivation: Learning-based 3D reconstruction for XLFM has been held back by the absence of standardized datasets and of physically grounded methods that efficiently model the angular-spatial structure. Method: Proposes MVN-LF, a self-supervised task that learns angular priors by predicting occluded views, and introduces ORC Loss, which uses differentiable rendering to enforce consistency between predicted volumes and their point-spread-function (PSF) forward projections. Result: On the XLFM-Zebrafish benchmark, the method improves PSNR by 7.7% over the best existing baselines. Conclusion: By building a benchmark dataset and introducing a self-supervised learning strategy and a physics-consistent loss, the work substantially improves XLFM 3D reconstruction and advances learning-based methods in neural imaging. Abstract: Light field microscopy (LFM) has become an emerging tool in neuroscience for large-scale neural imaging in vivo, notable for its single-exposure volumetric imaging, broad field of view, and high temporal resolution. However, learning-based 3D reconstruction in XLFM remains underdeveloped due to two core challenges: the absence of standardized datasets and the lack of methods that can efficiently model its angular-spatial structure while remaining physically grounded. We address these challenges by introducing three key contributions. First, we construct the XLFM-Zebrafish benchmark, a large-scale dataset and evaluation suite for XLFM reconstruction. Second, we propose Masked View Modeling for Light Fields (MVN-LF), a self-supervised task that learns angular priors by predicting occluded views, improving data efficiency. Third, we formulate the Optical Rendering Consistency Loss (ORC Loss), a differentiable rendering constraint that enforces alignment between predicted volumes and their PSF-based forward projections. On the XLFM-Zebrafish benchmark, our method improves PSNR by 7.7% over state-of-the-art baselines.

[220] Cross-View UAV Geo-Localization with Precision-Focused Efficient Design: A Hierarchical Distillation Approach with Multi-view Refinement

Jian Sun,Kangdao Liu,Chi Zhang,Chuangquan Chen,Junge Shen,Chi-Man Vong

Main category: cs.CV

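TL;DR: Proposes PFED, an efficient cross-view geo-localization framework that combines hierarchical knowledge transfer with multi-view representation refinement, maintaining high accuracy while greatly reducing computational cost, which makes it suitable for real-time UAV localization on edge devices.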

Details Motivation: Existing cross-view geo-localization methods depend on complex feature extraction and alignment mechanisms, which make inference expensive and hard to deploy on resource-constrained edge devices. Method: Proposes the PFED framework, which combines hierarchical distillation (HD-CVGL) and Uncertainty-Aware Prediction Alignment (UAPA) during training with a Multi-view Refinement Module (MRM) during inference, improving efficiency and performance through knowledge distillation and redundant-sample filtering. Result: Reaches 97.15% Recall@1 on University-1652 with more than 5 times fewer FLOPs and more than 3 times higher speed, and runs real-time inference at 251.5 FPS on the AGX Orin. Conclusion: PFED strikes a strong balance between accuracy and efficiency and is practical to deploy for real-time autonomous UAV navigation in GNSS-denied environments. Abstract: Cross-view geo-localization (CVGL) enables UAV localization by matching aerial images to geo-tagged satellite databases, which is critical for autonomous navigation in GNSS-denied environments. However, existing methods rely on resource-intensive fine-grained feature extraction and alignment, where multiple branches and modules significantly increase inference costs, limiting their deployment on edge devices. We propose Precision-Focused Efficient Design (PFED), a resource-efficient framework combining hierarchical knowledge transfer and multi-view representation refinement. This innovative method comprises two key components: 1) During training, Hierarchical Distillation paradigm for fast and accurate CVGL (HD-CVGL), coupled with Uncertainty-Aware Prediction Alignment (UAPA) to distill essential information and mitigate the data imbalance without incurring additional inference overhead. 2) During inference, an efficient Multi-view Refinement Module (MRM) leverages mutual information to filter redundant samples and effectively utilize the multi-view data. Extensive experiments show that PFED achieves state-of-the-art performance in both accuracy and efficiency, reaching 97.15% Recall@1 on University-1652 while being over $5 \times$ more efficient in FLOPs and $3 \times$ faster than previous top methods. Furthermore, PFED runs at 251.5 FPS on the AGX Orin edge device, demonstrating its practical viability for real-time UAV applications. The project is available at https://github.com/SkyEyeLoc/PFED
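
The Recall@1 figure quoted in the PFED entry is a retrieval metric. The sketch below shows one common way to compute Recall@K from a query-to-gallery similarity matrix; the function name, the label layout, and the toy scores are assumptions for illustration and are not taken from the PFED code base.

```python
def recall_at_k(similarity, query_labels, gallery_labels, k=1):
    """Fraction of queries whose top-k gallery matches contain the true label.

    similarity[i][j] is the score between query i and gallery item j
    (higher means more similar).
    """
    hits = 0
    for scores, true_label in zip(similarity, query_labels):
        # Rank gallery indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(gallery_labels[j] == true_label for j in ranked[:k]):
            hits += 1
    return hits / len(query_labels)


# Toy example: three UAV queries against three satellite gallery images.
similarity = [
    [0.9, 0.1, 0.2],  # query "a": best match is gallery "a" (hit)
    [0.1, 0.3, 0.8],  # query "b": best match is gallery "c" (miss at k=1)
    [0.1, 0.2, 0.7],  # query "c": best match is gallery "c" (hit)
]
query_labels = ["a", "b", "c"]
gallery_labels = ["a", "b", "c"]
r1 = recall_at_k(similarity, query_labels, gallery_labels, k=1)
r2 = recall_at_k(similarity, query_labels, gallery_labels, k=2)
```

In the toy data two of the three queries are matched at rank 1, and the remaining query is recovered once the top two candidates are considered.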

[221] PSScreen V2: Partially Supervised Multiple Retinal Disease Screening

Boyi Zheng,Yalin Zheng,Hrvoje Bogunović,Qing Liu

Main category: cs.CV

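TL;DR: Proposes PSScreen V2, a partially supervised self-training framework for screening multiple retinal diseases that learns from several partially labelled datasets with different distributions, using a three-branch architecture and low-frequency feature augmentation to achieve strong cross-domain generalization.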

Details Motivation: Existing methods rely on fully labelled or single-domain datasets and cope poorly with missing labels and domain shift, which limits their use on real multi-source medical data. Method: Uses a three-branch architecture: a teacher network generates pseudo labels from weakly augmented images to compensate for missing labels, and two student networks apply new feature augmentation strategies, Low-Frequency Dropout (LF-Dropout) and Low-Frequency Uncertainty (LF-Uncert), which improve domain robustness by discarding low-frequency components and model uncertain domain variability through adversarially learned Gaussian perturbations, respectively. Result: Achieves state-of-the-art performance and strong out-of-domain generalization on multiple fundus image datasets, and the framework's generality and adaptability are confirmed with different backbones (including DINOv2) and on chest X-ray datasets. Conclusion: PSScreen V2 addresses the missing-label and domain-shift challenges of multi-source medical data and shows good generality and potential for clinical use. Abstract: In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-domain datasets, PSScreen V2 is designed to learn from multiple partially labelled datasets with different distributions, addressing both label absence and domain shift challenges. To this end, PSScreen V2 adopts a three-branch architecture with one teacher and two student networks. The teacher branch generates pseudo labels from weakly augmented images to address missing labels, while the two student branches introduce novel feature augmentation strategies: Low-Frequency Dropout (LF-Dropout), which enhances domain robustness by randomly discarding domain-related low-frequency components, and Low-Frequency Uncertainty (LF-Uncert), which estimates uncertain domain variability via adversarially learned Gaussian perturbations of low-frequency statistics. Extensive experiments on multiple in-domain and out-of-domain fundus datasets demonstrate that PSScreen V2 achieves state-of-the-art performance and superior domain generalization ability. Furthermore, compatibility tests with diverse backbones, including the vision foundation model DINOv2, as well as evaluations on chest X-ray datasets, highlight the universality and adaptability of the proposed framework. The codes are available at https://github.com/boyiZheng99/PSScreen_V2.

[222] Projection Embedded Diffusion Bridge for CT Reconstruction from Incomplete Data

Yuang Wang,Pengfei Jin,Siyeop Yoon,Matthew Tivnan,Shaoyang Zhang,Li Zhang,Quanzheng Li,Zhiqiang Chen,Dufan Wu

Main category: cs.CV

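TL;DR: Proposes the Projection Embedded Diffusion Bridge (PEDB), a new diffusion-bridge method for CT image reconstruction that explicitly incorporates the incomplete projection data into the reverse stochastic differential equation, yielding better data consistency and detail recovery; it outperforms existing diffusion bridge models on sparse-view, limited-angle, and truncated-projection data.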

Details Motivation: CT reconstruction from incomplete projection data is ill-posed; existing diffusion bridge models can restore images from Filtered Back Projection (FBP) results but do not incorporate data consistency effectively, which limits reconstruction quality and detail recovery. Method: Proposes the Projection Embedded Diffusion Bridge (PEDB), which designs a new reverse stochastic differential equation (SDE) that embeds the incomplete projection data directly into the score function and derives a tractable expression for the posterior score; it also introduces a free parameter that controls the stochasticity of the reverse process and a discretization scheme that reduces error. Result: Extensive experiments on three types of incomplete data (sparse-view, limited-angle, and truncated projections) show that PEDB outperforms existing state-of-the-art diffusion bridge models in standard, noisy, and domain-shift evaluations. Conclusion: By explicitly using the projection data, PEDB achieves better data consistency and strong performance across several incomplete-data CT reconstruction tasks, pointing to a new direction for diffusion models in medical image reconstruction. Abstract: Reconstructing CT images from incomplete projection data remains challenging due to the ill-posed nature of the problem. Diffusion bridge models have recently shown promise in restoring clean images from their corresponding Filtered Back Projection (FBP) reconstructions, but incorporating data consistency into these models remains largely underexplored. Incorporating data consistency can improve reconstruction fidelity by aligning the reconstructed image with the observed projection data, and can enhance detail recovery by integrating structural information contained in the projections. In this work, we propose the Projection Embedded Diffusion Bridge (PEDB). PEDB introduces a novel reverse stochastic differential equation (SDE) to sample from the distribution of clean images conditioned on both the FBP reconstruction and the incomplete projection data. By explicitly conditioning on the projection data in sampling the clean images, PEDB naturally incorporates data consistency. We embed the projection data into the score function of the reverse SDE. Under certain assumptions, we derive a tractable expression for the posterior score. In addition, we introduce a free parameter to control the level of stochasticity in the reverse process. We also design a discretization scheme for the reverse SDE to mitigate discretization error. Extensive experiments demonstrate that PEDB achieves strong performance in CT reconstruction from three types of incomplete data, including sparse-view, limited-angle, and truncated projections. For each of these types, PEDB outperforms evaluated state-of-the-art diffusion bridge models across standard, noisy, and domain-shift evaluations.

[223] SWAN: Self-supervised Wavelet Neural Network for Hyperspectral Image Unmixing

Yassh Ramchandani,Vijayashekhar S S,Jignesh S. Bhatt

Main category: cs.CV

TL;DR: Proposes SWAN, a three-stage self-supervised wavelet neural network for jointly estimating endmembers and abundances from hyperspectral images.

Details Motivation: To exploit the sparse, distributed, multi-scale representations offered by the wavelet transform, together with a self-supervised learning paradigm, to uncover latent symmetries in hyperspectral data and enable efficient unmixing without ground-truth labels. Method: Expands the hyperspectral band images into a biorthogonal wavelet basis space and performs encoding, reconstruction, and physical modelling in three stages; trains with a combined loss in the image acquisition domain under self-supervision, optimized with Adam, with Sigmoid and dropout to prevent overfitting and kernel regularizers to preserve spatial variations in the endmember coefficients. Result: On several synthetic and real hyperspectral datasets, SWAN outperforms a range of state-of-the-art neural-network-based unmixing methods in both qualitative and quantitative evaluation, with greater robustness and compact network parameters. Conclusion: Through self-supervised learning and multi-scale wavelet representations, SWAN achieves effective hyperspectral unmixing, can be trained without ground truth, and is suited to practical use. Abstract: In this article, we present SWAN: a three-stage, self-supervised wavelet neural network for joint estimation of endmembers and abundances from hyperspectral imagery. The contiguous and overlapping hyperspectral band images are first expanded to Biorthogonal wavelet basis space that provides sparse, distributed, and multi-scale representations. The idea is to exploit latent symmetries from thus obtained invariant and covariant features using a self-supervised learning paradigm. The first stage, SWANencoder maps the input wavelet coefficients to a compact lower-dimensional latent space. The second stage, SWANdecoder uses the derived latent representation to reconstruct the input wavelet coefficients. Interestingly, the third stage SWANforward learns the underlying physics of the hyperspectral image. A three-stage combined loss function is formulated in the image acquisition domain that eliminates the need for ground truth and enables self-supervised training. Adam is employed for optimizing the proposed loss function, while Sigmoid with a dropout of 0.3 is incorporated to avoid possible overfitting. Kernel regularizers bound the magnitudes and preserve spatial variations in the estimated endmember coefficients. The output of SWANencoder represents estimated abundance maps during inference, while weights of SWANdecoder are retrieved to extract endmembers. Experiments are conducted on two benchmark synthetic data sets with different signal-to-noise ratios as well as on three real benchmark hyperspectral data sets while comparing the results with several state-of-the-art neural network-based unmixing methods. The qualitative, quantitative, and ablation results show performance enhancement by learning a resilient unmixing function as well as promoting self-supervision and compact network parameters for practical applications.

[224] Cross-Species Transfer Learning in Agricultural AI: Evaluating ZebraPose Adaptation for Dairy Cattle Pose Estimation

Mackenzie Tapp,Sibi Chakravarthy Parivendan,Kashfia Sailunaz,Suresh Neethirajan

Main category: cs.CV

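TL;DR: Evaluates the potential and limits of cross-species transfer learning for dairy cattle pose estimation, using the ZebraPose model, trained on synthetic zebra imagery, for 27-keypoint detection. The model performs well on in-distribution data (AP = 0.86) but generalizes poorly to new barns and cow populations, exposing the synthetic-to-real domain gap and motivating a call for agricultural AI to prioritize farm-level realism and cross-environment robustness.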

Details Motivation: Pose estimation in agricultural settings is limited by the lack of large annotated livestock datasets, especially for dairy cattle, which motivates examining whether cross-species transfer learning can ease the data scarcity. Method: Adapts ZebraPose, a vision-transformer-based model trained on synthetic zebra imagery, to dairy cattle pose estimation and evaluates it under three configurations: a custom on-farm dataset (375 images), a subset of APT-36K, and their combination, assessing accuracy and generalization across environments. Result: The combined model performs well on in-distribution data (AP = 0.86, AR = 0.87, PCK 0.5 = 0.869) but fails to generalize to unseen barns and cow populations, indicating that morphological similarity is not enough for cross-domain transfer and that the synthetic-to-real domain gap is the main obstacle. Conclusion: Cross-species transfer alone cannot overcome the domain gap in agricultural AI; the authors call for agriculture-first AI design that emphasizes farm-level realism, cross-environment robustness, and open benchmark datasets to enable trustworthy and scalable animal-centric technologies. Abstract: Pose estimation serves as a cornerstone of computer vision for understanding animal posture, behavior, and welfare. Yet, agricultural applications remain constrained by the scarcity of large, annotated datasets for livestock, especially dairy cattle. This study evaluates the potential and limitations of cross-species transfer learning by adapting ZebraPose, a vision transformer-based model trained on synthetic zebra imagery, for 27-keypoint detection in dairy cows under real barn conditions. Using three configurations (a custom on-farm dataset of 375 images from Sussex, New Brunswick, Canada; a subset of the APT-36K benchmark dataset; and their combination), we systematically assessed model accuracy and generalization across environments. While the combined model achieved promising performance (AP = 0.86, AR = 0.87, PCK 0.5 = 0.869) on in-distribution data, substantial generalization failures occurred when applied to unseen barns and cow populations. These findings expose the synthetic-to-real domain gap as a major obstacle to agricultural AI deployment and emphasize that morphological similarity between species is insufficient for cross-domain transfer. The study provides practical insights into dataset diversity, environmental variability, and computational constraints that influence real-world deployment of livestock monitoring systems. We conclude with a call for agriculture-first AI design, prioritizing farm-level realism, cross-environment robustness, and open benchmark datasets to advance trustworthy and scalable animal-centric technologies.
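
The "PCK 0.5" value reported in the cattle pose entry refers to the Percentage of Correct Keypoints at a 0.5 threshold. The sketch below shows a common form of the metric; it assumes the threshold is a fraction of a per-animal reference length (for example a bounding-box side), which is an assumption here because the abstract does not state the normalization used.

```python
import math


def pck(predicted, ground_truth, ref_length, alpha=0.5):
    """Fraction of keypoints within alpha * ref_length of the ground truth.

    predicted / ground_truth: lists of (x, y) keypoints in the same order.
    ref_length: normalizing length in pixels (hypothetical choice).
    """
    threshold = alpha * ref_length
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(ground_truth)


# Toy example with four keypoints and a reference length of 10 pixels,
# so a prediction is correct when it lies within 5 pixels of the target.
gt = [(0, 0), (10, 0), (0, 10), (10, 10)]
pred = [(1, 0), (10, 4), (0, 16), (10, 10)]  # errors: 1, 4, 6, 0 pixels
score = pck(pred, gt, ref_length=10, alpha=0.5)
```

Three of the four toy keypoints fall inside the 5-pixel radius, so the score is 0.75. A looser or tighter alpha changes the score without any change in the predictions, which is why the threshold is always quoted alongside the number.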

[225] Robust Atypical Mitosis Classification with DenseNet121: Stain-Aware Augmentation and Hybrid Loss for Domain Generalization

Adinath Dukre,Ankan Deria,Yutong Xie,Imran Razzak

Main category: cs.CV

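TL;DR: Proposes a DenseNet-121-based framework that combines stain-aware augmentation with imbalance-aware learning for atypical mitosis classification in MIDOG 2025 (Track 2), showing good generalization across multiple independent domains.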

Details Motivation: Atypical mitotic figures are important biomarkers of tumor aggressiveness, but reliable recognition remains difficult because of severe class imbalance and variability across imaging domains. Method: Uses a DenseNet-121-based framework with Macenko stain-aware augmentation and geometric and intensity transformations; achieves imbalance-aware learning through weighted sampling and a hybrid objective that combines class-weighted binary cross-entropy with focal loss; trained end-to-end with AdamW. Result: On the official test set, reaches a balanced accuracy of 85.0%, an AUROC of 0.927, a sensitivity of 89.2%, and a specificity of 80.9%, and generalizes well under scanner and staining shifts. Conclusion: Combining DenseNet-121, stain-aware augmentation, and imbalance-adaptive objectives yields a robust, domain-generalizable framework for atypical mitosis classification that is suitable for real-world computational pathology workflows. Abstract: Atypical mitotic figures are important biomarkers of tumor aggressiveness in histopathology, yet reliable recognition remains challenging due to severe class imbalance and variability across imaging domains. We present a DenseNet-121-based framework tailored for atypical mitosis classification in the MIDOG 2025 (Track 2) setting. Our method integrates stain-aware augmentation (Macenko), geometric and intensity transformations, and imbalance-aware learning via weighted sampling with a hybrid objective combining class-weighted binary cross-entropy and focal loss. Trained end-to-end with AdamW and evaluated across multiple independent domains, the model demonstrates strong generalization under scanner and staining shifts, achieving balanced accuracy 85.0%, AUROC 0.927, sensitivity 89.2%, and specificity 80.9% on the official test set. These results indicate that combining DenseNet-121 with stain-aware augmentation and imbalance-adaptive objectives yields a robust, domain-generalizable framework for atypical mitosis classification suitable for real-world computational pathology workflows.
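
Two pieces of the atypical mitosis entry lend themselves to a short worked sketch: the balanced accuracy, which is the mean of sensitivity and specificity, and the hybrid objective that mixes class-weighted binary cross-entropy with focal loss. The mixing weight `lam`, the class weights, and `gamma` below are hypothetical values chosen for illustration; the abstract does not report the ones used.

```python
import math

EPS = 1e-7  # keeps log() away from zero


def balanced_accuracy(sensitivity, specificity):
    """Mean of the true-positive and true-negative rates."""
    return (sensitivity + specificity) / 2


def weighted_bce(p, y, w_pos=2.0, w_neg=1.0):
    """Class-weighted binary cross-entropy for one prediction p in (0, 1)."""
    p = min(max(p, EPS), 1 - EPS)
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))


def focal_loss(p, y, gamma=2.0):
    """Focal loss: down-weights examples that are already well classified."""
    p = min(max(p, EPS), 1 - EPS)
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)


def hybrid_loss(p, y, lam=0.5):
    """Convex mix of the two terms (lam is an assumed mixing weight)."""
    return lam * weighted_bce(p, y) + (1 - lam) * focal_loss(p, y)


# Worked check against the reported sensitivity and specificity.
ba = balanced_accuracy(0.892, 0.809)
easy = hybrid_loss(0.9, 1)  # confident and correct
hard = hybrid_loss(0.1, 1)  # confident and wrong
```

The mean of the reported sensitivity (89.2%) and specificity (80.9%) is about 85.05%, which agrees with the reported balanced accuracy of 85.0% up to rounding of the inputs.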

[226] A Critical Study on Tea Leaf Disease Detection using Deep Learning Techniques

Nabajyoti Borah,Raju Moni Borah,Bandan Boruah,Purnendu Bikash Acharjee,Sajal Saha,Ripjyoti Hazarika

Main category: cs.CV

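TL;DR: Proposes a deep-learning method for tea leaf disease classification that recognizes three diseases caused by pests and pathogens (Red Rust, Helopeltis, and Red Spider Mite) and uses Mask R-CNN to segment and quantify the damaged leaf area. Two object detection models, SSD MobileNet V2 and Faster R-CNN ResNet50 V1, are compared, and the latter performs better on metrics such as mAP (25%).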

Details Motivation: Tea leaf diseases seriously affect yield and quality, and traditional manual inspection is inefficient and depends on experience, so an automated and accurate method for disease recognition and damage assessment is needed. Method: Uses SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for disease object detection and Mask R-CNN for pixel-level instance segmentation, and proposes a custom method for computing the diseased area of a leaf. Result: Over the IoU range 0.50:0.95, Faster R-CNN ResNet50 V1 reaches a precision of 0.252, a recall of 0.044, and an mAP of 25%, better than the 20.9% of SSD MobileNet V2; the diseased regions are also analyzed quantitatively. Conclusion: Faster R-CNN outperforms SSD MobileNet V2 for tea leaf disease detection, and combining it with Mask R-CNN enables disease recognition and precise segmentation of damaged regions, offering a feasible approach to automated tea disease diagnosis. Abstract: The proposed solution is a deep learning technique that classifies three types of tea leaf diseases, two caused by pests and one by pathogens (infectious organisms) and environmental conditions, namely Red Rust, Helopeltis, and Red Spider Mite, and also shows the leaf area damaged by a disease. In this paper, we evaluate two object detection models, SSD MobileNet V2 and Faster R-CNN ResNet50 V1. SSD MobileNet V2 gave a precision of 0.209 over the IoU range 0.50:0.95, a recall of 0.02 over the same range, and a final mAP of 20.9%. Faster R-CNN ResNet50 V1 gave a precision of 0.252 and a recall of 0.044 over the IoU range 0.50:0.95, with an mAP of 25%, which is better than SSD. We also used Mask R-CNN for object instance segmentation, where we implemented a custom method to calculate the damaged, diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.
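
The tea leaf entry mentions a custom method for computing the damaged portion of a leaf from segmentation output but gives no details. One simple possibility, sketched below purely as an assumption, is the ratio of lesion pixels to leaf pixels in two binary masks; the function name and mask layout are hypothetical and this is not the authors' method.

```python
def damaged_fraction(leaf_mask, lesion_mask):
    """Share of leaf pixels that are also marked as diseased.

    Both masks are 2-D lists of 0/1 with the same shape. Lesion pixels
    that fall outside the leaf mask are ignored.
    """
    leaf_pixels = 0
    damaged_pixels = 0
    for leaf_row, lesion_row in zip(leaf_mask, lesion_mask):
        for is_leaf, is_lesion in zip(leaf_row, lesion_row):
            if is_leaf:
                leaf_pixels += 1
                if is_lesion:
                    damaged_pixels += 1
    return damaged_pixels / leaf_pixels if leaf_pixels else 0.0


# Toy masks: 8 leaf pixels, 2 of them diseased, 1 stray lesion pixel
# outside the leaf that should not be counted.
leaf = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
lesion = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
fraction = damaged_fraction(leaf, lesion)
```

With these toy masks a quarter of the leaf area is counted as damaged. In practice the masks would come from the instance segmentation model, one leaf instance at a time.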

[227] Self-Attention Decomposition For Training Free Diffusion Editing

Tharun Anand,Mohammad Hassan Vali,Arno Solin

Main category: cs.CV

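TL;DR: Proposes an analytical method based on the parameters of a pretrained diffusion model that obtains semantic editing directions directly by computing the eigenvectors of self-attention weight matrices, requiring no additional data or fine-tuning and substantially improving editing efficiency.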

Details Motivation: Existing methods depend on extensive sampling or on training auxiliary networks, which is inefficient and makes fast, precise semantic control over diffusion model outputs difficult. Method: Analyzes the structural information in the self-attention weight matrices of a diffusion model and uses their eigenvectors to extract interpretable semantic editing directions. Result: Produces high-quality image edits on multiple datasets, with editing time reduced by 60% relative to current benchmarks. Conclusion: The method efficiently yields interpretable editing directions without additional data or fine-tuning, providing an efficient and practical route to controllable generation with diffusion models. Abstract: Diffusion models achieve remarkable fidelity in image synthesis, yet precise control over their outputs for targeted editing remains challenging. A key step toward controllability is to identify interpretable directions in the model's latent representations that correspond to semantic attributes. Existing approaches for finding interpretable directions typically rely on sampling large sets of images or training auxiliary networks, which limits efficiency. We propose an analytical method that derives semantic editing directions directly from the pretrained parameters of diffusion models, requiring neither additional data nor fine-tuning. Our insight is that self-attention weight matrices encode rich structural information about the data distribution learned during training. By computing the eigenvectors of these weight matrices, we obtain robust and interpretable editing directions. Experiments demonstrate that our method produces high-quality edits across multiple datasets while reducing editing time significantly by 60% over current benchmarks.

[228] SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery

Qiwei Ma,Zhiyu Wang,Wang Liu,Xukun Lu,Bin Deng,Puhong Duan,Xudong Kang,Shutao Li

Main category: cs.CV

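TL;DR: Introduces the SARCLIP-1M dataset and SARCLIP, the first vision-language foundation model for the synthetic aperture radar (SAR) domain, which aligns SAR images with text through contrastive learning and a domain-transfer strategy and performs strongly on image-text retrieval and zero-shot classification.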

Details Motivation: Existing SAR foundation models focus mainly on low-level visual features and lack multimodal alignment and zero-shot target recognition; this work aims to improve the semantic understanding of SAR imagery and its cross-modal association with text. Method: Builds SARCLIP-1M, a large-scale dataset of over one million text-image pairs, and proposes the SARCLIP model, trained with contrastive vision-language learning and a domain-transfer strategy. Result: On image-text retrieval and zero-shot classification, SARCLIP significantly outperforms existing foundation models and shows strong feature extraction and semantic interpretation. Conclusion: SARCLIP is the first model to align SAR imagery with text effectively, advancing the semantic understanding and multimodal analysis of SAR imagery. Abstract: Synthetic Aperture Radar (SAR) has emerged as a crucial imaging modality due to its all-weather capabilities. While recent advancements in self-supervised learning and Masked Image Modeling (MIM) have paved the way for SAR foundation models, these approaches primarily focus on low-level visual features, often overlooking multimodal alignment and zero-shot target recognition within SAR imagery. To address this limitation, we construct SARCLIP-1M, a large-scale vision language dataset comprising over one million text-image pairs aggregated from existing datasets. We further introduce SARCLIP, the first vision language foundation model tailored for the SAR domain. Our SARCLIP model is trained using a contrastive vision language learning approach with a domain-transfer strategy, enabling it to bridge the gap between SAR imagery and textual descriptions. Extensive experiments on image-text retrieval and zero-shot classification tasks demonstrate the superior performance of SARCLIP in feature extraction and interpretation, significantly outperforming state-of-the-art foundation models and advancing the semantic understanding of SAR imagery. The code and datasets will be released soon.
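
The contrastive vision-language training mentioned in the SARCLIP entry is usually built on a symmetric image-text loss of the kind popularized by CLIP. The sketch below implements that generic loss in plain Python; it assumes the CLIP-style formulation, and the temperature value and toy embeddings are illustrative assumptions rather than details from the SARCLIP paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_sum for v in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over image-to-text and text-to-image logits.

    Pair i of image_emb and text_emb is treated as the matching pair.
    """
    images = [_normalize(v) for v in image_emb]
    texts = [_normalize(v) for v in text_emb]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    image_to_text = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(_log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: aligned pairs versus deliberately swapped captions.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(image_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(image_batch, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image embedding points the same way as its own caption and becomes large when the captions are swapped, which is the signal that pulls matching SAR image and text embeddings together during training.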

[229] LVD-GS: Gaussian Splatting SLAM for Dynamic Scenes via Hierarchical Explicit-Implicit Representation Collaboration Rendering

Wenkai Zhu,Xu Li,Qimin Xu,Benwu Wang,Kun Wei,Yiming Peng,Zihang Wang

Main category: cs.CV

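TL;DR: Proposes LVD-GS, a new LiDAR-visual 3D Gaussian Splatting SLAM system that achieves better mapping performance in large-scale dynamic scenes through a hierarchical collaborative representation module and a joint dynamic modeling module.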

Details Motivation: Existing methods rely on a single representation scheme, which leads to cumulative pose errors and scale ambiguity in large-scale dynamic outdoor scenes and limits performance. Method: Introduces a hierarchical collaborative representation module, motivated by the human chain-of-thought process, in which the representations reinforce each other for mapping optimization; and proposes a joint dynamic modeling module that fuses open-world segmentation with implicit residual constraints to generate fine-grained dynamic masks, guided by uncertainty estimates from DINO-Depth features, to remove the influence of dynamic objects. Result: Experiments on KITTI, nuScenes, and self-collected datasets show that the method outperforms existing methods in accuracy and robustness, reaching state-of-the-art performance. Conclusion: Through multimodal collaborative representation and dynamic object handling, LVD-GS mitigates scale drift and improves reconstruction quality and localization accuracy in complex dynamic environments. Abstract: 3D Gaussian Splatting SLAM has emerged as a widely used technique for high-fidelity mapping in spatial intelligence. However, existing methods often rely on a single representation scheme, which limits their performance in large-scale dynamic outdoor scenes and leads to cumulative pose errors and scale ambiguity. To address these challenges, we propose LVD-GS, a novel LiDAR-Visual 3D Gaussian Splatting SLAM system. Motivated by the human chain-of-thought process for information seeking, we introduce a hierarchical collaborative representation module that facilitates mutual reinforcement for mapping optimization, effectively mitigating scale drift and enhancing reconstruction robustness. Furthermore, to effectively eliminate the influence of dynamic objects, we propose a joint dynamic modeling module that generates fine-grained dynamic masks by fusing open-world segmentation with implicit residual constraints, guided by uncertainty estimates from DINO-Depth features. Extensive evaluations on KITTI, nuScenes, and self-collected datasets demonstrate that our approach achieves state-of-the-art performance compared to existing methods.

[230] Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

Anna Deichler,Jonas Beskow

Main category: cs.CV

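TL;DR: Look and Tell is a multimodal dataset for studying referential expressions across egocentric and exocentric views, containing synchronized gaze, speech, and video recorded with smart glasses and stationary cameras.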

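Details Motivation: To advance the ability of embodied agents to understand and take part in situated dialogue, and to study how different spatial representations affect multimodal reference resolution. Method: Uses Meta Project Aria smart glasses and stationary cameras to record synchronized gaze, speech, and video as 25 participants instruct a partner to identify ingredients in a kitchen, combined with 3D scene reconstructions. Result: The dataset contains 3.67 hours of recordings with 2,707 richly annotated referential expressions and supports benchmarking of 2D versus 3D and egocentric versus exocentric spatial representations. Conclusion: The dataset is a valuable resource for evaluating the role of different spatial representations in multimodal grounding and supports the development of embodied agents that can hold effective dialogue in real-world scenes. Abstract: We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.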

[231] Alias-Free ViT: Fractional Shift Invariance via Linear Attention

Hagay Michaeli,Daniel Soudry

Main category: cs.CV

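TL;DR: Proposes an Alias-Free Vision Transformer (Alias-Free ViT) that achieves shift equivariance to both integer and fractional translations through alias-free downsampling and linear cross-covariance attention, improving robustness to translations in image classification.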

Details Motivation: Vision Transformers (ViTs) lack the translation invariance of convolutional networks and are more sensitive to small image translations, which limits their performance. Earlier work showed that convolutional networks also suffer from aliasing, but the issue has not been addressed effectively in ViTs, so a ViT architecture with translation invariance is needed. Method: Proposes the Alias-Free ViT with two key components: alias-free downsampling and nonlinearities, and a linear cross-covariance attention mechanism that is shift-equivariant to both integer and fractional translations, yielding a shift-invariant global representation. Result: The model remains competitive in image classification and outperforms similar-sized models under adversarial translations, substantially improving translation robustness. Conclusion: With alias-free design and equivariant attention, ViTs can achieve better translation invariance, showing that incorporating inductive biases helps improve the robustness and performance of Transformers on vision tasks. Abstract: Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets' translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation. Our model maintains competitive performance in image classification and outperforms similar-sized models in terms of robustness to adversarial translations.
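
The "fractional translations" in the Alias-Free ViT entry can be made concrete with a 1-D sketch. For a periodic signal, a shift by a non-integer number of samples can be defined in the Fourier domain by multiplying each frequency by a phase ramp. The code below is an illustration of that definition only, using a naive DFT and an odd-length signal to avoid the Nyquist-bin ambiguity; it is not the paper's attention mechanism, and all names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (fine for short signals)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def fractional_shift(signal, delta):
    """Circularly shift an odd-length real signal by delta samples."""
    n = len(signal)
    spectrum = dft(signal)
    shifted = []
    for k in range(n):
        freq = k if k <= n // 2 else k - n  # signed frequency index
        shifted.append(spectrum[k] * cmath.exp(-2j * math.pi * freq * delta / n))
    return [value.real for value in idft(shifted)]


# Two half-sample shifts should equal one whole-sample circular shift.
x = [0.0, 1.0, 4.0, 2.0, 7.0, 3.0, 5.0, 1.0, 0.5]
half = fractional_shift(x, 0.5)
full = fractional_shift(half, 0.5)
```

Applying the half-sample shift twice reproduces the signal moved by exactly one sample, which is the consistency property that makes fractional shifts well defined. A layer is shift-equivariant in this sense when shifting its input by `delta` shifts its output by the same `delta`.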

[232] DAMap: Distance-aware MapNet for High Quality HD Map Construction

Jinpeng Dong,Chen Li,Yutong Lin,Jingwen Fu,Sanping Zhou,Nanning Zheng

Main category: cs.CV

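TL;DR: Proposes DAMap, a new HD map construction method that introduces a Distance-aware Focal Loss, a Hybrid Loss Scheme, and Task Modulated Deformable Attention to address the task misalignment that hampers high-quality predictions in current methods, with performance gains on multiple benchmarks.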

Details Motivation: Existing HD map prediction methods perform poorly on high-quality predictions because of inappropriate label assignment and insufficient feature extraction, which together cause task misalignment. Method: Proposes DAMap with three core components: the Distance-aware Focal Loss (DAFL), which assigns appropriate classification labels; Task Modulated Deformable Attention (TMDA), which yields discriminative task-specific features; and the Hybrid Loss Scheme (HLS), which makes better use of the DAFL. Result: On the NuScenes and Argoverse2 datasets, DAMap improves performance consistently across different metrics, baselines, splits, backbones, and training schedules. Conclusion: DAMap mitigates task misalignment in HD map prediction, markedly improves classification and localization quality, and generalizes well. Abstract: Predicting High-definition (HD) map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods perform poorly in high quality predictions due to inherent task misalignment. Two main factors are responsible for misalignment: 1) inappropriate task labels due to one-to-many matching queries sharing the same labels, and 2) sub-optimal task features due to task-shared sampling mechanism. In this paper, we reveal two inherent defects in current methods and develop a novel HD map construction method named DAMap to address these problems. Specifically, DAMap consists of three components: Distance-aware Focal Loss (DAFL), Hybrid Loss Scheme (HLS), and Task Modulated Deformable Attention (TMDA). The DAFL is introduced to assign appropriate classification labels for one-to-many matching samples. The TMDA is proposed to obtain discriminative task-specific features. Furthermore, the HLS is proposed to better utilize the advantages of the DAFL. We perform extensive experiments and consistently achieve performance improvement on the NuScenes and Argoverse2 benchmarks under different metrics, baselines, splits, backbones, and schedules. Code will be available at https://github.com/jpdong-xjtu/DAMap.

[233] Estimation of Fireproof Structure Class and Construction Year for Disaster Risk Assessment

Hibiki Ayabe,Kazushi Okamoto,Koki Karube,Atsushi Shibata,Kei Harada

Main category: cs.CV

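TL;DR: Proposes a multi-task learning model that predicts the construction year, structure type, and property type of Japanese residential buildings from facade images and derives the structural fireproof class (H/T/M) from them, showing high accuracy and robustness on a large-scale dataset.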

Details Motivation: Key metadata for many buildings in Japan, such as construction year and structure type, is missing or outdated, especially in the second-hand housing market, which makes accurate fire risk assessment and insurance pricing difficult; a scalable way to infer these attributes automatically is therefore needed. Method: Uses a multi-task learning model that jointly predicts construction year, structure type, and property type from facade images, then derives the fireproof class (H/T/M) through a rule-based mapping based on official insurance criteria. The model is trained and evaluated on a large-scale dataset of Japanese residential images with rigorous filtering and deduplication. Result: The model achieves high accuracy in construction-year regression and robust classification despite class imbalance; qualitative analysis shows that it captures visual cues related to building age and materials. Conclusion: The approach demonstrates the feasibility of scalable, interpretable, image-based risk profiling, with potential applications in insurance pricing, urban planning, and disaster preparedness. Abstract: Structural fireproof classification is vital for disaster risk assessment and insurance pricing in Japan. However, key building metadata such as construction year and structure type are often missing or outdated, particularly in the second-hand housing market. This study proposes a multi-task learning model that predicts these attributes from facade images. The model jointly estimates the construction year, building structure, and property type, from which the structural fireproof class - defined as H (non-fireproof), T (semi-fireproof), or M (fireproof) - is derived via a rule-based mapping based on official insurance criteria. We trained and evaluated the model using a large-scale dataset of Japanese residential images, applying rigorous filtering and deduplication. The model achieved high accuracy in construction-year regression and robust classification across imbalanced categories. Qualitative analyses show that it captures visual cues related to building age and materials. Our approach demonstrates the feasibility of scalable, interpretable, image-based risk-profiling systems, offering potential applications in insurance, urban planning, and disaster preparedness.

[234] RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance

Jiuniu Wang,Gongjie Zhang,Quanhao Qian,Junlong Gao,Deli Zhao,Ran Xu

Main category: cs.CV

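TL;DR: This paper proposes RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals, and builds RoboDraw, a dataset of one million examples, to support it.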

Details Motivation: SVGs are essential in digital design and robot control, but there is no unified framework that uses multiple input signals together to generate high-quality interactive SVGs. Method: RoboSVG first produces multimodal guidance, synthesizes candidate SVGs through dedicated generation modules, and then refines them under numerical guidance. The RoboDraw dataset supports four tasks: Text-to-SVG, Image-to-SVG, PartialSVG-to-SVG, and PartialImage-to-SVG. Result: Experiments show that RoboSVG achieves superior query compliance and visual fidelity across tasks and outperforms existing methods. Conclusion: Together with the large-scale RoboDraw dataset, RoboSVG advances multimodal SVG generation and offers an effective solution for interactive vector graphics generation. Abstract: Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introduce RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals. Given an input query, the RoboSVG model first produces multimodal guidance, then synthesizes candidate SVGs through dedicated generation modules, and finally refines them under numerical guidance to yield high-quality outputs. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples, each pairing an SVG generation condition (e.g., text, image, and partial SVG) with its corresponding ground-truth SVG code. The RoboDraw dataset enables systematic study of four tasks, including basic generation (Text-to-SVG, Image-to-SVG) and interactive generation (PartialSVG-to-SVG, PartialImage-to-SVG). Extensive experiments demonstrate that RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation. The dataset and source code of this project will be publicly available soon.

[235] Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation

Shu Zhao,Tianyi Shen,Nilesh Ahuja,Omesh Tickoo,Vijaykrishnan Narayanan

Main category: cs.CV

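TL;DR: Proposes Windsock and DANCE to address dynamic retrieval, modality selection, and the use of retrieved information in multimodal retrieval-augmented generation, improving generation quality while reducing retrieval overhead.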

Details Motivation: Existing MRAG methods suffer from static retrieval strategies, inflexible modality selection, and poor use of retrieved information, so they struggle to decide when to retrieve, which modality to use, and how to exploit the retrieved content. Method: Introduces Windsock, a query-dependent module that decides whether to retrieve and which modality to select, and DANCE instruction tuning, which improves robustness to noise and the use of retrieved information; a self-assessment approach is used to build MRAG training data. Result: Experiments show a 17.07% improvement in generation quality together with an 8.95% reduction in retrieval times. Conclusion: Windsock and DANCE address key challenges in MRAG, improving the response accuracy of multimodal large language models while lowering computational overhead, and point toward more efficient MRAG systems. Abstract: Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves the generation quality by 17.07% while reducing retrieval times by 8.95%.

[236] VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree

Wenlong Li,Yifei Xu,Yuan Rao,Zhenhua Wang,Shuiguang Deng

Main category: cs.CV

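TL;DR: Proposes VADTree, a training-free video anomaly detection method built on a Hierarchical Granularity-aware Tree (HGTree), which combines a pre-trained generic event boundary detection model with collaborative reasoning by vision-language models and large language models to detect and explain anomalies at multiple granularities.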

Details Motivation: Fixed-length temporal sampling struggles to capture anomalous events of varying duration; supervised methods depend on large amounts of labeled data and lack explainability; existing training-free methods fall short in flexible sampling and fine-grained analysis. Method: Builds a Hierarchical Granularity-aware Tree (HGTree): a GEBD model extracts potential event boundaries and the video is decomposed into event nodes; adaptive coarse-fine hierarchical structuring and redundancy removal construct the tree; multi-dimensional priors are injected so that VLMs perform node-wise anomaly perception and LLMs perform anomaly reasoning; finally, an inter-cluster node correlation method fuses the multi-granularity anomaly scores. Result: Achieves state-of-the-art performance in the training-free setting on three challenging datasets, drastically reduces the number of sampled video segments, and provides explainable analysis of anomalies. Conclusion: By introducing the HGTree structure and collaborative reasoning with multimodal large models, VADTree addresses the limits of earlier methods in temporal flexibility and explainability and offers an efficient new framework for training-free video anomaly detection. Abstract: Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
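The abstract says the video is decomposed into event nodes "based on boundary confidence" and then organized coarse-to-fine. As a rough illustration only, the sketch below cuts a sequence of per-frame boundary confidences at two thresholds to obtain a coarse and a fine level. The function name, the thresholds, and the two-threshold scheme are assumptions made for this example, not the authors' HGTree construction.

```python
from typing import List, Tuple


def split_by_boundaries(confidence: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Cut frame indices [0, n) into half-open segments.

    A new segment starts at every frame whose boundary confidence is at
    least `threshold` (hypothetical rule, chosen for illustration).
    """
    segments = []
    start = 0
    for i, c in enumerate(confidence):
        if i > start and c >= threshold:
            segments.append((start, i))
            start = i
    if confidence:
        segments.append((start, len(confidence)))
    return segments


# Made-up per-frame boundary confidences for an 8-frame clip.
conf = [0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.7, 0.1]

coarse = split_by_boundaries(conf, threshold=0.85)  # few long segments
fine = split_by_boundaries(conf, threshold=0.6)     # more short segments
print(coarse)
print(fine)
```

A higher threshold keeps only the most confident boundaries and so yields longer segments; lowering it refines them, which is one simple way to get segments of varying temporal span instead of fixed-length windows.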

[237] M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark

Huixuan Zhang,Xiaojun Wan

Main category: cs.CV

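TL;DR: This paper introduces M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark for text-to-image alignment, together with $AlignScore$, a metric that agrees closely with human evaluation. Current open-source models perform poorly on the benchmark, and a training-free post-editing method, Revise-Then-Enforce, is proposed to improve alignment.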

Details Motivation: Existing evaluations of text-to-image alignment either cover only simple scenarios, in particular overlooking complex prompts with multiple instances of the same category, or use metrics that correlate poorly with human judgment; a more comprehensive benchmark and a metric closer to human judgment are needed. Method: Builds the M$^3$T2IBench benchmark with the object-detection-based $AlignScore$ as its metric, and proposes Revise-Then-Enforce, which first revises and then enforces to improve the alignment between generated images and text. Result: Current open-source text-to-image models perform poorly on M$^3$T2IBench; Revise-Then-Enforce improves image-text alignment across a range of diffusion models. Conclusion: M$^3$T2IBench and $AlignScore$ offer a more challenging and reliable evaluation of text-to-image alignment, and the training-free Revise-Then-Enforce method improves how well generated images match their prompts. Abstract: Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. Our code and data have been released in supplementary material and will be made publicly available after the paper is accepted.

[238] UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization

Huixuan Zhang,Xiaojun Wan

Main category: cs.CV

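TL;DR: This paper introduces UniAIDet, a unified and comprehensive benchmark for detecting AI-generated images that covers both photographic and artistic images and a wide range of generative models. The benchmark is used to evaluate the generalization of existing detection methods and the relation between detection and localization.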

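Details Motivation: Existing benchmarks for AI-generated image detection cover a limited set of generative models and image types, and in particular lack end-to-end image editing and artistic images, which limits the evaluation of detection methods and slows research. Method: Builds UniAIDet, a comprehensive benchmark spanning multiple kinds of generative models (text-to-image, image-to-image, inpainting, image editing, deepfake) and image types (photographic and artistic), then systematically evaluates a range of detection methods and analyzes their generalization and the relation between detection and localization. Result: UniAIDet substantially extends the coverage of existing benchmarks; experiments reveal that current detection methods generalize poorly across models and categories and that detection and localization are related to some degree. Conclusion: UniAIDet provides a more comprehensive and realistic evaluation platform for AI-generated image detection and can help advance and standardize future detection techniques. Abstract: With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecting AI-generated content, current benchmarks are limited in their coverage of diverse generative models and image categories, often overlooking end-to-end image editing and artistic images. To address these limitations, we introduce UniAIDet, a unified and comprehensive benchmark that includes both photographic and artistic images. UniAIDet covers a wide range of generative models, including text-to-image, image-to-image, image inpainting, image editing, and deepfake models. Using UniAIDet, we conduct a comprehensive evaluation of various detection methods and answer three key research questions regarding generalization capability and the relation between detection and localization. Our benchmark and analysis provide a robust foundation for future research.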

[239] WaveMAE: Wavelet decomposition Masked Auto-Encoder for Remote Sensing

Vittorio Bernuzzi,Leonardo Rossi,Tomaso Fontanini,Massimo Bertozzi,Andrea Prati

Main category: cs.CV

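TL;DR: This paper proposes WaveMAE, a masked autoencoding framework for multispectral satellite imagery that uses the discrete wavelet transform and geo-conditioned positional encoding to improve self-supervised learning, and performs well on multiple downstream tasks.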

Details Motivation: Annotated data is scarce in remote sensing, which limits fully supervised methods, so effective self-supervised learning is needed to build foundation models. Method: Proposes the WaveMAE framework, which uses a multi-level Discrete Wavelet Transform to separate frequency components and a Geo-conditioned Positional Encoding (GPE) based on spherical harmonics; the model is pretrained on fMoW-S2 and systematically evaluated on the PANGAEA benchmark. Result: WaveMAE clearly outperforms prior state-of-the-art methods on tasks such as semantic segmentation and regression, and a lightweight variant with only 26.4% of the parameters also reaches state-of-the-art performance. Conclusion: WaveMAE is a strong, geographically informed foundation model for multispectral remote sensing imagery that improves self-supervised representation learning. Abstract: Self-supervised learning (SSL) has recently emerged as a key strategy for building foundation models in remote sensing, where the scarcity of annotated data limits the applicability of fully supervised approaches. In this work, we introduce WaveMAE, a masked autoencoding framework tailored for multispectral satellite imagery. Unlike conventional pixel-based reconstruction, WaveMAE leverages a multi-level Discrete Wavelet Transform (DWT) to disentangle frequency components and guide the encoder toward learning scale-aware high-frequency representations. We further propose a Geo-conditioned Positional Encoding (GPE), which incorporates geographical priors via Spherical Harmonics, encouraging embeddings that respect both semantic and geospatial structure. To ensure fairness in evaluation, all methods are pretrained on the same dataset (fMoW-S2) and systematically evaluated on the diverse downstream tasks of the PANGAEA benchmark, spanning semantic segmentation, regression, change detection, and multilabel classification. Extensive experiments demonstrate that WaveMAE achieves consistent improvements over prior state-of-the-art approaches, with substantial gains on segmentation and regression benchmarks. The effectiveness of WaveMAE pretraining is further demonstrated by showing that even a lightweight variant, containing only 26.4% of the parameters, achieves state-of-the-art performance. Our results establish WaveMAE as a strong and geographically informed foundation model for multispectral remote sensing imagery.
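To make "multi-level Discrete Wavelet Transform" concrete, here is a minimal sketch of a multi-level DWT on a 1D signal. It assumes the Haar wavelet and a 1D input for brevity; the abstract does not say which wavelet WaveMAE uses, and the paper works on 2D multispectral images, so this only illustrates the general technique. The function names are made up for the example.

```python
import math
from typing import List, Tuple


def haar_step(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar DWT: pairwise sums and differences."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal: List[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Multi-level Haar DWT.

    Returns the final low-frequency approximation and the high-frequency
    detail coefficients of each level (finest level first).
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2 != 0:
            raise ValueError("signal length must be divisible by 2**levels")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


approx, details = haar_dwt([4.0, 2.0, 6.0, 8.0], levels=2)
print(approx)   # coarse, low-frequency content
print(details)  # per-level high-frequency content
```

Each level halves the resolution and splits off a band of detail coefficients, which is the sense in which the transform separates frequency components by scale. Because the orthonormal Haar transform preserves energy, the sum of squared coefficients equals that of the input.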

[240] IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

Hao Li,Zhengyu Zou,Fangfu Liu,Xuanyang Zhang,Fangzhou Hong,Yukang Cao,Yushi Lan,Manyuan Zhang,Gang Yu,Dingwen Zhang,Ziwei Liu

Main category: cs.CV

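TL;DR: Proposes the Instance-Grounded Geometry Transformer (IGGT), which uses 3D-consistent contrastive learning to unify spatial reconstruction with instance-level semantic understanding, achieving consistent 3D scene reconstruction and instance separation from 2D visual inputs alone.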

Details Motivation: Existing methods either separate 3D geometric reconstruction from high-level semantic understanding or depend on alignment with a specific language model, which limits generalization and task adaptability and ignores the synergy between the two. Method: Designs IGGT, an end-to-end unified transformer, and introduces a 3D-Consistent Contrastive Learning strategy that learns a unified representation of geometric structure and instance clustering from 2D visual inputs; builds InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance mask annotations. Result: IGGT supports consistent lifting of 2D inputs into a 3D scene with explicitly distinct object instances and generalizes better on downstream 3D understanding tasks. Conclusion: IGGT fuses low-level geometric reconstruction with high-level instance semantics, supports the case for unified modeling, and offers a more generalizable solution for 3D scene understanding. Abstract: Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.

[241] LRW-Persian: Lip-reading in the Wild Dataset for Persian Language

Zahra Taghizadeh,Mohammad Shahverdikondori,Arian Noori,Alireza Dadgarnia

Main category: cs.CV

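TL;DR: This paper introduces LRW-Persian, currently the largest Persian word-level lipreading dataset, with 743 target words and more than 414,000 video clips from 67 television programs. The dataset has speaker-disjoint training and test splits, broad regional and dialectal coverage, and rich metadata. A fully automated end-to-end pipeline, using ASR, active-speaker localization, and other filtering steps, ensures data quality. The authors fine-tune two mainstream lipreading models to establish reference performance, showing how challenging Persian visual speech recognition is. The dataset fills a gap for low-resource languages in lipreading and supports cross-lingual transfer and multimodal speech research.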

Details Motivation: Resources for non-English visual speech recognition are limited, and low-resource languages in particular lack large, high-quality lipreading datasets, which holds back research; a large, standardized Persian lipreading dataset is therefore needed. Method: Presents the LRW-Persian dataset, built with a fully automated end-to-end curation pipeline covering ASR-based transcription, active-speaker localization, quality filtering, and pose/mask screening. The dataset is benchmark-ready, with speaker-disjoint training and test splits and metadata such as head pose, age, and gender. Two widely used lipreading architectures are fine-tuned on it to establish reference performance. Result: LRW-Persian is the largest in-the-wild Persian word-level lipreading dataset, with 743 words and more than 414,000 clips drawn from over 1,900 hours of television footage. The automated pipeline maintains quality at scale. Fine-tuning results on two mainstream models show that Persian visual speech recognition is difficult and provide a baseline for later work. Conclusion: LRW-Persian fills a key gap in visual speech recognition for low-resource languages: it provides a large-scale benchmark for Persian lipreading and supports cross-lingual transfer research and multimodal speech systems for underrepresented languages. Abstract: Lipreading has emerged as an increasingly important research area for developing robust speech recognition systems and assistive technologies for the hearing-impaired. However, non-English resources for visual speech recognition remain limited. We introduce LRW-Persian, the largest in-the-wild Persian word-level lipreading dataset, comprising 743 target words and over 414,000 video samples extracted from more than 1,900 hours of footage across 67 television programs. Designed as a benchmark-ready resource, LRW-Persian provides speaker-disjoint training and test splits, wide regional and dialectal coverage, and rich per-clip metadata including head pose, age, and gender. To ensure large-scale data quality, we establish a fully automated end-to-end curation pipeline encompassing transcription based on Automatic Speech Recognition (ASR), active-speaker localization, quality filtering, and pose/mask screening. We further fine-tune two widely used lipreading architectures on LRW-Persian, establishing reference performance and demonstrating the difficulty of Persian visual speech recognition. By filling a critical gap in low-resource languages, LRW-Persian enables rigorous benchmarking, supports cross-lingual transfer, and provides a foundation for advancing multimodal speech research in underrepresented linguistic contexts. The dataset is publicly available at: https://lrw-persian.vercel.app.

[242] Cross-view Localization and Synthesis -- Datasets, Challenges and Opportunities

Ningli Xu,Rongjun Qin

Main category: cs.CV

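TL;DR: This paper surveys recent progress in cross-view localization and synthesis, covering common datasets, key technical challenges, and state-of-the-art methods, and discusses the limitations of existing methods and directions for future research.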

Details Motivation: Cross-view localization and synthesis support applications such as autonomous driving, urban planning, and augmented reality, but remain challenging because of large differences in viewpoint, resolution, and occlusion. Method: Cross-view localization is formulated as an image retrieval problem, using CNNs or ViTs for cross-view feature embedding; cross-view synthesis uses GANs or diffusion models to generate ground-level views. Result: The survey systematically reviews the mainstream datasets, techniques, and performance in the field, provides comparative analyses, and maintains an open project page listing recent methods. Conclusion: Cross-view visual understanding has made substantial progress in recent years; future work needs to further address modality gaps, improve generalization, and explore more efficient fusion and generation mechanisms. Abstract: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are widely embedded in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem where ground-level features are matched with features of tiled overhead images, extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views based on information from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. We also include the project page via https://github.com/GDAOSU/Awesome-Cross-View-Methods.
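The retrieval formulation described in the abstract can be sketched in a few lines: embed the ground-level query and every overhead tile, then rank tiles by similarity. The example below assumes the embeddings already exist and uses cosine similarity as the matching score; the tile names and vectors are invented, and a real system would obtain them from a CNN or ViT encoder.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: List[float], tiles: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the names of the k overhead tiles most similar to the query."""
    ranked = sorted(tiles, key=lambda name: cosine(query, tiles[name]), reverse=True)
    return ranked[:k]


# Hypothetical overhead-tile embeddings keyed by tile name.
tiles = {
    "tile_a": [1.0, 0.0, 0.0],
    "tile_b": [0.0, 1.0, 0.0],
    "tile_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]  # hypothetical ground-level image embedding

top = retrieve(query, tiles, k=2)
print(top)
```

The location of the best-matching tile then serves as the position estimate for the ground-level image, so localization accuracy is bounded by how finely the overhead imagery is tiled.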

[243] ConMatFormer: A Multi-attention and Transformer Integrated ConvNext based Deep Learning Model for Enhanced Diabetic Foot Ulcer Classification

Raihan Ahamed Rifat,Fuyad Hasan Bhoyan,Md Humaion Kabir Mehedi,Md Kaviul Hossain,Md. Jakir Hossen,M. F. Mridha

Main category: cs.CV

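TL;DR: This paper proposes ConMatFormer, a hybrid deep learning architecture for diabetic foot ulcer (DFU) detection that combines ConvNeXt, multiple attention mechanisms, and transformer modules, improving both classification accuracy and model interpretability.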

Details Motivation: Publicly available DFU datasets are scarce and highly variable, so conventional methods struggle to detect DFUs accurately; the paper aims to improve detection by combining local feature extraction with global context modeling. Method: Proposes ConMatFormer, which uses ConvNeXt blocks to extract local features, adds the CBAM and DANet attention mechanisms, and integrates a transformer module to strengthen long-range dependencies; data augmentation mitigates class imbalance, and XAI methods (Grad-CAM, Grad-CAM++, and LIME) improve interpretability. Result: On the DS1 (DFUC2021) and DS2 datasets, ConMatFormer outperforms existing CNN and ViT models, reaching an accuracy of 0.8961 and a precision of 0.9160 in a single experiment, and an accuracy of 0.9755 with a standard deviation of only 0.0031 under 4-fold cross-validation. XAI analysis indicates that the model's decisions are transparent and trustworthy. Conclusion: ConMatFormer performs well on DFU classification, sets a new performance benchmark, and provides an interpretable hybrid attention-transformer framework for medical image analysis. Abstract: Diabetic foot ulcer (DFU) detection is a clinically significant yet challenging task due to the scarcity and variability of publicly available datasets. To solve these problems, we propose ConMatFormer, a new hybrid deep learning architecture that combines ConvNeXt blocks, multiple attention mechanisms (the convolutional block attention module (CBAM) and the dual attention network (DANet)), and transformer modules in a way that works together. This design facilitates the extraction of better local features and understanding of the global context, which allows us to model small skin patterns across different types of DFU very accurately. To address the class imbalance, we used data augmentation methods. A ConvNeXt block was used to obtain detailed local features in the initial stages. Subsequently, we compiled the model by adding a transformer module to enhance long-range dependency. This enabled us to pinpoint the DFU classes that were underrepresented or constituted minorities. Tests on the DS1 (DFUC2021) and DS2 (diabetic foot ulcer (DFU)) datasets showed that ConMatFormer outperformed state-of-the-art (SOTA) convolutional neural network (CNN) and Vision Transformer (ViT) models in terms of accuracy, reliability, and flexibility. The proposed method achieved an accuracy of 0.8961 and a precision of 0.9160 in a single experiment, which is a significant improvement over the current standards for classifying DFUs. In addition, by 4-fold cross-validation, the proposed model achieved an accuracy of 0.9755 with a standard deviation of only 0.0031. We further applied explainable artificial intelligence (XAI) methods, such as Grad-CAM, Grad-CAM++, and LIME, to consistently monitor the transparency and trustworthiness of the decision-making process. Our findings set a new benchmark for DFU classification and provide a hybrid attention transformer framework for medical image analysis.

[244] Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models

Jiaxiang Liu,Jiawei Du,Xiao Liu,Prayag Tiwari,Mingkun Xu

Main category: cs.CV

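TL;DR: Proposes Self-Calibrated Consistency (SCC), a test-time defense that uses semantic consistency and spatial consistency modules to improve the zero-shot adversarial robustness of vision-language models such as CLIP, validated on 22 benchmarks.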

Details Motivation: Existing defenses against adversarial attacks on CLIP rely on fine-tuning with labeled data and are hard to apply in zero-shot settings; current attacks also have two weaknesses, a lack of semantic guidance and sensitivity to view changes. Method: Designs the SCC framework with two modules: 1) a semantic consistency module, which uses soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment; 2) a spatial consistency module, which aligns perturbed visual predictions through augmented views to stabilize inference. Result: On 22 benchmarks under diverse attack settings, SCC consistently improves the zero-shot adversarial robustness of CLIP while maintaining accuracy, and it integrates with other vision-language models (such as BioMedCLIP) for further gains. Conclusion: SCC offers a plug-and-play, efficient test-time defense for CLIP-family models and suggests a new paradigm for building adversarially robust vision-language models. Abstract: Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks -- lack of semantic guidance and vulnerability to view variations -- collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the target embedding from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations. Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader vision-language domains such as BioMedCLIP.

[245] MedXplain-VQA: Multi-Component Explainable Medical Visual Question Answering

Hai-Dang Nguyen,Minh-Anh Dang,Minh-Tan Le,Minh-Tuan Le

Main category: cs.CV

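TL;DR: MedXplain-VQA is a medical visual question answering framework that integrates five explainable AI components; with enhanced Grad-CAM, query reformulation, and multimodal chain-of-thought reasoning, it improves explainability and diagnostic trustworthiness.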

Details Motivation: Physicians need a transparent reasoning process before they can trust AI-generated diagnoses, so explainability is essential in medical VQA systems. Method: Builds on a fine-tuned BLIP-2 model and combines medical query reformulation, enhanced Grad-CAM attention, precise region extraction, and structured chain-of-thought reasoning via multi-modal language models. Result: On 500 PathVQA histopathology samples, the composite score rises from a baseline of 0.378 to 0.683 with a reasoning confidence of 0.890; the system identifies 3-5 diagnostically relevant regions per sample and generates structured explanations averaging 57 words. Conclusion: MedXplain-VQA shows potential as a reliable, explainable medical VQA system; future work will validate it with medical experts and large-scale clinical data to move toward clinical use. Abstract: Explainability is critical for the clinical adoption of medical visual question answering (VQA) systems, as physicians require transparent reasoning to trust AI-generated diagnoses. We present MedXplain-VQA, a comprehensive framework integrating five explainable AI components to deliver interpretable medical image analysis. The framework leverages a fine-tuned BLIP-2 backbone, medical query reformulation, enhanced Grad-CAM attention, precise region extraction, and structured chain-of-thought reasoning via multi-modal language models. To evaluate the system, we introduce a medical-domain-specific framework replacing traditional NLP metrics with clinically relevant assessments, including terminology coverage, clinical structure quality, and attention region relevance. Experiments on 500 PathVQA histopathology samples demonstrate substantial improvements, with the enhanced system achieving a composite score of 0.683 compared to 0.378 for baseline methods, while maintaining high reasoning confidence (0.890). Our system identifies 3-5 diagnostically relevant regions per sample and generates structured explanations averaging 57 words with appropriate clinical terminology. Ablation studies reveal that query reformulation provides the most significant initial improvement, while chain-of-thought reasoning enables systematic diagnostic processes. These findings underscore the potential of MedXplain-VQA as a robust, explainable medical VQA system. Future work will focus on validation with medical experts and large-scale clinical datasets to ensure clinical readiness.

[246] MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control

Fatemeh Nazarieh,Zhenhua Feng,Diptesh Kanojia,Muhammad Awais,Josef Kittler

Main category: cs.CV

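TL;DR: MAGIC-Talk is a one-shot, diffusion-based framework for customizable talking face generation; ReferenceNet and AnimateNet provide identity preservation, fine-grained editing, and motion coherence, and a progressive latent fusion strategy improves long-video quality.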

Details Motivation: Existing audio-driven talking face generation methods suffer from temporal inconsistency and have difficulty with identity preservation and customization in long generation. Method: Proposes the MAGIC-Talk framework, consisting of ReferenceNet for identity preservation and text-guided editing, and AnimateNet, which uses structured motion priors to improve motion coherence; a progressive latent fusion strategy reduces motion inconsistencies and flickering. Result: Experiments show that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy. Conclusion: MAGIC-Talk offers a robust solution for customizable, temporally stable talking face generation, especially for long videos. Abstract: Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.

[247] FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment

Zahraa Al Sahili,Maryam Fetanat,Maimuna Nowaz,Ioannis Patras,Matthew Purver

Main category: cs.CV

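TL;DR: This paper proposes FairJudge, a lightweight evaluation protocol that uses multimodal LLMs as fair judges and scores text-to-image systems on fairness and prompt alignment with an explanation-oriented rubric, giving more accountable and reliable evaluation than conventional methods.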

Details Motivation: Text-to-image systems lack reproducible, fair evaluation; common methods rely on surface cues and cannot handle weakly visible social attributes (such as religion, culture, and disability), so a fairer and more transparent evaluation mechanism is needed. Method: FairJudge treats instruction-following multimodal LLMs as judges, uses an explanation-oriented rubric mapped to [-1, 1], constrains judgments to a closed label set, requires evidence grounded in visible image content, and abstains when cues are insufficient, which makes evaluations traceable and evidence-based. Result: Across multiple datasets, FairJudge outperforms CLIP and face-centric baselines on demographic prediction and profession correctness and improves mean alignment scores; the authors also release DIVERSIFY, a new dataset of 469 diverse, non-iconic scenes. Conclusion: FairJudge provides accountable, evidence-aware evaluation that improves fairness assessment of text-to-image models on social attributes and supports more reliable, reproducible fairness audits. Abstract: Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies -- face classifiers and contrastive similarity -- reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
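The four constraints named in the abstract (score in [-1, 1], closed label set, grounded evidence, mandatory abstention) can be read as a post-processing check on a judge model's structured output. The sketch below is one possible reading under stated assumptions: the field names (`label`, `score`, `evidence`), the label set, and the abstention rule are all hypothetical and are not taken from the FairJudge protocol itself.

```python
from typing import Any, Dict

# Hypothetical closed label set for a profession-correctness check.
ALLOWED_LABELS = {"doctor", "teacher", "chef"}


def parse_judgment(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Validate one judge response; abstain unless it is fully usable.

    Assumed rules (illustrative only): the label must be in the closed set,
    the evidence string must be non-empty, and the score must be numeric.
    Scores are clipped into [-1, 1].
    """
    label = raw.get("label")
    evidence = str(raw.get("evidence") or "").strip()
    score = raw.get("score")
    valid_score = isinstance(score, (int, float)) and not isinstance(score, bool)
    if label not in ALLOWED_LABELS or not evidence or not valid_score:
        return {"label": "abstain", "score": None, "evidence": evidence}
    clipped = max(-1.0, min(1.0, float(score)))
    return {"label": label, "score": clipped, "evidence": evidence}


ok = parse_judgment({"label": "chef", "score": 1.4, "evidence": "white hat, kitchen"})
no_evidence = parse_judgment({"label": "chef", "score": 0.8, "evidence": ""})
off_list = parse_judgment({"label": "pilot", "score": 0.9, "evidence": "uniform"})
print(ok, no_evidence, off_list)
```

Treating a missing evidence string as grounds for abstention is what makes each decision auditable: every non-abstained score comes with the visible cue that justified it.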

[248] LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction

Aleksandar Pramov

Main category: cs.CV

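TL;DR: This paper proposes a multimodal fusion system built on a Gemma-3 LLM to predict the commercial memorability of ads; the model is adapted with LoRA, fuses ViT and E5 features, and is guided by LLM-generated rationale prompts, and it proves more robust and generalizes better than a gradient boosted tree baseline.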

Details Motivation: Addresses the ad memorability prediction task of the MediaEval 2025 competition and explores the potential of multimodal fusion and large language models for memorability modeling. Method: Uses a Gemma-3 LLM backbone that fuses pre-computed visual (ViT) and textual (E5) features through multi-modal projections and is adapted with Low-Rank Adaptation (LoRA); LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, guide the fusion. Result: The LLM-based fusion system shows better robustness and generalization on the final test set than a strong gradient boosted tree baseline. Conclusion: Combining LLM-generated rationale prompts with multimodal feature fusion improves ad memorability prediction and supports the usefulness of large language models for this task. Abstract: This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features by multi-modal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily-tuned ensemble of gradient boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set, compared to the baseline. The paper's codebase can be found at https://github.com/dsgt-arc/mediaeval-2025-memorability
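For readers unfamiliar with Low-Rank Adaptation, the core idea is that a frozen weight matrix W is left untouched and a trainable low-rank product B·A is added to it, scaled by alpha / r. The sketch below shows that forward pass for a single linear layer in plain Python. The tiny matrices and the `alpha` value are made up for illustration and say nothing about how the paper configures LoRA on Gemma-3.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in weight; A (r x d_in) and B (d_out x r)
    are the only trainable parameters, with rank r = len(A).
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (illustrative values)
A = [[1.0, 1.0]]               # rank r = 1
B_zero = [[0.0], [0.0]]        # B starts at zero in the standard LoRA setup
B_trained = [[0.5], [-0.5]]    # pretend values after some training
x = [2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0)
y_adapted = lora_forward(W, A, B_trained, x, alpha=2.0)
print(y_init, y_adapted)
```

Because B is initialized to zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the small matrices A and B rather than the full weight.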

[249] Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models

Aya Nakayama,Brian Wong,Yuji Nishimura,Kaito Tanaka

Main category: cs.CV

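TL;DR: Proposes SP-CSVR, a new framework that improves the semantic understanding and cross-style reasoning of large vision-language models across different visual styles.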

Details Motivation: Addresses the "style trap" that large vision-language models face across diverse visual styles and improves the robustness of their semantic understanding in in-context learning. Method: Designs the SP-CSVR framework, consisting of a Cross-Style Feature Encoder (CSFE), a Semantic-Aligned In-Context Decoder (SAICD), and an Adaptive Semantic Consistency Module (ASCM); it achieves stable semantic understanding through style-content disentanglement, few-shot style adaptation, and multi-task contrastive learning. Result: Experiments on a multi-style dataset show that SP-CSVR reaches state-of-the-art performance on visual captioning, visual question answering, and in-context style adaptation, with advantages in robustness, generalization, and efficiency. Conclusion: SP-CSVR handles the style trap effectively and improves the semantic understanding and cross-style reasoning of large vision-language models across diverse visual styles. Abstract: The "style trap" poses a significant challenge for Large Vision-Language Models (LVLMs), hindering robust semantic understanding across diverse visual styles, especially in in-context learning (ICL). Existing methods often fail to effectively decouple style from content, hindering generalization. To address this, we propose the Semantic-Preserving Cross-Style Visual Reasoner (SP-CSVR), a novel framework for stable semantic understanding and adaptive cross-style visual reasoning. SP-CSVR integrates a Cross-Style Feature Encoder (CSFE) for style-content disentanglement, a Semantic-Aligned In-Context Decoder (SAICD) for efficient few-shot style adaptation, and an Adaptive Semantic Consistency Module (ASCM) employing multi-task contrastive learning to enforce cross-style semantic invariance. Extensive experiments on a challenging multi-style dataset demonstrate SP-CSVR's state-of-the-art performance across visual captioning, visual question answering, and in-context style adaptation. Comprehensive evaluations, including ablation studies and generalization analysis, confirm SP-CSVR's efficacy in enhancing robustness, generalization, and efficiency across diverse visual styles.

[250] FastJAM: a Fast Joint Alignment Model for Images

Omri Hirsch,Ron Shapira Weber,Shira Ifergane,Oren Freifeld

Main category: cs.CV

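TL;DR: FastJAM is a fast, graph-based method for joint image alignment. It builds a keypoint-relation graph from an off-the-shelf image matcher and nonparametric clustering, then uses a graph neural network to predict homography parameters efficiently, sharply reducing computational cost and outperforming existing methods on several benchmarks.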

Details Motivation: Existing joint image alignment methods typically need long training times, large-capacity models, and extensive hyperparameter tuning, so a faster and more efficient alternative is needed. Method: FastJAM takes pairwise matches from an off-the-shelf image matcher and combines them with fast nonparametric clustering to build a graph of intra- and inter-image keypoint relations. A graph neural network propagates and aggregates the correspondences and predicts per-image homography parameters via image-level pooling, optimized with an inverse-compositional loss that needs no regularization term. Result: Experiments on several benchmarks show that FastJAM achieves better alignment quality than existing modern methods while cutting computation time from hours or minutes to seconds. Conclusion: FastJAM is an efficient, fast joint alignment method that lowers computational cost and hyperparameter-tuning effort and is suited to large-scale image alignment. Abstract: Joint Alignment (JA) of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. Most existing approaches require long training times, large-capacity models, and extensive hyperparameter tuning. We introduce FastJAM, a rapid, graph-based method that drastically reduces the computational complexity of joint alignment tasks. FastJAM leverages pairwise matches computed by an off-the-shelf image matcher, together with a rapid nonparametric clustering, to construct a graph representing intra- and inter-image keypoint relations. A graph neural network propagates and aggregates these correspondences, efficiently predicting per-image homography parameters via image-level pooling. Utilizing an inverse-compositional loss that eliminates the need for a regularization term over the predicted transformations (and thus also obviates the hyperparameter tuning associated with such terms), FastJAM performs image JA quickly and effectively. Experimental results on several benchmarks demonstrate that FastJAM achieves results better than existing modern JA methods in terms of alignment quality, while reducing computation time from hours or minutes to mere seconds. Our code is available at our project webpage, https://bgu-cs-vil.github.io/FastJAM/
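To make the "per-image homography parameters" concrete, here is a minimal pure-Python sketch of how a 3x3 homography maps keypoints into a shared coordinate frame. The function name `apply_homography` and the sample matrix are illustrative assumptions; they are not taken from the FastJAM code, and the sketch does not cover how the parameters are predicted.

```python
def apply_homography(H, points):
    """Map (x, y) points through a 3x3 homography H given as nested lists.

    Assumes the projective scale w is nonzero for every point.
    """
    warped = []
    for x, y in points:
        xh = H[0][0] * x + H[0][1] * y + H[0][2]
        yh = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        warped.append((xh / w, yh / w))  # divide out the projective scale
    return warped


# Hypothetical example: a pure translation by (+2, -1).
H_shift = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
print(apply_homography(H_shift, [(0.0, 0.0), (3.0, 4.0)]))
```

In a joint alignment setting, each image would get its own matrix, and matched keypoints from different images should land close together after warping.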

[251] Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models

Lexiang Xiong,Chengyu Liu,Jingwen Ye,Yan Liu,Yuecong Xu

Main category: cs.CV

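TL;DR: Proposes Semantic Surgery, a training-free, zero-shot concept-erasure framework for text-to-image diffusion models that operates directly on text embeddings to dynamically remove harmful concepts while preserving generation quality.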

Details Motivation: Existing concept-erasure methods often degrade generation quality; harmful or sensitive concepts in text prompts need to be removed without lowering image quality. Method: Semantic Surgery acts at the text-embedding stage of the diffusion model, dynamically detecting target concepts and applying a calibrated vector subtraction. It includes a Co-Occurrence Encoding module for multi-concept erasure and a visual feedback loop to handle latent concept persistence. Result: The method significantly outperforms existing approaches on object, explicit content, artistic style, and multi-celebrity erasure, reaching a 93.58 H-score in object erasure, leaving only 1 explicit-content instance, and achieving 8.09 H_a in style erasure, with no loss in image quality. Conclusion: Semantic Surgery is an efficient, training-free concept-erasure method that achieves thorough erasure while preserving generation quality and locality, and it can also serve as a built-in threat detection system for safer text-to-image generation. Abstract: Concept erasure in text-to-image diffusion models is crucial for mitigating harmful content, yet existing methods often compromise generative quality. We introduce Semantic Surgery, a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process. It dynamically estimates the presence of target concepts in a prompt and performs a calibrated vector subtraction to neutralize their influence at the source, enhancing both erasure completeness and locality. The framework includes a Co-Occurrence Encoding module for robust multi-concept erasure and a visual feedback loop to address latent concept persistence. As a training-free method, Semantic Surgery adapts dynamically to each prompt, ensuring precise interventions. Extensive experiments on object, explicit content, artistic style, and multi-celebrity erasure tasks show our method significantly outperforms state-of-the-art approaches. We achieve superior completeness and robustness while preserving locality and image quality (e.g., 93.58 H-score in object erasure, reducing explicit content to just 1 instance, and 8.09 H_a in style erasure with no quality degradation). This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.
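The idea of a presence-scaled vector subtraction on text embeddings can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify how presence is estimated or calibrated, so the cosine-similarity gate, the `threshold` parameter, and the function name `erase_concept` are all hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def erase_concept(prompt_emb, concept_emb, threshold=0.2):
    """Subtract the concept direction from a prompt embedding.

    Hypothetical rule (not from the paper): presence is the cosine
    similarity between the embeddings; below `threshold` the prompt is
    returned unchanged. Assumes both vectors are nonzero.
    """
    concept_norm = math.sqrt(_dot(concept_emb, concept_emb))
    prompt_norm = math.sqrt(_dot(prompt_emb, prompt_emb))
    unit = [c / concept_norm for c in concept_emb]
    projection = _dot(prompt_emb, unit)   # component along the concept
    presence = projection / prompt_norm   # cosine similarity
    if presence < threshold:
        return list(prompt_emb)           # concept judged absent
    return [p - presence * projection * u for p, u in zip(prompt_emb, unit)]


edited = erase_concept([1.0, 1.0], [1.0, 0.0])
untouched = erase_concept([0.0, 1.0], [1.0, 0.0])
print(edited, untouched)
```

Gating on presence is what keeps the edit local in this toy version: a prompt that does not point along the concept direction passes through unchanged.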

[252] Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models

Yang Zhang,Qianyu Zhou,Farhad Imani,Jiong Tang

Main category: cs.CV

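TL;DR: Proposes a zero-shot wind turbine blade damage inspection framework that combines Retrieval-Augmented Generation (RAG) with a vision-language model (VLM), using a multimodal knowledge base for efficient, explainable inspection without large labeled datasets.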

Details Motivation: Conventional deep learning methods depend on large labeled datasets, struggle with rare or new damage types, and face hard-to-obtain data, so an inspection method that does not rely on large-scale annotation is needed. Method: A multimodal knowledge base is built from technical documentation, reference images, and domain guidelines. A hybrid text-image retriever with keyword-aware reranking supplies relevant context to the VLM at inference, injecting knowledge without task-specific training. Result: On 30 labeled blade images, the RAG-grounded VLM classified every sample correctly, outperforming the same VLM without retrieval and open-vocabulary baselines; Clopper-Pearson confidence intervals are reported to quantify uncertainty in the small-sample setting. Conclusion: Knowledge retrieval improves the VLM's accuracy, explainability, and generalization and reduces dependence on labeled data, offering a data-efficient solution for industrial inspection. Abstract: Wind turbine blades operate in harsh environments, making timely damage detection essential for preventing failures and optimizing maintenance. Drone-based inspection and deep learning are promising, but typically depend on large, labeled datasets, which limit their ability to detect rare or evolving damage types. To address this, we propose a zero-shot-oriented inspection framework that integrates Retrieval-Augmented Generation (RAG) with Vision-Language Models (VLM). A multimodal knowledge base is constructed, comprising technical documentation, representative reference images, and domain-specific guidelines. A hybrid text-image retriever with keyword-aware reranking assembles the most relevant context to condition the VLM at inference, injecting domain knowledge without task-specific training. We evaluate the framework on 30 labeled blade images covering diverse damage categories. Although the dataset is small due to the difficulty of acquiring verified blade imagery, it covers multiple representative defect types. On this test set, the RAG-grounded VLM correctly classified all samples, whereas the same VLM without retrieval performed worse in both accuracy and precision. We further compare against open-vocabulary baselines and incorporate Clopper-Pearson confidence intervals to account for uncertainty in the small-sample setting. Ablation studies indicate that the key advantage of the framework lies in explainability and generalizability: retrieved references ground the reasoning process and enable the detection of previously unseen defects by leveraging domain knowledge rather than relying solely on visual cues. This research contributes a data-efficient solution for industrial inspection that reduces dependence on extensive labeled datasets.
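Because the evaluation uses only 30 images, it helps to see what an exact (Clopper-Pearson) interval looks like for a perfect score. The sketch below is a standard-library implementation using bisection on the binomial CDF; the function names are illustrative assumptions, and the paper's own computation may differ.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_p(k, n, target, iters=200):
    """Find p with binom_cdf(k, n, p) == target (the CDF decreases in p)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2)
    return lower, upper


lower, upper = clopper_pearson(30, 30)
print(f"95% CI for 30/30 correct: [{lower:.3f}, {upper:.3f}]")
```

For 30 out of 30 correct, the lower bound is about 0.88, so even a perfect score on this test set is consistent with a true accuracy several points below 100%.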

[253] Estimating Pasture Biomass from Top-View Images: A Dataset for Precision Agriculture

Qiyu Liao,Dadong Wang,Rebecca Haling,Jiajun Liu,Xun Li,Martyna Plomecka,Andrew Robson,Matthew Pringle,Rhys Pirie,Megan Walker,Joshua Whelan

Main category: cs.CV

TL;DR: Introduces a dataset of 1,162 annotated top-view pasture images for biomass estimation in precision grazing management.

Details Motivation: Accurate pasture biomass estimates are essential for livestock decisions: they help optimize stocking rates, prevent overgrazing, and promote system health. Method: Top-view images covering multiple seasons and a range of temperate pasture species were collected at 19 locations in Australia. Each image covers a 70 cm × 30 cm quadrat and is paired with on-ground measurements such as biomass by component, vegetation height, and NDVI. Result: The dataset combines visual, spectral, and structural information, can be used to train machine learning models, and has been released through a Kaggle competition open to the international community. Conclusion: This multidimensional dataset is a valuable resource for precision grazing management and AI-based biomass estimation. Abstract: Accurate estimation of pasture biomass is important for decision-making in livestock production systems. Estimates of pasture biomass can be used to manage stocking rates to maximise pasture utilisation, while minimising the risk of overgrazing and promoting overall system health. We present a comprehensive dataset of 1,162 annotated top-view images of pastures collected across 19 locations in Australia. The images were taken across multiple seasons and include a range of temperate pasture species. Each image captures a 70 cm × 30 cm quadrat and is paired with on-ground measurements including biomass sorted by component (green, dead, and legume fraction), vegetation height, and Normalized Difference Vegetation Index (NDVI) from Active Optical Sensors (AOS). The multidimensional nature of the data, which combines visual, spectral, and structural information, opens up new possibilities for advancing the use of precision grazing management. The dataset is released and hosted in a Kaggle competition that challenges the international Machine Learning community with the task of pasture biomass estimation. The dataset is available on the official Kaggle webpage: https://www.kaggle.com/competitions/csiro-biomass
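For readers unfamiliar with NDVI, the index is conventionally computed from near-infrared and red reflectance as (NIR - Red) / (NIR + Red). The short sketch below shows that standard formula; the function name and the sample reflectance values are illustrative assumptions and do not come from the dataset.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # convention chosen here for the undefined 0/0 case
    return (nir - red) / total


# Hypothetical reflectances: dense green canopy vs. mostly dead material.
print(ndvi(0.50, 0.10), ndvi(0.30, 0.25))
```

Values lie in [-1, 1] for non-negative reflectances, with higher values indicating more green vegetation.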

[254] Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression

Pranav Saxena

Main category: cs.CV

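TL;DR: Proposes Gen-LangSplat, which replaces scene-specific autoencoders with a generalized autoencoder pre-trained on ScanNet, removing the per-scene training bottleneck and enabling efficient, scalable modeling of 3D open-vocabulary language fields.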

Details Motivation: Existing methods such as LangSplat must train a separate language autoencoder for every scene, which incurs a high optimization cost and limits deployment scalability. Method: A generalized autoencoder pre-trained on the large-scale ScanNet dataset replaces the scene-specific one, representing language features in a fixed, compact latent space with no extra training for new scenes. Result: Without scene-specific training, language field construction becomes markedly more efficient, and querying performance matches or exceeds the original LangSplat. An ablation study determines the optimal latent dimension, and MSE and cosine similarity are used to verify the fidelity of CLIP feature reconstruction. Conclusion: A generalized autoencoder can support open-vocabulary 3D querying across scenes, offering a viable path to scalable, real-time interactive 3D AI applications. Abstract: Modeling open-vocabulary language fields in 3D is essential for intuitive human-AI interaction and querying within physical environments. State-of-the-art approaches, such as LangSplat, leverage 3D Gaussian Splatting to efficiently construct these language fields, encoding features distilled from high-dimensional models like CLIP. However, this efficiency is currently offset by the requirement to train a scene-specific language autoencoder for feature compression, introducing a costly, per-scene optimization bottleneck that hinders deployment scalability. In this work, we introduce Gen-LangSplat, which eliminates this requirement by replacing the scene-wise autoencoder with a generalized autoencoder, pre-trained extensively on the large-scale ScanNet dataset. This architectural shift enables the use of a fixed, compact latent space for language features across any new scene without any scene-specific training. By removing this dependency, our entire language field construction process achieves an efficiency boost while delivering querying performance comparable to, or exceeding, the original LangSplat method. To validate our design choice, we perform a thorough ablation study empirically determining the optimal latent embedding dimension and quantifying representational fidelity using Mean Squared Error and cosine similarity between the original and reprojected 512-dimensional CLIP embeddings. Our results demonstrate that generalized embeddings can efficiently and accurately support open-vocabulary querying in novel 3D scenes, paving the way for scalable, real-time interactive 3D AI applications.
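The two fidelity measures named in the abstract are simple to state in code. The sketch below computes Mean Squared Error and cosine similarity between an original embedding and its reconstruction, using short toy vectors in place of 512-dimensional CLIP embeddings; the function names and values are illustrative assumptions.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for an original CLIP embedding and its autoencoder output.
original = [1.0, 2.0, 0.0, -1.0]
reconstructed = [1.0, 4.0, 0.0, -1.0]
print(mse(original, reconstructed), cosine_similarity(original, reconstructed))
```

The two metrics are complementary: MSE is sensitive to magnitude errors, while cosine similarity only measures direction, which is what similarity-based text queries against CLIP features usually depend on.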

[255] Positional Preservation Embedding for Multimodal Large Language Models

Mouxiao Huang,Borui Jiang,Dehua Zheng,Hailin Hu,Kai Han,Xinghao Chen

Main category: cs.CV

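TL;DR: Proposes Positional Preservation Embedding (PPE), a new encoding operator that preserves spatiotemporal structure while compressing visual tokens, improving the efficiency and performance of multimodal large language models.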

Details Motivation: Existing visual token merging methods often ignore positional relationships when shortening sequences, which disrupts spatial layout and temporal continuity and degrades model performance. Method: The PPE operator explicitly introduces a disentangled encoding of 3D positions along the token dimension, so that each compressed token carries the positions of multiple original tokens, and it supports cascade clustering as a progressive compression strategy. PPE adds no parameters and integrates seamlessly into existing methods. Result: Consistent gains of 2% to 5% on several vision-language benchmarks, including MMBench, TextVQA, and VideoMME. Conclusion: Preserving positional cues is critical for efficient and effective MLLM reasoning, and PPE offers a general and effective approach to token compression. Abstract: Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator dubbed Positional Preservation Embedding (PPE), whose main hallmark is the preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces the disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering, a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free and generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to a state-of-the-art token merging framework, PPE achieves consistent improvements of $2\%\sim5\%$ across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding) and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning.

[256] Bi-Encoder Contrastive Learning for Fingerprint and Iris Biometrics

Matthew So,Judah Goldfeder,Mark Lis,Hod Lipson

Main category: cs.CV

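TL;DR: By training bi-encoder networks, the study tests whether an individual's biometrics (fingerprints and irises) are correlated. It finds that the left and right irises of the same person are clearly correlated, while cross-modal (fingerprint-to-iris) matching is only slightly above chance, challenging the traditional assumption that biometrics are independent.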

Details Motivation: To challenge the traditional assumption that biometrics are statistically uncorrelated and to explore intrinsic links between different biometric traits. Method: ResNet-50 and Vision Transformer backbones are trained with a contrastive loss in a bi-encoder architecture and evaluated on fingerprint-to-fingerprint, iris-to-iris, and cross-modal fingerprint-to-iris matching. Result: The iris ResNet model reaches 91 ROC AUC on iris-to-iris matching, indicating a strong correlation between left and right irises; fingerprint results agree with prior work; cross-modal matching is limited, only slightly above random guessing. Conclusion: An individual's biometric traits are not fully independent, with irises in particular showing clear correlation, so the traditional independence assumption should be revisited; future work will extend to other biometrics. Abstract: There has been a historic assumption that the biometrics of an individual are statistically uncorrelated. We test this assumption by training Bi-Encoder networks on three verification tasks, including fingerprint-to-fingerprint matching, iris-to-iris matching, and cross-modal fingerprint-to-iris matching using 274 subjects with $\sim$100k fingerprints and 7k iris images. We trained ResNet-50 and Vision Transformer backbones in Bi-Encoder architectures such that the contrastive loss between images sampled from the same individual is minimized. The iris ResNet architecture reaches 91 ROC AUC score for iris-to-iris matching, providing clear evidence that the left and right irises of an individual are correlated. Fingerprint models reproduce the positive intra-subject correlation suggested by prior work in this space. This is the first work attempting to use Vision Transformers for this matching. Cross-modal matching rises only slightly above chance, which suggests that more data and a more sophisticated pipeline are needed to obtain compelling results. These findings continue to challenge independence assumptions of biometrics and we plan to extend this work to other biometrics in the future. Code available: https://github.com/MatthewSo/bio_fingerprints_iris.
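Since the headline result is an ROC AUC score, a small sketch of what that number measures may help. ROC AUC equals the probability that a randomly chosen same-person pair receives a higher similarity score than a randomly chosen different-person pair (ties counted as half). The scores and labels below are made-up toy values, and `roc_auc` is an illustrative helper, not the authors' evaluation code.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 for a same-subject pair, 0 for a different-subject pair.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy similarity scores from a hypothetical bi-encoder.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))
```

A value of 0.5 corresponds to chance, which is the reference point for the cross-modal result described above.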

[257] Switchable Token-Specific Codebook Quantization For Face Image Compression

Yongbo Wang,Haonan Wang,Guodong Mu,Ruixin Zhang,Jiaqi Chen,Jingyun Zhang,Jun Wang,Yuan Xie,Zhizhong Zhang,Shouhong Ding

Main category: cs.CV

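TL;DR: Proposes Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token, markedly improving reconstruction quality and semantic preservation at low bit rates.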

Details Motivation: Methods built on a globally shared codebook overlook category-specific correlations within images and semantic differences among tokens when handling attribute-rich face images, which hurts performance at low bits per pixel (bpp). Method: Switchable Token-Specific Codebook Quantization learns distinct codebook groups for different image categories and assigns an independent codebook to each token. A small number of bits records the codebook group each token belongs to, which reduces distortion as each codebook group shrinks and yields a more efficient representation at lower bpp. Result: The method is validated on face recognition datasets, where reconstructed images at 0.05 bpp reach an average recognition accuracy of 93.51%, outperforming conventional global-codebook methods. Conclusion: Fine-grained codebook design improves face image compression and reconstruction at very low bit rates, particularly for recovering semantics, and the design is general enough to plug into existing codebook-based representation learning frameworks. Abstract: With the ever-increasing volume of visual data, the efficient and lossless transmission, along with its subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. Emerging codebook-based solutions utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
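The trade-off between token count, codebook size, and bpp can be illustrated with a small sketch. It shows nearest-neighbor quantization against a token-specific codebook and a simple bit-budget calculation. The accounting (each token stores one code index plus its group index) is an assumption based on the abstract's wording, and all names and numbers are hypothetical.

```python
import math


def quantize(token, codebook):
    """Return the index of the code vector nearest to `token` (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((t - c) ** 2 for t, c in zip(token, codebook[i])),
    )


def bits_per_pixel(num_tokens, codebook_size, num_groups, height, width):
    """Assumed budget: each token stores a code index and a group index."""
    bits_per_token = math.log2(codebook_size) + math.log2(num_groups)
    return num_tokens * bits_per_token / (height * width)


# Token-specific codebooks: position 0 and position 1 use different codes.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
tokens = [[0.9, 0.1], [0.2, 0.8]]
indices = [quantize(tok, codebooks[pos]) for pos, tok in enumerate(tokens)]
print(indices, bits_per_pixel(256, 16, 4, 256, 256))
```

Under this accounting, shrinking each codebook saves index bits on every token, while the group bits let the total number of available codes grow, which is the intuition behind the switchable design.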

[258] LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Zeyu Wang,Zilong Chen,Chenhui Gou,Feng Li,Chaorui Deng,Deyao Zhu,Kunchang Li,Weihao Yu,Haoqin Tu,Haoqi Fan,Cihang Xie

Main category: cs.CV

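TL;DR: Proposes an efficient way to build a unified multimodal model by fusing existing models specialized for generation and for understanding through a double fusion mechanism, achieving strong cross-modal synergy while preserving the strengths of the base models.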

Details Motivation: Most unified multimodal models are trained from scratch at a high computational cost; this work explores a more efficient model fusion strategy that lowers training cost while staying competitive. Method: The original blocks of the base models are kept unchanged, and multimodal self-attention blocks are interleaved throughout the networks, fusing the generation model (low-level spatial signals) with the understanding model (high-level semantic representations) in a double fusion. Result: Trained on only about 35B tokens, the method performs strongly on several benchmarks: 0.91 on GenEval, 82.16 on DPG-Bench, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench. Conclusion: Strategically fusing existing specialized models with interleaved multimodal self-attention blocks yields high-performing unified multimodal modeling at a lower resource cost, and the authors fully release code, weights, and datasets to support follow-up research. Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

[259] FAME: Fairness-aware Attention-modulated Video Editing

Zhangkai Wu,Xuhui Fan,Zhongyuan Xie,Kaize Shi,Zhidong Li,Longbing Cao

Main category: cs.CV

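TL;DR: Proposes FAME, a training-free video editing method that reduces profession-related gender bias while maintaining prompt alignment and temporal consistency.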

Details Motivation: Existing video editing models tend to reinforce gender stereotypes when handling profession-related prompts and do not account for fairness. Method: Fairness embeddings are obtained by softly injecting debiasing tokens and are integrated into temporal self-attention and prompt-to-region cross-attention. A region-constrained attention mask with time-decay weighting improves intra-region coherence and suppresses irrelevant inter-region interactions, and fairness-sensitive similarity masks reweight token-to-region matching scores. Result: On FairVE, a newly built fairness-oriented video editing benchmark, FAME outperforms existing baselines in fairness alignment and semantic fidelity. Conclusion: FAME mitigates profession-related gender bias in video editing while keeping good temporal and semantic consistency, offering a practical approach to fairness-aware video editing. Abstract: Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose FAME (Fairness-aware Attention-modulated Video Editing), which mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self attention and prompt-to-region cross attention to mitigate the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self attention, FAME introduces a region constrained attention mask combined with time decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross attention, it reweights token-to-region matching scores by incorporating fairness sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the right visual regions and prevent temporal drift across frames. Extensive experiments on a new fairness-oriented VE benchmark, FairVE, demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.

[260] Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges

Liling Yang,Ning Chen,Jun Yue,Yidan Liu,Jiayi Ma,Pedram Ghamisi,Antonio Plaza,Leyuan Fang

Main category: cs.CV

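TL;DR: Surveys research on multimodal geospatial foundation models (GFMs), analyzing key techniques and challenges in remote sensing image analysis from a modality-driven perspective and discussing their application potential in real-world scenarios and future research directions.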

Details Motivation: Remote sensing data are multimodal, multi-resolution, and multi-temporal, which conventional methods handle poorly; the strong generalization and transfer abilities of foundation models offer a new way to address this. Method: Starting from five core visual and vision-language modalities, the survey analyzes how differences in imaging physics and data representation shape interaction design, reviews key techniques for alignment, integration, and knowledge transfer, and assesses training paradigms, architectures, and task-adaptation strategies. Result: Representative multimodal GFMs are evaluated on ten downstream tasks, their use is illustrated in real-world cases such as land cover mapping, agricultural monitoring, and disaster response, and current benchmarks and performance are summarized. Conclusion: Multimodal GFMs show great potential in remote sensing but still face challenges in domain generalization, interpretability, efficiency, and privacy; future work should pursue more efficient, robust, and trustworthy model designs. Abstract: Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. To address unique challenges in the field, multimodal geospatial foundation models (GFMs) have emerged as a dedicated research frontier. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective, covering five core visual and vision-language modalities. We examine how differences in imaging physics and data representation shape interaction design, and we analyze key techniques for alignment, integration, and knowledge transfer to tackle modality heterogeneity, distribution shifts, and semantic gaps. Advances in training paradigms, architectures, and task-specific adaptation strategies are systematically assessed alongside a wealth of emerging benchmarks. Representative multimodal visual and vision-language GFMs are evaluated across ten downstream tasks, with insights into their architectures, performance, and application scenarios. Real-world case studies, spanning land cover mapping, agricultural monitoring, disaster response, climate studies, and geospatial intelligence, demonstrate the practical potential of GFMs. Finally, we outline pressing challenges in domain generalization, interpretability, efficiency, and privacy, and chart promising avenues for future research.

[261] VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

Zhangkai Wu,Xuhui Fan,Zhongyuan Xie,Kaize Shi,Longbing Cao

Main category: cs.CV

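TL;DR: Proposes VALA (Variational Alignment for Latent Anchors), a variational alignment module for training-free video editing that adaptively selects key frames and compresses their latent features for consistent video editing.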

Details Motivation: Existing methods rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and limits the scalability of end-to-end inference. Method: A variational framework with a contrastive learning objective learns meaningful assignments, turning cross-frame latent representations into compressed latent anchors that preserve content and temporal coherence. Result: Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art inversion fidelity, editing quality, and temporal consistency while being more efficient than prior methods. Conclusion: VALA can be fully integrated into training-free, text-to-image-based video editing models, improving editing results along with efficiency and consistency. Abstract: Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose VALA (Variational Alignment for Latent Anchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA uses a variational framework with a contrastive learning objective. It can therefore transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.

[262] Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

Bohan Li,Xin Jin,Hu Zhu,Hongsi Liu,Ruikai Li,Jiazhe Guo,Kaiwen Cai,Chao Ma,Yueming Jin,Hao Zhao,Xiaokang Yang,Wenjun Zeng

Main category: cs.CV

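TL;DR: Presents Nuplan-Occ, the largest semantic occupancy dataset to date, and builds on it a unified driving scene generation framework that jointly generates high-quality semantic occupancy, multi-view videos, and LiDAR point clouds.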

Details Motivation: Occupancy-centric methods depend on large amounts of annotated data, which are scarce and hold back generative models; a large-scale occupancy dataset and an efficient generation framework are therefore needed. Method: A spatio-temporal disentangled architecture, combined with a Gaussian splatting-based rendering strategy and a sensor-aware embedding strategy, enables high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy and joint generation of multimodal driving scene data. Result: Experiments on Nuplan-Occ show better generation quality and scalability than existing methods, along with practical value in downstream tasks. Conclusion: The unified framework offers an effective solution for multimodal scene generation in autonomous driving and advances the practical use of occupancy-based generative models. Abstract: Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also autonomous driving downstream applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validates its practical value in downstream tasks. Repo: https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2

[263] VoMP: Predicting Volumetric Mechanical Property Fields

Rishit Dagli,Donglai Xiang,Vismay Modi,Charles Loop,Clement Fuji Tsang,Anka He Chen,Anita Hu,Gavriel State,David I. W. Levin,Maria Shugrina

Main category: cs.CV

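TL;DR: VoMP is a feed-forward method that predicts Young's modulus, Poisson's ratio, and density for each voxel inside a 3D object, using multi-view features and a Geometry Transformer to produce valid material properties on a manifold of physically plausible materials.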

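Details Motivation: Physical simulation relies on spatially varying mechanical properties, which are usually hand-crafted at great effort, so an automated way to predict material properties inside 3D objects is needed. Method: VoMP aggregates multi-view features for each voxel and feeds them to a trained Geometry Transformer that predicts a per-voxel material latent code. The latent codes lie on a manifold of physically realizable materials learned from a real-world dataset, which keeps decoded material properties valid and plausible. Result: Experiments show that VoMP estimates volumetric properties accurately and far outperforms prior methods in accuracy and speed. The work also contributes an annotation pipeline combining segmented 3D datasets, material databases, and a vision-language model, together with a new benchmark. Conclusion: VoMP efficiently and accurately predicts voxel-level material properties for any 3D object that can be rendered and voxelized, advancing the automation of material modeling for physical simulation. Abstract: Physical simulation relies on spatially-varying mechanical properties, often laboriously hand-crafted. VoMP is a feed-forward method trained to predict Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$) throughout the volume of 3D objects, in any representation that can be rendered and voxelized. VoMP aggregates per-voxel multi-view features and passes them to our trained Geometry Transformer to predict per-voxel material latent codes. These latents reside on a manifold of physically plausible materials, which we learn from a real-world dataset, guaranteeing the validity of decoded per-voxel materials. To obtain object-level training data, we propose an annotation pipeline combining knowledge from segmented 3D datasets, material databases, and a vision-language model, along with a new benchmark. Experiments show that VoMP estimates accurate volumetric properties, far outperforming prior art in accuracy and speed.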
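To show how predicted values of $E$ and $\nu$ would typically be consumed downstream, the sketch below converts them to the Lamé parameters used by many isotropic linear-elasticity solvers. This is the standard textbook conversion, not something described in the abstract, and the function name and sample values are illustrative assumptions.

```python
def lame_parameters(youngs_modulus, poisson_ratio):
    """Convert (E, nu) to the Lame parameters (lambda, mu) for an isotropic material.

    lambda = E * nu / ((1 + nu) * (1 - 2 * nu)),  mu = E / (2 * (1 + nu))
    """
    if not (-1.0 < poisson_ratio < 0.5):
        raise ValueError("Poisson's ratio must lie in (-1, 0.5) for a stable material")
    lam = youngs_modulus * poisson_ratio / (
        (1.0 + poisson_ratio) * (1.0 - 2.0 * poisson_ratio)
    )
    mu = youngs_modulus / (2.0 * (1.0 + poisson_ratio))
    return lam, mu


# Hypothetical per-voxel prediction in normalized units.
lam, mu = lame_parameters(1.0, 0.25)
print(lam, mu)
```

The range check hints at why constraining predictions to physically plausible materials matters: values of $\nu$ at or beyond 0.5 make the conversion blow up.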

[264] SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency

Quanjian Song,Donghao Zhou,Jingyu Lin,Fei Shen,Jiaze Wang,Xiaowei Hu,Cunjian Chen,Pheng-Ann Heng

Main category: cs.CV

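TL;DR: Proposes SceneDecorator, a training-free framework that addresses scene planning and scene consistency in text-to-image generation, improving the coherence and creativity of story generation.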

Details Motivation: Existing text-to-image models fall short on concept consistency and in particular neglect the role of scenes in storytelling, which limits creativity in practice. Method: SceneDecorator uses VLM-Guided Scene Planning for global-to-local narrative coherence and Long-Term Scene-Sharing Attention to maintain scene consistency and subject diversity across stories. Result: Extensive experiments show that SceneDecorator performs strongly on scene planning and consistency, markedly improving the coherence and diversity of generated stories. Conclusion: SceneDecorator addresses scene-level narrative coherence and long-term scene consistency and has broad potential in creative fields such as art, film, and games. Abstract: Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods fail to ensure scene-level narrative coherence by relying solely on text descriptions, and (ii) scene consistency, which remains largely unexplored in terms of maintaining scene consistency across multiple stories. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.

[265] LoMix: Learnable Weighted Multi-Scale Logits Mixing for Medical Image Segmentation

Md Mostafijur Rahman,Radu Marculescu

Main category: cs.CV

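TL;DR: Proposes LoMix (Logits Mixing), a differentiable module inspired by neural architecture search (NAS) that fuses multi-scale logits in U-shaped networks and learns loss weights for each scale and mixture, markedly improving segmentation with zero inference overhead, high data efficiency, and good generalization.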

Details Motivation: 现有的U型网络训练方法通常孤立地处理多尺度logits,仅监督最终输出或对所有尺度使用相同权重的深度监督,忽略了粗略与精细预测融合带来的互补信息,限制了性能提升。 Method: LoMix引入四种轻量级融合操作(加法、乘法、拼接和注意力加权融合)生成多尺度的‘突变’logits图,并为每个原始或突变图分配可学习的softplus损失权重,与网络参数联合优化,模拟一步架构搜索,自动发现最优的尺度、混合方式和操作。 Result: 在Synapse 8器官数据集上,LoMix结合PVT-V2-B2与EMCAD解码器,相比单输出监督DICE提升+4.2%,深度监督提升+2.2%,等权融合提升+1.5%;数据稀缺时优势更大(+9.23%);在四个基准和多种U型网络上DICE最高提升+13.5%。 Conclusion: LoMix通过可学习的加权多尺度logits融合,有效利用了不同尺度间的互补信息,在不增加推理开销的前提下显著提升了分割性能,具备良好的数据效率、泛化能力和可解释性。 Abstract: U-shaped networks output logits at multiple spatial scales, each capturing a different blend of coarse context and fine detail. Yet, training still treats these logits in isolation - either supervising only the final, highest-resolution logits or applying deep supervision with identical loss weights at every scale - without exploring mixed-scale combinations. Consequently, the decoder output misses the complementary cues that arise only when coarse and fine predictions are fused. To address this issue, we introduce LoMix (Logits Mixing), a NAS-inspired, differentiable plug-and-play module that generates new mixed-scale outputs and learns how exactly each of them should guide the training process. More precisely, LoMix mixes the multi-scale decoder logits with four lightweight fusion operators: addition, multiplication, concatenation, and attention-based weighted fusion, yielding a rich set of synthetic mutant maps. Every original or mutant map is given a softplus loss weight that is co-optimized with network parameters, mimicking a one-step architecture search that automatically discovers the most useful scales, mixtures, and operators. Plugging LoMix into recent U-shaped architectures (i.e., PVT-V2-B2 backbone with EMCAD decoder) on Synapse 8-organ dataset improves DICE by +4.2% over single-output supervision, +2.2% over deep supervision, and +1.5% over equally weighted additive fusion, all with zero inference overhead. 
When training data are scarce (e.g., one or two labeled scans), the advantage grows to +9.23%, underscoring LoMix's data efficiency. Across four benchmarks and diverse U-shaped networks, LoMiX improves DICE by up to +13.5% over single-output supervision, confirming that learnable weighted mixed-scale fusion generalizes broadly while remaining data efficient, fully interpretable, and overhead-free at inference. Our code is available at https://github.com/SLDGroup/LoMix.
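
The idea of weighting original and mixed-scale logit maps with softplus loss weights can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only two of the four fusion operators (addition and multiplication), uses a plain binary cross-entropy on flat lists, and every name and number in it (`mix_logits`, `lomix_style_loss`, the sample logits) is hypothetical.

```python
import math

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)), always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mix_logits(a, b):
    """Build 'mutant' maps from two same-size logit maps (flat lists)."""
    return {
        "add": [x + y for x, y in zip(a, b)],
        "mul": [x * y for x, y in zip(a, b)],
    }

def bce_with_logits(logits, targets):
    # Mean binary cross-entropy computed directly from logits.
    total = 0.0
    for z, t in zip(logits, targets):
        total += softplus(z) - t * z
    return total / len(logits)

def lomix_style_loss(maps, raw_weights, targets):
    """Sum of per-map losses, each scaled by softplus(raw weight)."""
    return sum(softplus(raw_weights[name]) * bce_with_logits(m, targets)
               for name, m in maps.items())

coarse = [2.0, -1.0, 0.5, -3.0]   # hypothetical upsampled low-resolution logits
fine = [1.5, -2.0, 1.0, -2.5]     # hypothetical full-resolution logits
targets = [1.0, 0.0, 1.0, 0.0]

maps = {"coarse": coarse, "fine": fine, **mix_logits(coarse, fine)}
raw_weights = {name: 0.0 for name in maps}  # would be trained in practice
loss = lomix_style_loss(maps, raw_weights, targets)
```

Because softplus is strictly positive, no map's loss term can be switched off entirely or given a negative weight; in a real training loop the raw weights would be updated by the same optimizer as the network parameters.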

[266] CoMo: Compositional Motion Customization for Text-to-Video Generation

Youcan Xu,Zhen Wang,Jiaxin Shi,Kexin Li,Feifei Shao,Jun Xiao,Yi Yang,Jun Yu,Long Chen

Main category: cs.CV

TL;DR: This paper proposes CoMo, a new framework for compositional motion customization in text-to-video generation that decouples static and dynamic information and uses a divide-and-merge strategy to precisely control multiple subjects and motions within a single video.

Details Motivation: Existing text-to-video models struggle with precise control of complex multi-subject motions, and single-motion customization methods fail in compositional scenarios, mainly because of motion-appearance entanglement and ineffective multi-motion blending. Method: CoMo uses a two-phase approach: the first phase learns a motion-specific module through a static-dynamic decoupled tuning paradigm; the second phase uses a plug-and-play divide-and-merge strategy that spatially isolates the influence of each motion during denoising, composing multiple motions without additional training. Result: Experiments show that CoMo achieves state-of-the-art multi-motion fidelity and blending and significantly advances controllable video generation; the paper also introduces a new benchmark and evaluation metric. Conclusion: CoMo effectively addresses motion-appearance entanglement and multi-motion blending in multi-subject motion customization, advancing complex motion control in text-to-video generation. Abstract: While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for $\textbf{compositional motion customization}$ in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at https://como6.github.io/.

[267] UGAE: Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds

Pan Zhao,Hui Yuan,Chongzhen Tian,Tian Guo,Raouf Hamzaoui,Zhigeng Pan

Main category: cs.CV

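TL;DR: This paper proposes a unified geometry and attribute enhancement (UGAE) framework for point clouds, consisting of three modules (post-geometry enhancement, pre-attribute recoloring, and post-attribute enhancement), which significantly improves the quality of compressed point clouds.

Details Motivation: Lossy compression distorts the geometry structure and attribute information of point clouds and degrades reconstruction quality, so effective enhancement methods are needed to restore detail and improve perceptual quality. Method: UGAE comprises (1) a Transformer-based sparse convolutional U-Net for precise geometry reconstruction; (2) a detail-aware K-nearest-neighbors (DA-KNN) recoloring strategy guided by the enhanced geometry, used for attribute pre-enhancement; and (3) an attribute residual prediction network with a weighted MSE loss for decoder-side attribute enhancement. Result: On the three benchmark datasets 8iVFB, Owlii, and MVUB, UGAE significantly outperforms existing methods. Compared with the G-PCC test model TMC13v29, it achieves an average BD-PSNR gain of 9.98 dB and 90.98% bitrate savings for geometry under the D1 metric, and a 3.67 dB BD-PSNR gain with 56.88% bitrate savings for the Y attribute component, along with significantly better perceptual quality. Conclusion: By jointly optimizing geometry and attribute enhancement, UGAE effectively compensates for the distortion introduced by lossy compression and delivers significant gains in geometry reconstruction accuracy, attribute fidelity, and perceptual quality. Abstract: Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address these issues, we propose a unified geometry and attribute enhancement (UGAE) framework, which consists of three core components: post-geometry enhancement (PoGE), pre-attribute enhancement (PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based sparse convolutional U-Net is used to reconstruct the geometry structure with high precision by predicting voxel occupancy probabilities. Building on the refined geometry structure, PAE introduces an innovative enhanced geometry-guided recoloring strategy, which uses a detail-aware K-Nearest Neighbors (DA-KNN) method to achieve accurate recoloring and effectively preserve high-frequency details before attribute compression. Finally, at the decoder side, PoAE uses an attribute residual prediction network with a weighted mean squared error (W-MSE) loss to enhance the quality of high-frequency regions while maintaining the fidelity of low-frequency regions. UGAE significantly outperformed existing methods on three benchmark datasets: 8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29), UGAE achieved an average BD-PSNR gain of 9.98 dB and 90.98% BD-bitrate savings for geometry under the D1 metric, as well as a 3.67 dB BD-PSNR improvement with 56.88% BD-bitrate savings for attributes on the Y component. Additionally, it improved perceptual quality significantly.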

[268] Nested AutoRegressive Models

Hongyu Wu,Xuhui Fan,Zhangkai Wu,Longbing Cao

Main category: cs.CV

TL;DR: Proposes NestAR, a nested autoregressive model that uses a multi-scale hierarchical structure and a flow matching loss on continuous tokens to lower computational complexity while improving the diversity and efficiency of image generation.

Details Motivation: Existing autoregressive image generation models are computationally expensive and produce samples of limited diversity, so a more efficient generative architecture is needed. Method: Designs a nested autoregressive structure in which multi-scale modules generate the image hierarchically; within each module an AR structure generates patches of tokens, and a flow matching loss and coordinated training objectives are introduced. Result: Reduces the complexity of generating n image tokens from O(n) to O(log n), significantly cutting computational cost while increasing sample diversity, and achieves competitive performance on image generation. Conclusion: With its nested AR architecture, NestAR greatly improves efficiency while maintaining generation quality, offering a better option for autoregressive image generation. Abstract: AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose a Nested AutoRegressive (NestAR) model, which uses nested AutoRegressive architectures to generate images. NestAR designs multi-scale modules in a hierarchical order. These different scaled modules are constructed in an AR architecture, where one larger-scale module is conditioned on outputs from its previous smaller-scale module. Within each module, NestAR uses another AR structure to generate "patches" of tokens. The proposed nested AR architecture reduces the overall complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ in generating $n$ image tokens, and also increases image diversity. NestAR further incorporates flow matching loss to use continuous tokens, and develops objectives to coordinate these multi-scale modules in model training. NestAR achieves competitive image generation performance while significantly lowering computational cost.

[269] HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

Joungbin An,Kristen Grauman

Main category: cs.CV

TL;DR: Proposes HieraMamba, a hierarchical video temporal grounding method that uses Anchor-MambaPooling blocks and contrastive losses to precisely localize natural language queries in time within long untrimmed videos.

Details Motivation: Existing methods lose temporal detail on long videos because of over-downsampling or fixed windows, and struggle to balance global context with fine-grained temporal information. Method: Designs the hierarchical HieraMamba architecture with Anchor-MambaPooling (AMP) blocks, which use Mamba's selective scanning to generate anchor tokens at multiple granularities; anchor-conditioned and segment-pooled contrastive losses preserve local detail and global discriminability. Result: Achieves state-of-the-art performance on Ego4D-NLQ, MAD, and TACoS, with precise, temporally faithful localization of queried content in long untrimmed videos. Conclusion: Through hierarchical modeling and the Mamba mechanism, HieraMamba preserves the temporal structure and semantic richness of video and significantly improves temporal grounding in long videos. Abstract: Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.

[270] Strategies for Robust Deep Learning Based Deformable Registration

Joel Honkamaa,Pekka Marttinen

Main category: cs.CV

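TL;DR: This paper proposes a simple yet effective way to improve the generalization of deep learning registration models to unseen modalities by transforming images into MIND feature space, and combines it with a special ensembling strategy for a consistent performance gain.

Details Motivation: Existing deep learning based deformable registration methods generalize poorly beyond the training data distribution, which limits their practical use. The LUMIR brain registration challenge evaluates models on contrasts and modalities not included in the training set in order to advance the field. Method: Input images are transformed into MIND (Modality Independent Neighbourhood Descriptor) feature space to make the model more robust across modalities, and a special ensembling strategy is used to fuse predictions. Result: The method significantly improves generalization on cross-modality registration, and the ensembling strategy yields a small but consistent improvement. Conclusion: Simple MIND feature preprocessing effectively improves the robustness and cross-modality generalization of deep learning registration models, making it a practical and effective solution. Abstract: Deep learning based deformable registration methods have become popular in recent years. However, their ability to generalize beyond training data distribution can be poor, significantly hindering their usability. The LUMIR brain registration challenge for Learn2Reg 2025 aims to advance the field by evaluating the performance of the registration on contrasts and modalities different from those included in the training set. Here we describe our submission to the challenge, which proposes a very simple idea for significantly improving robustness by transforming the images into MIND feature space before feeding them into the model. In addition, a special ensembling strategy is proposed that shows a small but consistent improvement.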

[271] EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction

Taoyu Wu,Yiyi Miao,Jiaxin Guo,Ziyan Chen,Sihang Zhao,Zhuoxiao Li,Zhe Tang,Baoru Huang,Limin Yu

Main category: cs.CV

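TL;DR: This paper proposes EndoWave, a unified spatio-temporal Gaussian Splatting framework that addresses the photometric inconsistencies, non-rigid tissue motion, and view-dependent highlights that hamper 3D reconstruction from endoscopic video in robot-assisted minimally invasive surgery.

Details Motivation: Endoscopic scenes contain dynamic visual artifacts such as photometric inconsistencies, non-rigid tissue motion, and view-dependent highlights, so conventional 3DGS methods that rely only on appearance constraints can be misled during optimization and produce inaccurate reconstructions. More robust geometric and rendering constraints are therefore needed. Method: Adopts a unified spatio-temporal Gaussian representation that optimizes primitives directly in a 4D domain; introduces an optical flow-based geometric constraint to enhance temporal coherence and constrain the 3D structure of the scene; and proposes multi-resolution rational orthogonal wavelet supervision to separate endoscopic details and improve rendering performance. Result: Experiments on two real surgical datasets, EndoNeRF and StereoMIS, show that EndoWave outperforms the baseline in both reconstruction quality and visual accuracy, reaching state-of-the-art performance. Conclusion: By combining an optical flow geometric constraint with multi-resolution wavelet supervision, EndoWave significantly improves the accuracy and stability of 3D reconstruction from endoscopic video and is suited to complex, dynamic surgical scenes. Abstract: In robot-assisted minimally invasive surgery, accurate 3D reconstruction from endoscopic video is vital for downstream tasks and improved outcomes. However, endoscopic scenarios present unique challenges, including photometric inconsistencies, non-rigid tissue motion, and view-dependent highlights. Most 3DGS-based methods that rely solely on appearance constraints for optimizing 3DGS are often insufficient in this context, as these dynamic visual artifacts can mislead the optimization process and lead to inaccurate reconstructions. To address these limitations, we present EndoWave, a unified spatio-temporal Gaussian Splatting framework that incorporates an optical flow-based geometric constraint and a multi-resolution rational wavelet supervision. First, we adopt a unified spatio-temporal Gaussian representation that directly optimizes primitives in a 4D domain. Second, we propose a geometric constraint derived from optical flow to enhance temporal coherence and effectively constrain the 3D structure of the scene. Third, we propose a multi-resolution rational orthogonal wavelet as a constraint, which can effectively separate the details of the endoscope and enhance the rendering performance. Extensive evaluations on two real surgical datasets, EndoNeRF and StereoMIS, demonstrate that our method EndoWave achieves state-of-the-art reconstruction quality and visual accuracy compared to the baseline method.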

[272] Revisiting Multimodal Positional Encoding in Vision-Language Models

Jie Huang,Xuejing Liu,Sibo Song,Ruibing Hou,Hong Chang,Junyang Lin,Shuai Bai

Main category: cs.CV

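TL;DR: This paper systematically studies multimodal Rotary Positional Embedding (RoPE) and proposes two simple and effective variants, Multi-Head RoPE and MRoPE-Interleave, which improve multimodal understanding through better position design and frequency allocation without modifying the model architecture.

Details Motivation: Existing work lacks a systematic analysis of multimodal position encoding, and the effective fusion of positional information in vision-language models remains an open problem. Method: Analyzes the two core components of multimodal RoPE, position design and frequency allocation, identifies three key guidelines (positional coherence, full frequency utilization, and preservation of textual priors), and designs MHRoPE and MRoPE-I on that basis. Result: The proposed methods outperform existing approaches across a variety of benchmarks, with notable gains on fine-grained multimodal understanding tasks. Conclusion: MHRoPE and MRoPE-I are effective plug-and-play multimodal position encoding schemes that require no architectural changes, and they provide new design guidelines for modeling positional information in vision-language models. Abstract: Multimodal position encoding is essential for vision-language models, yet there has been little systematic investigation into multimodal position encoding. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.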
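
For readers unfamiliar with the building block under study, the following minimal sketch shows standard one-dimensional RoPE only. It is not MHRoPE or MRoPE-I, whose position design and frequency allocation are not specified in the abstract; the vectors and positions are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # lower channel pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # hypothetical query vector
k = [1.1, 0.4, -0.6, 0.9]   # hypothetical key vector
# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, because the attention score after rotation depends only on the relative offset between query and key. Multimodal variants must decide how channel pairs (frequencies) are shared among temporal and spatial axes, which is the frequency-allocation question the paper examines.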

[273] Residual Diffusion Bridge Model for Image Restoration

Hebaixu Wang,Jing Zhang,Haoyang Chen,Haonan Guo,Di Wang,Jiayi Ma,Bo Du

Main category: cs.CV

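TL;DR: Proposes the Residual Diffusion Bridge Model (RDBM), which theoretically reformulates the stochastic differential equations of the generalized diffusion bridge and uses distribution residuals to modulate noise injection and removal, adaptively restoring degraded regions while protecting undegraded ones; it also unifies existing bridge models and demonstrates its superiority.

Details Motivation: Existing diffusion bridge models are mostly treated as simple variants of stochastic interpolants and lack a unified analytical perspective, and their global noise handling damages undegraded regions. Method: Theoretically reformulates the SDE of the generalized diffusion bridge, derives analytical formulas for the forward and reverse processes, and uses the residuals of the input distributions to modulate noise injection and removal. Result: Existing bridge models are all special cases of RDBM, and RDBM achieves state-of-the-art performance on a variety of image restoration tasks. Conclusion: RDBM provides a unified mathematical understanding of diffusion bridge models and achieves more precise, adaptive image restoration through residual modulation. Abstract: Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact ones. Moreover, we unravel the fundamental mathematical essence of existing bridge models, showing that all of them are special cases of RDBM, and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at https://github.com/MiliLab/RDBM.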

[274] Task-Agnostic Fusion of Time Series and Imagery for Earth Observation

Gianfranco Basile,Johannes Jakubik,Benedikt Blumenstiel,Thomas Brunschwiler,Juan Bernabe Moreno

Main category: cs.CV

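TL;DR: Proposes a task-agnostic multimodal fusion framework for time series and single-timestamp images that enables cross-modal generation through masked correlation learning, and validates its performance in the Earth observation domain.

Details Motivation: Existing methods are mostly task-specific fusion strategies that lack generality and struggle with cross-modal generation; a task-agnostic, unified multimodal representation is needed. Method: Explores deterministic and learned time series quantization strategies and uses a masked correlation learning objective to align discrete image and time series tokens in a unified representation space. Result: In Earth observation tasks, the model generates consistent global temperature profiles from satellite imagery; on downstream tasks, task-agnostic pretraining outperforms task-specific fusion by 6% in R² and 2% in RMSE on average, and exceeds baseline methods by 50% in R² and 12% in RMSE; counterfactual experiments and gradient sensitivity analysis are used to examine model robustness. Conclusion: The proposed task-agnostic pretraining framework performs well on multimodal fusion and cross-modal generation, clearly outperforming task-specific methods and baseline models, with good generality and robustness. Abstract: We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R$^2$ and 2% in RMSE on average, and exceeds baseline methods by 50% in R$^2$ and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.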
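
To make "deterministic time series quantization" concrete, here is a minimal sketch that assumes uniform binning over a fixed value range. The abstract does not say which deterministic scheme the authors use, so the binning rule, the range, the bin count, and the sample temperatures are all assumptions.

```python
def quantize(series, lo, hi, n_bins):
    """Map each value to an integer token in [0, n_bins - 1] by uniform binning."""
    width = (hi - lo) / n_bins
    tokens = []
    for v in series:
        clipped = min(max(v, lo), hi)   # values outside [lo, hi] are clamped
        tokens.append(min(int((clipped - lo) / width), n_bins - 1))
    return tokens

def dequantize(tokens, lo, hi, n_bins):
    """Recover approximate values as bin centers."""
    width = (hi - lo) / n_bins
    return [lo + (t + 0.5) * width for t in tokens]

temps = [-12.5, 0.0, 7.3, 21.9, 40.0]   # hypothetical temperatures in Celsius
tokens = quantize(temps, lo=-40.0, hi=40.0, n_bins=16)
approx = dequantize(tokens, lo=-40.0, hi=40.0, n_bins=16)
```

With 16 bins over an 80-degree range, each token covers 5 degrees, so the reconstruction error of any in-range value is at most 2.5 degrees. A learned quantizer would replace the fixed bin edges with a trained codebook.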

[275] DeepSalt: Bridging Laboratory and Satellite Spectra through Domain Adaptation and Knowledge Distillation for Large-Scale Soil Salinity Estimation

Rupasree Dey,Abdul Matin,Everett Lewark,Tanjim Bin Faruk,Andrei Bachinin,Sam Leuthold,M. Francesca Cotrufo,Shrideep Pallickara,Sangmi Lee Pallickara

Main category: cs.CV

TL;DR: This paper proposes DeepSalt, a deep learning spectral transfer framework that uses knowledge distillation and a novel Spectral Adaptation Unit to transfer laboratory spectral information to satellite hyperspectral sensing, enabling accurate, large-scale soil salinity estimation.

Details Motivation: Soil salinization threatens ecosystems and agriculture. Laboratory spectroscopy is precise but hard to scale, while satellite remote sensing offers wide coverage but limited interpretability, so a method that combines the strengths of both is needed. Method: Proposes the DeepSalt framework, which uses knowledge distillation and a newly designed Spectral Adaptation Unit to transfer knowledge learned from high-resolution laboratory spectra to satellite hyperspectral data, enabling large-scale salinity monitoring without extensive ground sampling. Result: Across several empirical benchmarks, DeepSalt clearly outperforms methods without explicit domain adaptation and generalizes to unseen geographic regions, explaining a substantial portion of the salinity variance. Conclusion: DeepSalt bridges the gap between laboratory spectroscopy and satellite remote sensing, enabling accurate, interpretable, and scalable remote monitoring of soil salinity with broad application prospects. Abstract: Soil salinization poses a significant threat to both ecosystems and agriculture because it limits plants' ability to absorb water and, in doing so, reduces crop productivity. This phenomenon alters the soil's spectral properties, creating a measurable relationship between salinity and light reflectance that enables remote monitoring. While laboratory spectroscopy provides precise measurements, its reliance on in-situ sampling limits scalability to regional or global levels. Conversely, hyperspectral satellite imagery enables wide-area observation but lacks the fine-grained interpretability of laboratory instruments. To bridge this gap, we introduce DeepSalt, a deep-learning-based spectral transfer framework that leverages knowledge distillation and a novel Spectral Adaptation Unit to transfer high-resolution spectral insights from laboratory-based spectroscopy to satellite-based hyperspectral sensing. Our approach eliminates the need for extensive ground sampling while enabling accurate, large-scale salinity estimation, as demonstrated through comprehensive empirical benchmarks. DeepSalt achieves significant performance gains over methods without explicit domain adaptation, underscoring the impact of the proposed Spectral Adaptation Unit and the knowledge distillation strategy. The model also effectively generalized to unseen geographic regions, explaining a substantial portion of the salinity variance.

[276] Note on the Construction of Structure Tensor

Josef Bigun,Fernado Alonso-Fernandez

Main category: cs.CV

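TL;DR: Through the unified lens of Total Least Squares (TLS) line fitting to the spectrum, this note revisits the two structure tensor constructions proposed by Bigun and Granlund (1987) and by Granlund and Knutsson (1995), shows that they are essentially consistent, and points out that the correction term in the latter is redundant.

Details Motivation: The two structure tensor constructions look very different on the surface and lack a unified explanation, which limits their generalization and use; a common theoretical framework is needed to reveal how they are related. Method: Both constructions are modeled as a Total Least Squares (TLS) line fitting problem on the power spectrum and are analyzed and compared theoretically from this unified perspective. Result: The two methods can be reconciled to a large extent within the TLS framework; the correction term in Granlund and Knutsson (1995) can be omitted, which guarantees that the tensor is positive semi-definite and simplifies the interpretation of its eigenvalues; and non-quadrature filters (such as Gabor filters) and non-angular tessellations of the spectrum become admissible. Conclusion: The TLS spectral fitting perspective gives a unified understanding of the different structure tensor methods, removes an unnecessary correction term, widens the range of usable filter types and spectral layouts, and makes the structure tensor more flexible and broadly applicable. Abstract: This note presents a theoretical discussion of two structure tensor constructions: one proposed by Bigun and Granlund 1987, and the other by Granlund and Knutsson 1995. At first glance, these approaches may appear quite different--the former is implemented by averaging outer products of gradient filter responses, while the latter constructs the tensor from weighted outer products of tune-in frequency vectors of quadrature filters. We argue that when both constructions are viewed through the common lens of Total Least Squares (TLS) line fitting to the power spectrum, they can be reconciled to a large extent, and additional benefits emerge. From this perspective, the correction term introduced in Granlund and Knutsson 1995 becomes unnecessary. Omitting it ensures that the resulting tensor remains positive semi-definite, thereby simplifying the interpretation of its eigenvalues. Furthermore, this interpretation allows fitting more than a single orientation to the input by reinterpreting quadrature filter responses without relying on a structure tensor. It also removes the constraint that responses must originate strictly from quadrature filters, allowing the use of alternative filter types and non-angular tessellations. These alternatives include Gabor filters--which, although not strictly quadrature, are still suitable for structure tensor construction--even when they tessellate the spectrum in a Cartesian fashion, provided they are sufficiently concentrated.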
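
A minimal sketch of the gradient-based construction (averaging outer products of gradient responses) may help make the TLS reading concrete. It uses hand-written gradient vectors rather than real filter responses, and the function names and sample values are hypothetical; it does not cover the quadrature-filter construction.

```python
import math

def structure_tensor(gradients):
    """Average the outer products g g^T of 2D gradient vectors (gx, gy)."""
    n = len(gradients)
    jxx = sum(gx * gx for gx, _ in gradients) / n
    jxy = sum(gx * gy for gx, gy in gradients) / n
    jyy = sum(gy * gy for _, gy in gradients) / n
    return jxx, jxy, jyy

def eigen_analysis(jxx, jxy, jyy):
    """Closed-form eigenvalues and dominant-eigenvector angle of a symmetric 2x2 matrix."""
    mean = 0.5 * (jxx + jyy)
    radius = math.hypot(0.5 * (jxx - jyy), jxy)
    lam_max, lam_min = mean + radius, mean - radius
    angle = 0.5 * math.atan2(2.0 * jxy, jxx - jyy)  # direction of the dominant eigenvector
    return lam_max, lam_min, angle

# Hypothetical gradients of an image whose intensity varies only along 30 degrees.
phi = math.radians(30.0)
grads = [(a * math.cos(phi), a * math.sin(phi)) for a in (-2.0, -0.5, 1.0, 3.0)]
lam_max, lam_min, angle = eigen_analysis(*structure_tensor(grads))
```

Under the TLS reading, the dominant eigenvector gives the direction of the best-fitting line through the origin and the smaller eigenvalue is the residual of that fit. Here all gradients are collinear, so the recovered angle is 30 degrees and the residual is zero up to rounding. Since the tensor is an average of outer products, its eigenvalues cannot be negative, which is the positive semi-definiteness the note refers to.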

[277] Fast Voxel-Wise Kinetic Modeling in Dynamic PET using a Physics-Informed CycleGAN

Christian Salomonsen,Samuel Kuttner,Michael Kampffmeyer,Robert Jenssen,Kristoffer Wickstrøm,Jong Chul Ye,Elisabeth Wetzer

Main category: cs.CV

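TL;DR: Proposes a physics-informed CycleGAN method for dynamic PET quantification that reduces reliance on complex and invasive arterial input function estimation.

Details Motivation: To simplify arterial input function (AIF) estimation in dynamic contrast-enhanced MRI and dynamic PET quantification, avoiding its complexity and invasiveness. Method: Uses a physics-informed CycleGAN model that combines unpaired simulated AIFs with real DCE-MRI data to predict the AIF and generate parameter maps for dynamic PET. Result: Experiments show that the method predicts the AIF accurately and produces parameter maps that closely resemble the reference. Conclusion: The proposed physics-informed CycleGAN shows good potential for dynamic PET quantification and could replace conventional, complex AIF estimation. Abstract: Tracer kinetic modeling serves a vital role in diagnosis, treatment planning, tracer development and oncology, but burdens practitioners with complex and invasive arterial input function (AIF) estimation. We adapt a physics-informed CycleGAN, which has shown promise in DCE-MRI quantification, to dynamic PET quantification. Our experiments demonstrate sound AIF predictions and parameter maps closely resembling the reference.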

[278] DQ3D: Depth-guided Query for Transformer-Based 3D Object Detection in Traffic Scenarios

Ziyu Wang,Wenhao Li,Ji Wu

Main category: cs.CV

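TL;DR: This paper proposes a depth-guided query generator (DQ3D) that uses depth information and 2D detections to improve 3D object detection in traffic scenarios, fuses historical detection results to handle partial occlusion, and clearly outperforms the baseline on the nuScenes dataset.

Details Motivation: Existing methods that generate queries from 3D reference points often produce false detections because reference points lie far from the target object, and they struggle with objects that are partially occluded in the current frame. Method: Proposes a depth-guided query generator (DQ3D) that uses depth information and 2D detection results to ensure that query points lie on the surface or in the interior of objects; introduces a hybrid attention mechanism that fuses historical detection results with depth-guided queries to form hybrid queries. Result: On the nuScenes dataset, mAP improves by 6.3% and NDS by 4.3% over the baseline. Conclusion: DQ3D effectively improves the accuracy of 3D object detection, particularly in handling occlusion and reducing false detections. Abstract: 3D object detection from multi-view images in traffic scenarios has garnered significant attention in recent years. Many existing approaches rely on object queries that are generated from 3D reference points to localize objects. However, a limitation of these methods is that some reference points are often far from the target object, which can lead to false positive detections. In this paper, we propose a depth-guided query generator for 3D object detection (DQ3D) that leverages depth information and 2D detections to ensure that reference points are sampled from the surface or interior of the object. Furthermore, to address partially occluded objects in the current frame, we introduce a hybrid attention mechanism that fuses historical detection results with depth-guided queries, thereby forming hybrid queries. Evaluation on the nuScenes dataset demonstrates that our method outperforms the baseline by 6.3% in terms of mean Average Precision (mAP) and 4.3% in the NuScenes Detection Score (NDS).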
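
One way to obtain a reference point that lies on an object rather than far from it is to lift the center of a 2D detection into 3D using a depth estimate. The sketch below shows that geometry with a standard pinhole camera model. It is an illustration under assumptions, not the DQ3D query generator itself: the intrinsics, the box, and the depth value are hypothetical, and the abstract does not state how the authors sample points.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def box_center(box):
    """Center of a 2D box given as (u_min, v_min, u_max, v_max)."""
    u0, v0, u1, v1 = box
    return (0.5 * (u0 + u1), 0.5 * (v0 + v1))

# Hypothetical intrinsics, 2D detection, and predicted depth at the box center.
fx, fy, cx, cy = 1000.0, 1000.0, 800.0, 450.0
u, v = box_center((900.0, 400.0, 1100.0, 600.0))
reference_point = backproject(u, v, depth=20.0, fx=fx, fy=fy, cx=cx, cy=cy)
```

With these numbers the box center is pixel (1000, 500) and the lifted point is (4.0, 1.0, 20.0) in camera coordinates. A multi-view detector would additionally transform this point into a shared vehicle frame using the camera extrinsics.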

[279] Implicit Modeling for Transferability Estimation of Vision Foundation Models

Yaoyan Zheng,Huiqun Wang,Nan Zhou,Di Huang

Main category: cs.CV

TL;DR: Proposes ITM, a new transferability modeling framework that, combined with the DVA strategy for efficiently approximating embedding space evolution, significantly improves transferability estimation of pre-trained models for downstream tasks.

Details Motivation: Existing methods struggle to accurately assess the transferability of emerging pre-trained models with diverse architectures, training strategies, and task alignments. Method: Proposes the Implicit Transferability Modeling (ITM) framework and a Divide-and-Conquer Variational Approximation (DVA) strategy to efficiently approximate the evolution of the embedding space. Result: On a comprehensive benchmark covering many training regimes and model types, ITM outperforms existing methods in stability, effectiveness, and efficiency. Conclusion: ITM applies broadly across models and downstream tasks and significantly improves the accuracy and practicality of transferability estimation. Abstract: Transferability estimation identifies the best pre-trained models for downstream tasks without incurring the high computational cost of full fine-tuning. This capability facilitates deployment and advances the pre-training and fine-tuning paradigm. However, existing methods often struggle to accurately assess transferability for emerging pre-trained models with diverse architectures, training strategies, and task alignments. In this work, we propose Implicit Transferability Modeling (ITM), a novel framework that implicitly models each model's intrinsic transferability, coupled with a Divide-and-Conquer Variational Approximation (DVA) strategy to efficiently approximate embedding space evolution. This design enables generalization across a broader range of models and downstream tasks. Extensive experiments on a comprehensive benchmark--spanning extensive training regimes and a wider variety of model types--demonstrate that ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency.

[280] AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes

Sixian Liu,Chen Xu,Qiang Wang,Donghai Shi,Yiwen Li

Main category: cs.CV

TL;DR: Proposes Adaptive Gated Fusion (AG-Fusion), which selectively fuses camera and LiDAR features in a unified BEV space to make 3D object detection more robust in complex scenes, and builds a new challenging dataset, Excavator3D (E3D), for validation.

Details Motivation: Existing camera-LiDAR fusion methods degrade significantly in complex scenarios such as sensor degradation or environmental disturbances, so more robust multimodal fusion is needed. Method: Features from both modalities are projected into a unified bird's-eye-view (BEV) space and enhanced with a window-based attention mechanism; an adaptive gated fusion module based on cross-modal attention then dynamically selects reliable information to fuse. Result: Achieves 93.92% accuracy on the standard KITTI dataset and outperforms the baseline by 24.88% on the challenging self-built E3D dataset. Conclusion: AG-Fusion is highly robust to unreliable modality information in complex industrial scenes and significantly improves the stability and performance of multimodal 3D object detection. Abstract: Multimodal camera-LiDAR fusion technology has found extensive application in 3D object detection, demonstrating encouraging performance. However, existing methods exhibit significant performance degradation in challenging scenarios characterized by sensor degradation or environmental disturbances. We propose a novel Adaptive Gated Fusion (AG-Fusion) approach that selectively integrates cross-modal knowledge by identifying reliable patterns for robust detection in complex scenes. Specifically, we first project features from each modality into a unified BEV space and enhance them using a window-based attention mechanism. Subsequently, an adaptive gated fusion module based on cross-modal attention is designed to integrate these features into reliable BEV representations robust to challenging environments. Furthermore, we construct a new dataset named Excavator3D (E3D) focusing on challenging excavator operation scenarios to benchmark performance in complex conditions. Our method not only achieves competitive performance on the standard KITTI dataset with 93.92% accuracy, but also significantly outperforms the baseline by 24.88% on the challenging E3D dataset, demonstrating superior robustness to unreliable modal information in complex industrial scenes.
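
The general idea of gated fusion, a learned gate that decides per channel how much to trust each modality, can be sketched as follows. This is a simplified stand-in, not the AG-Fusion module: the paper derives its gate from cross-modal attention, whereas this sketch uses a scalar linear gate, and all weights and feature values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(cam, lidar, w_cam, w_lidar, bias):
    """Blend two feature vectors with a per-channel gate g in (0, 1)."""
    fused = []
    for c, l in zip(cam, lidar):
        g = sigmoid(w_cam * c + w_lidar * l + bias)  # gate from both modalities
        fused.append(g * c + (1.0 - g) * l)
    return fused

cam_cell = [0.2, 1.5, -0.7]     # hypothetical camera BEV features for one cell
lidar_cell = [0.9, 1.1, 0.3]    # hypothetical LiDAR BEV features for the same cell
fused = gated_fusion(cam_cell, lidar_cell, w_cam=1.0, w_lidar=-1.0, bias=0.0)
```

Because the gate lies strictly between 0 and 1, each fused channel is a convex combination of the camera and LiDAR values, so a degraded modality can be down-weighted without being discarded outright.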

[281] Finding 3D Scene Analogies with Multimodal Foundation Models

Junho Kim,Young Min Kim

Main category: cs.CV

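TL;DR: This paper proposes a zero-shot, open-vocabulary method based on multimodal foundation models for finding analogies between 3D scenes, using a hybrid neural representation for coarse-to-fine scene alignment, and demonstrates applications in trajectory and waypoint transfer.

Details Motivation: Existing 3D scene analogy methods require additional training and are limited to fixed object vocabularies, which makes them hard to apply in new environments; this work aims at training-free, open-vocabulary, zero-shot scene alignment so that robots can better reuse prior experience in new environments. Method: Proposes a hybrid neural representation that combines a sparse graph based on vision-language model features with a feature field derived from 3D shape foundation models; the sparse graphs are aligned first for coarse matching, and the feature fields then refine the correspondence, finding 3D scene analogies in a zero-shot, open-vocabulary setting. Result: The method establishes accurate correspondences between complex 3D scenes, achieves effective trajectory and waypoint transfer across several scenes, and matches or outperforms existing methods that require training. Conclusion: The proposed method achieves training-free, open-vocabulary 3D scene analogies with potential for trajectory and task transfer, providing an effective tool for cross-scene robot adaptation and planning. Abstract: Connecting current observations with prior experiences helps robots adapt and plan in new, unseen 3D environments. Recently, 3D scene analogies have been proposed to connect two 3D scenes, which are smooth maps that align scene regions with common spatial relationships. These maps enable detailed transfer of trajectories or waypoints, potentially supporting demonstration transfer for imitation learning or task plan transfer across scenes. However, existing methods for the task require additional training and fixed object vocabularies. In this work, we propose to use multimodal foundation models for finding 3D scene analogies in a zero-shot, open-vocabulary setting. Central to our approach is a hybrid neural representation of scenes that consists of a sparse graph based on vision-language model features and a feature field derived from 3D shape foundation models. 3D scene analogies are then found in a coarse-to-fine manner, by first aligning the graph and refining the correspondence with feature fields. Our method can establish accurate correspondences between complex scenes, and we showcase applications in trajectory and waypoint transfer.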

[282] Evaluation of Vision-LLMs in Surveillance Video

Pascal Benschop,Cristian Meo,Justin Dauwels,Jelte P. Mense

Main category: cs.CV

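TL;DR: This paper studies the spatial reasoning ability of vision-language models (VLMs) in zero-shot, language-grounded anomalous action recognition, detecting anomalous events by converting video into text descriptions and scoring labels via textual entailment. It evaluates four open models on UCF-Crime and RWF-2000 and finds that current models do well on simple, spatially salient events but degrade when spatial cues are noisy or identities are obscured; it also outlines practical ways to strengthen spatial grounding.

Details Motivation: The widespread use of cameras produces far more video than humans can monitor, so anomalous or criminal events must be detected automatically to improve public safety; because an embodied agent's ability to recognize unexpected events depends on its spatial reasoning, the potential of VLMs for this task needs to be explored. Method: Anomalous action recognition is framed as a zero-shot, language-grounded task: sparse 2D video is converted into text descriptions, and small pre-trained vision-LLMs score labels via textual entailment to detect anomalies; four open models are evaluated on UCF-Crime and RWF-2000, testing the effects of prompt engineering and privacy-preserving methods. Result: Few-shot exemplars improve accuracy for some models but may increase false positives, while privacy filters (especially full-body GAN transforms) introduce inconsistencies and reduce accuracy; current VLMs perform well on simple, spatially salient events and poorly when spatial cues are noisy or identities are obscured. Conclusion: Zero-shot, language-grounded VLM pipelines can serve as adaptable building blocks for embodied, real-world video understanding; spatial grounding can be strengthened in future work through structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors, and privacy methods that preserve action-relevant geometry. Abstract: The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability for an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision--LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models, but may increase false positives, and privacy filters -- especially full-body GAN transforms -- introduce inconsistencies that degrade accuracy. These results chart where current vision--LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). 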
Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition

[283] DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification

Lukas Bierling,Davide Pasero,Fleur Dolmans,Helia Ghasemi,Angelo Broere

Main category: cs.CV

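TL;DR: This paper proposes DecoDINO, a three-branch network for human-object contact prediction that builds on DECO and improves the handling of difficult cases such as soft surfaces, occlusions, and children; it significantly raises the binary-contact F1 score, lowers the geodesic error, and adds semantic label outputs.

Details Motivation: Existing methods such as DECO support only binary contact maps and perform poorly on soft surfaces, occlusions, children, and false-positive foot contacts, which falls short of what high-fidelity interaction models need. Method: Builds a three-branch network on the DECO framework, using two DINOv2 ViT-g/14 encoders, class-balanced loss weighting, and patch-level cross-attention; vertex features are passed through a lightweight MLP with a softmax to output semantic contact labels; a vision-language model was tried but not adopted. Result: On the DAMON benchmark, DecoDINO raises the binary-contact F1 score by 7%, halves the geodesic error, and supports object-level semantic labels; ablations show that LoRA fine-tuning and the dual encoders are the key factors; it outperforms the baseline in both tasks of the DAMON Challenge. Conclusion: DecoDINO significantly improves the accuracy and semantic richness of human-object contact prediction in the wild, providing a more reliable basis for AR/VR, robotics, and behavioral simulation. Abstract: Accurate vertex-level contact prediction between humans and surrounding objects is a prerequisite for high-fidelity human-object interaction models used in robotics, AR/VR, and behavioral simulation. DECO was the first in-the-wild estimator for this task but is limited to binary contact maps and struggles with soft surfaces, occlusions, children, and false-positive foot contacts. We address these issues and introduce DecoDINO, a three-branch network based on DECO's framework. It uses two DINOv2 ViT-g/14 encoders, class-balanced loss weighting to reduce bias, and patch-level cross-attention for improved local reasoning. Vertex features are finally passed through a lightweight MLP with a softmax to assign semantic contact labels. We also tested a vision-language model (VLM) to integrate text features, but the simpler architecture performed better and was used instead. On the DAMON benchmark, DecoDINO (i) raises the binary-contact F1 score by 7%, (ii) halves the geodesic error, and (iii) augments predictions with object-level semantic labels. Ablation studies show that LoRA fine-tuning and the dual encoders are key to these improvements. DecoDINO outperformed the challenge baseline in both tasks of the DAMON Challenge. Our code is available at https://github.com/DavidePasero/deco/tree/main.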
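
Class-balanced loss weighting can take several forms, and the abstract does not say which one DecoDINO uses. The sketch below assumes inverse-frequency weights normalized to average 1, applied to a per-vertex cross-entropy; the class names, counts, and probabilities are hypothetical.

```python
import math

def class_weights(counts):
    """Inverse-frequency weights, normalized so they average to 1."""
    inv = [1.0 / c for c in counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one vertex, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])

# Hypothetical vertex counts per class: no contact, chair, floor.
weights = class_weights([9000, 300, 700])
loss_rare = weighted_cross_entropy([0.2, 0.7, 0.1], label=1, weights=weights)
loss_common = weighted_cross_entropy([0.7, 0.2, 0.1], label=0, weights=weights)
```

Both vertices are predicted with the same confidence (0.7 on the true class), yet the rare "chair" vertex contributes a larger loss than the common "no contact" vertex, which counteracts the bias toward predicting the majority class.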

[284] VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting

Hoonhee Cho,Jae-Young Kang,Giwon Lee,Hyemin Yang,Heejun Park,Seokwoo Jung,Kuk-Jin Yoon

Main category: cs.CV

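TL;DR: Proposes VR-Drive, an end-to-end autonomous driving framework that jointly learns 3D scene reconstruction to enable planning-aware view synthesis, improving generalization and robustness across camera viewpoints.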

Details Motivation: Camera viewpoints vary with vehicle configuration, which undermines the robustness of end-to-end autonomous driving; the goal is to improve driving decisions under novel viewpoints. Method: The VR-Drive framework jointly learns 3D scene reconstruction as an auxiliary task and adopts a feed-forward inference strategy that supports online training-time augmentation. A viewpoint-mixed memory bank and a viewpoint-consistent distillation strategy strengthen spatio-temporal consistency across viewpoints. Result: VR-Drive improves planning under novel viewpoints and mitigates synthesis-induced noise. A new benchmark dataset for evaluating E2E-AD under novel viewpoints is also released. Conclusion: VR-Drive is a scalable and robust end-to-end driving solution that copes with the varied camera viewpoints found in the real world. Abstract: End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic, data-driven framework. However, achieving robustness to varying camera viewpoints, a common real-world challenge due to diverse vehicle configurations, remains an open problem. In this work, we propose VR-Drive, a novel E2E-AD framework that addresses viewpoint generalization by jointly learning 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis. Unlike prior scene-specific synthesis approaches, VR-Drive adopts a feed-forward inference strategy that supports online training-time augmentation from sparse views without additional annotations. To further improve viewpoint consistency, we introduce a viewpoint-mixed memory bank that facilitates temporal interaction across multiple viewpoints and a viewpoint-consistent distillation strategy that transfers knowledge from original to synthesized views. Trained in a fully end-to-end manner, VR-Drive effectively mitigates synthesis-induced noise and improves planning under viewpoint shifts. In addition, we release a new benchmark dataset to evaluate E2E-AD performance under novel camera viewpoints, enabling comprehensive analysis. Our results demonstrate that VR-Drive is a scalable and robust solution for the real-world deployment of end-to-end autonomous driving systems.

[285] Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment

Hongyi Wang,Zhengjie Zhu,Jiabo Ma,Fang Wang,Yue Shi,Bo Luo,Jili Wang,Qiuyu Cai,Xiuming Zhang,Yen-Wei Chen,Lanfen Lin,Hao Chen

Main category: cs.CV

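TL;DR: PathSearch is a pathology image retrieval framework that combines fine-grained mosaic representations with global slide embeddings. Using vision-language contrastive learning, it supports efficient image-to-image and cross-modal text-to-image retrieval, performs well on several public and in-house datasets, and improves pathologists' diagnostic accuracy and agreement.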

Details Motivation: The gigapixel scale of whole slide images (WSIs) and their subtle semantic differences make effective retrieval difficult. Existing methods struggle to capture both fine detail and high-level semantics, so a unified framework that captures morphological detail and supports flexible retrieval is needed. Method: The PathSearch framework combines fine-grained attentive mosaic representations with global slide embeddings, aligned through vision-language contrastive learning. It is trained on 6,926 slide-report pairs and supports mosaic-based image retrieval and multi-modal retrieval from text queries. Result: PathSearch is validated on four public datasets and three in-house cohorts, covering anatomical site retrieval, tumor subtyping, tumor vs. non-tumor discrimination, and grading. It outperforms traditional image retrieval methods and, in a multi-center reader study, improves diagnostic accuracy, confidence, and inter-observer agreement. Conclusion: PathSearch is a scalable and general retrieval solution for digital pathology that integrates local detail with global semantics and supports precise diagnosis, better consistency, and example-based teaching. Abstract: The rapid digitization of histopathology slides has opened up new possibilities for computational tools in clinical and research workflows. Among these, content-based slide retrieval stands out, enabling pathologists to identify morphologically and semantically similar cases, thereby supporting precise diagnoses, enhancing consistency across observers, and assisting example-based education. However, effective retrieval of whole slide images (WSIs) remains challenging due to their gigapixel scale and the difficulty of capturing subtle semantic differences amid abundant irrelevant content. To overcome these challenges, we present PathSearch, a retrieval framework that unifies fine-grained attentive mosaic representations with global-wise slide embeddings aligned through vision-language contrastive learning. Trained on a corpus of 6,926 slide-report pairs, PathSearch captures both fine-grained morphological cues and high-level semantic patterns to enable accurate and flexible retrieval. The framework supports two key functionalities: (1) mosaic-based image-to-image retrieval, ensuring accurate and efficient slide research; and (2) multi-modal retrieval, where text queries can directly retrieve relevant slides. PathSearch was rigorously evaluated on four public pathology datasets and three in-house cohorts, covering tasks including anatomical site retrieval, tumor subtyping, tumor vs. non-tumor discrimination, and grading across diverse organs such as breast, lung, kidney, liver, and stomach. 
External results show that PathSearch outperforms traditional image-to-image retrieval frameworks. A multi-center reader study further demonstrates that PathSearch improves diagnostic accuracy, boosts confidence, and enhances inter-observer agreement among pathologists in real clinical scenarios. These results establish PathSearch as a scalable and generalizable retrieval solution for digital pathology.

[286] Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions

Razaib Tariq,Minji Heo,Simon S. Woo,Shahroz Tariq

Main category: cs.CV

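TL;DR: This study systematically evaluates state-of-the-art deepfake detectors on Moiré-affected videos and finds that Moiré patterns can reduce detection accuracy by up to 25.4%, while demoiréing methods reduce accuracy further. The authors build the DeepMoiréFake (DMF) dataset with real and synthetic Moiré patterns and stress the need for robust detectors that handle real-world distortions such as Moiré patterns, compression, and blurring.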

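Details Motivation: Moiré patterns are a common visual artifact when a smartphone films content on a digital screen, and they are widespread in real-world footage, yet their effect on deepfake detection has been little studied. Existing detectors are mostly evaluated under ideal conditions and lack robustness to this kind of distortion, so their performance drops sharply; the problem needs to be measured and addressed. Method: The authors collected 12,832 videos (35.64 hours in total) from five mainstream datasets (Celeb-DF, DFD, DFDC, UADFV, and FF++), recaptured under varied real-world conditions (different screens, phones, lighting, and angles) to introduce real Moiré patterns. They built the new DeepMoiréFake (DMF) dataset, ran supplementary experiments with two synthetic Moiré generation techniques, and evaluated the effect of Moiré patterns and of demoiréing methods on 15 state-of-the-art detectors. Result: Real Moiré patterns degrade the 15 top detectors by up to 25.4%, and synthetic Moiré patterns cause a 21.4% drop in accuracy. Unexpectedly, applying demoiréing reduces accuracy further, by up to 17.2%, which suggests current mitigation strategies can backfire. Conclusion: Moiré patterns substantially weaken existing deepfake detectors, and current demoiréing techniques fail to help or even make things worse, exposing how fragile these models are in realistic settings. Robust detectors that jointly handle Moiré patterns, compression, blurring, and other degradations are needed, and the DMF dataset is offered as a resource for moving from the lab to practical use. Abstract: Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moiré artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moiré-affected videos, an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from the Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moiré patterns on deepfake detection, we conducted additional experiments using our DeepMoiréFake (DMF) dataset and two synthetic Moiré generation techniques. Across 15 top-performing detectors, our results show that Moiré artifacts degrade performance by as much as 25.4%, while synthetically generated Moiré patterns lead to a 21.4% drop in accuracy. Surprisingly, demoiréing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 17.2%. These findings underscore the urgent need for detection models that can robustly handle Moiré distortions alongside other real-world challenges, such as compression, sharpening, and blurring. 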
By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.

[287] Autoregressive Styled Text Image Generation, but Make it Reliable

Carmine Zaccagnino,Fabio Quattrini,Vittorio Pippi,Silvia Cascianelli,Alessio Tonioni,Rita Cucchiara

Main category: cs.CV

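TL;DR: Proposes Eruku, a new handwritten text generation method that uses a multimodal prompt-conditioned generation framework and a classifier-free guidance strategy to improve content controllability and style fidelity.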

Details Motivation: Existing autoregressive methods require additional inputs, lack a stop mechanism, and tend to produce repetitions and visual artifacts, which makes high-quality handwritten text generation hard to achieve. Method: Handwritten text generation is framed as a multimodal prompt-conditioned generation task. Special textual input tokens are introduced for better alignment with the visual tokens, and a classifier-free guidance strategy is applied to the autoregressive generation process. Result: Compared with prior methods, Eruku needs fewer inputs, generalizes better to unseen styles, and follows the textual prompt more faithfully, improving content adherence. Conclusion: Eruku outperforms existing methods in reducing input requirements, generalizing across styles, and controlling content, offering a better solution for styled text image generation. Abstract: Generating faithful and readable styled text images (especially for Styled Handwritten Text generation - HTG) is an open problem with several possible applications across graphic design, document understanding, and image editing. A lot of research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, with promising results in terms of style fidelity and generalization achieved by the recently proposed Autoregressive Transformer paradigm for HTG. However, this method requires additional inputs, lacks a proper stop mechanism, and might end up in repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, compared to previous solutions requires fewer inputs, generalizes better to unseen styles, and follows more faithfully the textual prompt, improving content adherence.
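
The abstract does not say how Eruku applies classifier-free guidance, so the following is only a minimal sketch of the standard formulation for an autoregressive model: next-token logits from a conditional and an unconditional pass are blended before sampling. The function names and the toy logits are hypothetical and are not taken from the paper.

```python
import math

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Standard classifier-free guidance: move the unconditional logits
    toward the conditional ones by a factor of guidance_scale."""
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits over a 3-token vocabulary (assumed values).
cond = [2.0, 0.5, -1.0]    # pass with the text/style prompt
uncond = [1.0, 1.0, 0.0]   # pass with the prompt dropped
guided = cfg_logits(cond, uncond, guidance_scale=2.0)
probs = softmax(guided)
```

With a scale of 1.0 the blend reduces to the conditional logits; larger scales push the distribution further toward tokens favored by the prompt.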

[288] Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation

Stefan M. Fischer,Johannes Kiechle,Laura Daza,Lina Felsner,Richard Osuala,Daniel M. Lang,Karim Lekadir,Jan C. Peeken,Julia A. Schnabel

Main category: cs.CV

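TL;DR: This paper proposes Progressive Growing of Patch Size, an automatic curriculum learning method for 3D medical image segmentation that gradually increases the patch size during training, improving class balance and accelerating convergence.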

Details Motivation: Conventional constant patch size sampling trains inefficiently and limits performance on class-imbalanced tasks such as lesion segmentation, so a more efficient and general training strategy is needed. Method: The input patch size is increased dynamically and gradually during training, forming a curriculum: small patches early on improve class balance and training stability, and larger patches later capture more context. Result: The method is validated on 15 different 3D medical image segmentation tasks. The resource-efficient mode reduces training time to 44% with comparable performance. The performance mode yields a relative mean Dice gain of 1.28% while reducing training time to 89%, and it surpasses the baseline on all tasks, most clearly on highly imbalanced ones. The method also reduces performance variance, which makes model comparisons more trustworthy. Conclusion: Progressive Growing of Patch Size is a simple, effective, general strategy that is compatible with segmentation architectures such as UNet, UNETR, and SwinUNETR and improves both segmentation performance and training efficiency. Abstract: In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. 
Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
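
The abstract states only that the patch size grows during training; it does not give the schedule. As a rough illustration, the sketch below assumes equally long stages with linearly spaced cubic patch edge lengths. The function name, the stage count, and the sizes are all hypothetical choices for this example.

```python
def patch_size_schedule(min_size, max_size, total_steps, num_stages):
    """Return a function mapping a training step to a patch edge length.

    Assumption: training is split into num_stages equal stages and the
    edge length is spaced linearly from min_size to max_size.
    """
    def size_at(step):
        # Clamp so that step == total_steps still maps to the last stage.
        stage = min(step * num_stages // total_steps, num_stages - 1)
        if num_stages == 1:
            return max_size
        return min_size + (max_size - min_size) * stage // (num_stages - 1)
    return size_at

# Example: grow from 32^3 to 128^3 voxels in 4 stages over 1000 steps.
size_at = patch_size_schedule(min_size=32, max_size=128,
                              total_steps=1000, num_stages=4)
sizes = [size_at(s) for s in (0, 250, 500, 750, 999)]
```

A data loader would call `size_at(step)` before sampling each batch, so early batches contain small patches centered on foreground regions and later batches cover more context.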

[289] A Video Is Not Worth a Thousand Words

Sam Pollard,Michael Wray

Main category: cs.CV

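TL;DR: Proposes Shapley-value-based feature attributions and modality scores for evaluating multimodal models on multiple-choice video question answering, and finds that models over-rely on text and that the task devolves into the ability to ignore distractors.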

Details Motivation: As reliance on vision language models (VLMs) grows, research has focused on harder datasets and longer model context lengths, but there are concerns about text dominance, and interactions between modalities remain underexplored. New methods are needed to measure whether multimodal complexity is developing in the right direction. Method: Arbitrarily definable feature attributions and modality scores are computed from Shapley values. Video frames and textual elements are treated as features at the same level, and multiple-choice VQA is modeled as an interaction between three modalities: video, question, and answer. Six VLMs are compared on four datasets. Result: Current VLMs depend heavily on textual information, and multiple-choice VQA performance is driven mainly by a model's ability to ignore distractor options rather than by genuine multimodal understanding. Conclusion: Current multimodal models are text-dominated on multiple-choice video QA, and the task design may lead models to learn only to ignore distractors; evaluation needs rethinking to drive genuine cross-modal understanding. Abstract: As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, there is a significant amount of research devoted to increasing both the difficulty of video question answering (VQA) datasets, and the context lengths of the models that they evaluate. The reliance on large language models as backbones has led to concerns about potential text dominance, and the exploration of interactions between modalities is underdeveloped. How do we measure whether we're heading in the right direction, with the complexity that multi-modal models introduce? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and modalities are arbitrarily definable. Using these metrics, we compare 6 VLM models of varying context lengths on 4 representative datasets, focusing on multiple-choice VQA. In particular, we consider video frames and whole textual elements as equal features in the hierarchy, and the multiple-choice VQA task as an interaction between three modalities: video, question and answer. Our results demonstrate a dependence on text and show that the multiple-choice VQA task devolves into a model's ability to ignore distractors. Code available at https://github.com/sjpollard/a-video-is-not-worth-a-thousand-words.
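
To make the Shapley-value idea concrete, here is a small sketch of exact Shapley attribution over a handful of features. It is not the authors' implementation (see their repository for that); the `value_fn` below is a made-up stand-in for "model score when only these inputs are kept", and the three coarse features are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: the weighted average marginal contribution
    of each feature over all subsets of the remaining features."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[f] += weight * (value_fn(s | {f}) - value_fn(s))
    return phi

# Hypothetical value function: the score is high only when question and
# answer text are both visible; video adds a small independent bonus.
def value_fn(kept):
    score = 1.0 if {"question", "answer"} <= kept else 0.0
    return score + (0.2 if "video" in kept else 0.0)

phi = shapley_values(["video", "question", "answer"], value_fn)
```

In this toy setup the two text features split the large interaction term equally and video receives only its small additive share, which is the kind of text-dominance pattern the attribution is meant to expose. Exact enumeration is exponential in the number of features, so real use with many frames would need sampling.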

[290] hYOLO Model: Enhancing Object Classification with Hierarchical Context in YOLOv8

Veska Tsenkova,Peter Stanchev,Daniel Petrov,Deyan Lazarov

Main category: cs.CV

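TL;DR: This paper proposes an end-to-end hierarchical image detection and classification model built on the YOLO model family. A new hierarchical architecture, a modified loss function, and a performance metric designed for hierarchical structure let it exploit the natural hierarchy of real-world objects.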

Details Motivation: Existing CNN classification methods focus mainly on flat classification and ignore the natural hierarchical relations between objects, even though those relations can improve classification performance and contextual understanding and help control the severity of mistakes. Method: A hierarchical model is built on top of YOLO, with a new hierarchical network architecture, a modified loss function, and an evaluation metric suited to hierarchical classification. It is trained and evaluated on two different hierarchical categorizations of the same dataset. Result: The proposed method captures the hierarchical structure among real-world objects that conventional flat classification overlooks; the Chinese summary additionally reports that it performs better under the categorization that accounts for visual similarity. Conclusion: The hierarchical model represents real-world category relations better, improving the accuracy and semantic plausibility of image classification and detection. Abstract: Current convolutional neural network (CNN) classification methods are predominantly focused on flat classification which aims solely to identify a specified object within an image. However, real-world objects often possess a natural hierarchical organization that can significantly help classification tasks. Capturing the presence of relations between objects enables better contextual understanding as well as control over the severity of mistakes. Considering these aspects, this paper proposes an end-to-end hierarchical model for image detection and classification built upon the YOLO model family. A novel hierarchical architecture, a modified loss function, and a performance metric tailored to the hierarchical nature of the model are introduced. The proposed model is trained and evaluated on two different hierarchical categorizations of the same dataset: a systematic categorization that disregards visual similarities between objects and a categorization accounting for common visual characteristics across classes. The results illustrate how the suggested methodology addresses the inherent hierarchical structure present in real-world objects, which conventional flat classification algorithms often overlook.

[291] Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling

Ruoyu Wang,Beier Zhu,Junzhi Li,Liangyu Yuan,Chi Zhang

Main category: cs.CV

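TL;DR: This paper proposes AdaSDE, a new single-step SDE solver that introduces a learnable coefficient to dynamically regulate the error correction strength. It combines the efficiency of ODEs with the error resilience of SDEs and achieves state-of-the-art generation quality at low sampling step counts.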

Details Motivation: Diffusion generative models trade computational speed against sample quality, and existing ODE and SDE solvers suffer from accumulated gradient error and amplified discretization error, respectively. Method: AdaSDE is a single-step SDE solver that introduces a learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength; it can be combined with existing solvers. Result: At 5 network function evaluations (NFE), AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ, and 6.96 on LSUN Bedroom, outperforming existing methods. Conclusion: AdaSDE balances generation efficiency and sample quality; its dynamic error regulation improves low-step sampling, and it is general and easy to integrate. Abstract: Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Code is available at https://github.com/WLU-wry02/AdaSDE.

[292] MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection

Haochen Zhao,Yuyao Kong,Yongxiu Xu,Gaopeng Gou,Hongbo Xu,Yubin Wang,Haoliang Zhang

Main category: cs.CV

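TL;DR: This paper introduces MMSD3.0, a new multi-image sarcasm detection benchmark, and the Cross-Image Reasoning Model (CIRM), which captures latent semantic relations across images. Combined with a fine-grained cross-modal fusion mechanism, it achieves state-of-the-art performance in both single-image and multi-image settings.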

Details Motivation: Existing sarcasm detection research focuses mainly on single-image scenarios and overlooks semantic and affective relations across multiple images, so it cannot handle real-world sarcasm triggered by multi-image cues. A more realistic multi-image benchmark and modeling approach is needed. Method: The authors build the MMSD3.0 multi-image dataset and the Cross-Image Reasoning Model (CIRM), which uses targeted cross-image sequence modeling to capture latent inter-image connections. A relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence reduces information loss. Result: Experiments on MMSD, MMSD2.0, and MMSD3.0 show that CIRM achieves the best performance in both single-image and multi-image settings, and that MMSD3.0 reflects real-world conditions more faithfully as a benchmark. Conclusion: MMSD3.0 fills the gap in multi-image sarcasm detection, and CIRM, by combining cross-image reasoning with fine-grained fusion, improves multimodal sarcasm detection and applies broadly. Abstract: Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.

[293] MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification

Yingying Feng,Jie Li,Jie Hu,Yukang Zhang,Lei Tan,Jiayi Ji

Main category: cs.CV

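TL;DR: Proposes MDReID, a flexible any-to-any image-level re-identification framework that works in both modality-matched and modality-mismatched scenarios.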

Details Motivation: Real-world object re-identification systems often face modality inconsistency, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). Most existing methods assume matched modalities, which limits robustness and scalability in practice. Method: Modality Decoupling Learning (MDL) decomposes modality features into shared and specific parts, and Modality-aware Metric Learning (MML) strengthens cross-modal discriminative power. Result: On three multi-modal ReID benchmarks (RGBNT201, RGBNT100, and MSVR310), MDReID improves mAP by 9.8%, 3.0%, and 11.5% in modality-matched scenarios and by an average of 3.4%, 11.8%, and 10.9% in mismatched scenarios. Conclusion: MDReID handles modality inconsistency effectively and shows strong performance and broad applicability across ReID scenarios. Abstract: Real-world object re-identification (ReID) systems often face modality inconsistencies, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). However, most existing methods assume modality-matched conditions, which limits their robustness and scalability in practical applications. To address this challenge, we propose MDReID, a flexible any-to-any image-level ReID framework designed to operate under both modality-matched and modality-mismatched scenarios. MDReID builds on the insight that modality information can be decomposed into two components: modality-shared features that are predictable and transferable, and modality-specific features that capture unique, modality-dependent characteristics. To effectively leverage this, MDReID introduces two key components: the Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML). Specifically, MDL explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enforces orthogonality and complementarity between the two components to enhance discriminative power across modalities. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDReID. 
Notably, MDReID achieves significant mAP improvements of 9.8%, 3.0%, and 11.5% in general modality-matched scenarios, and average gains of 3.4%, 11.8%, and 10.9% in modality-mismatched scenarios, respectively. The code is available at: https://github.com/stone96123/MDReID.

[294] ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation

Jiahao Chang,Chongjie Ye,Yushuang Wu,Yuantao Chen,Yidan Zhang,Zhongjin Luo,Chenghong Li,Yihao Zhi,Xiaoguang Han

Main category: cs.CV

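TL;DR: This paper proposes ReconViaGen, which integrates reconstruction priors into a generative framework to address the poor consistency of diffusion-based 3D generation in multi-view 3D reconstruction. It reconstructs complete and accurate 3D models whose global structure and local details are consistent with the input views.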

Details Motivation: Existing multi-view 3D reconstruction methods rely on sufficient overlap between views and produce severely incomplete results under occlusion and sparse coverage. Diffusion generative models can hallucinate the missing parts, but their stochasticity makes the results inconsistent and unreliable. Generation therefore needs to be more controllable and consistent before generative priors can be used effectively. Method: The paper analyzes why diffusion models are inconsistent in 3D reconstruction: (a) cross-view connections are insufficiently built when extracting multi-view features, and (b) the denoising process is poorly controllable when generating local detail. ReconViaGen integrates reconstruction priors into the generative model and adds several strategies that strengthen cross-view consistency and detail controllability. Result: Experiments show that ReconViaGen produces complete and accurate 3D models that are consistent with the input views in global structure and in local geometric and texture detail, outperforming prior methods. Conclusion: ReconViaGen combines reconstruction priors with diffusion generation, addresses the consistency and controllability problems, and offers an effective way to use generative priors for high-quality multi-view 3D reconstruction. Abstract: Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstruction incompleteness. Recent advancements in diffusion-based 3D generative techniques offer the potential to address these limitations by leveraging learned generative priors to hallucinate invisible parts of objects, thereby generating plausible 3D structures. However, the stochastic nature of the inference process limits the accuracy and reliability of generation results, preventing existing reconstruction frameworks from integrating such 3D generative priors. In this work, we comprehensively analyze the reasons why diffusion-based 3D generative methods fail to achieve high consistency, including (a) the insufficiency in constructing and leveraging cross-view connections when extracting multi-view image features as conditions, and (b) the poor controllability of iterative denoising during local detail generation, which easily leads to plausible but inconsistent fine geometric and texture details with inputs. Accordingly, we propose ReconViaGen to innovatively integrate reconstruction priors into the generative framework and devise several strategies that effectively address these issues. Extensive experiments demonstrate that our ReconViaGen can reconstruct complete and accurate 3D models consistent with input views in both global structure and local details. Project page: https://jiahao620.github.io/reconviagen.

[295] Multitask Multimodal Self-Supervised Learning for Medical Images

Cristian Simionescu

Main category: cs.CV

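TL;DR: This thesis proposes Medformer, a new neural network architecture that uses self-supervised learning and domain adaptation to reduce the dependence of medical image analysis on large labeled datasets.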

Details Motivation: Labeled medical image datasets are limited by the cost of expert annotation and by privacy and legal constraints, which makes deep learning models hard to train. Reducing the dependence on large labeled datasets is therefore a pressing need. Method: The Medformer architecture supports multitask learning and deep domain adaptation. It uses self-supervised pre-training and novel pretext tasks to extract meaningful features from unlabeled data, and it is validated on datasets including MedMNIST. Result: Medformer handles medical images of different modalities and sizes (such as 2D X-rays and 3D MRIs), generalizes well across several downstream tasks, and reduces the need for labeled data. Conclusion: The work provides a scalable, adaptable deep learning framework for medical image analysis and supports the development of more efficient and accurate diagnostic tools. Abstract: This thesis works to address a pivotal challenge in medical image analysis: the reliance on extensive labeled datasets, which are often limited due to the need for expert annotation and constrained by privacy and legal issues. By focusing on the development of self-supervised learning techniques and domain adaptation methods, this research aims to circumvent these limitations, presenting a novel approach to enhance the utility and efficacy of deep learning in medical imaging. Central to this thesis is the development of the Medformer, an innovative neural network architecture designed for multitask learning and deep domain adaptation. This model is adept at pre-training on diverse medical image datasets, handling varying sizes and modalities, and is equipped with a dynamic input-output adaptation mechanism. This enables efficient processing and integration of a wide range of medical image types, from 2D X-rays to complex 3D MRIs, thus mitigating the dependency on large labeled datasets. Further, the thesis explores the current state of self-supervised learning in medical imaging. It introduces novel pretext tasks that are capable of extracting meaningful information from unlabeled data, significantly advancing the model's interpretative abilities. This approach is validated through rigorous experimentation, including the use of the MedMNIST dataset, demonstrating the model's proficiency in learning generalized features applicable to various downstream tasks. In summary, this thesis contributes to the advancement of medical image analysis by offering a scalable, adaptable framework that reduces reliance on labeled data. 
It paves the way for more accurate, efficient diagnostic tools in healthcare, signifying a major step forward in the application of deep learning in medical imaging.

[296] Interpretable Tile-Based Classification of Paclitaxel Exposure

Sean Fletcher,Gabby Scott,Douglas Currie,Xin Zhang,Yuqi Song,Bruce MacLeod

Main category: cs.CV

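TL;DR: Proposes a classification method based on local image tiles that substantially improves the accuracy of paclitaxel exposure classification and uses visualization analyses to make the model more interpretable.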

Details Motivation: Conventional full-image models struggle to capture the subtle cellular changes caused by paclitaxel exposure, so a more effective classification method is needed. Method: A tiling-and-aggregation strategy models local regions of phase-contrast microscopy images and combines the per-tile outputs into a label for the whole image. Result: The method reaches state-of-the-art accuracy on the benchmark dataset, about 20 percentage points above the published baseline, and cross-validation confirms the trend. Conclusion: Tiling improves classification performance on these medical images, and the visualization analyses help explain model behavior and point to directions for future robustness research. Abstract: Medical image analysis is central to drug discovery and preclinical evaluation, where scalable, objective readouts can accelerate decision-making. We address classification of paclitaxel (Taxol) exposure from phase-contrast microscopy of C6 glioma cells, a task with subtle dose differences that challenges full-image models. We propose a simple tiling-and-aggregation pipeline that operates on local patches and combines tile outputs into an image label, achieving state-of-the-art accuracy on the benchmark dataset and improving over the published baseline by around 20 percentage points, with trends confirmed by cross-validation. To understand why tiling is effective, we further apply Grad-CAM, Score-CAM, and attention analyses, which enhance model interpretability and point toward robustness-oriented directions for future medical image research. Code is released to facilitate reproduction and extension.
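
The abstract does not specify the tile size or how tile outputs are combined, so the sketch below assumes non-overlapping square tiles and simple averaging of per-tile class probabilities. The helper names are hypothetical, and the per-tile probabilities stand in for the output of whatever tile classifier is used.

```python
def tile_image(image, tile):
    """Split a 2D image (list of rows) into non-overlapping tile x tile
    patches in row-major order; ragged borders are dropped."""
    h, w = len(image), len(image[0])
    return [
        [row[c:c + tile] for row in image[r:r + tile]]
        for r in range(0, h - tile + 1, tile)
        for c in range(0, w - tile + 1, tile)
    ]

def aggregate(tile_probs):
    """Average per-tile class probabilities (an assumed aggregation rule)
    and return the winning class index with the mean probabilities."""
    n = len(tile_probs)
    num_classes = len(tile_probs[0])
    mean = [sum(p[k] for p in tile_probs) / n for k in range(num_classes)]
    label = max(range(num_classes), key=mean.__getitem__)
    return label, mean

# Toy 4x4 "image" with pixel values 0..15, split into four 2x2 tiles.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = tile_image(image, tile=2)

# Pretend a tile classifier produced these two-class probabilities.
label, mean = aggregate([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
```

Averaging lets many weakly informative tiles outvote a few misleading ones; majority voting over tile labels would be an equally plausible alternative under the same assumptions.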

[297] PlanarTrack: A high-quality and challenging benchmark for large-scale planar object tracking

Yifan Jiao,Xinran Liu,Xiaoqiong Liu,Xiaohui Yuan,Heng Fan,Libo Zhang

Main category: cs.CV

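TL;DR: This paper presents PlanarTrack, a large-scale, high-quality planar tracking benchmark with 1,150 sequences and over 733K frames for comprehensive evaluation of short-term and long-term tracking.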

Details Motivation: The lack of a large-scale platform has held back planar tracking in the deep learning era, so a more challenging and diverse benchmark is needed to move the field forward. Method: The authors build a large-scale dataset of short-term and long-term videos. Every frame is manually annotated with four corner points and refined over multiple rounds of inspection to ensure annotation quality, and each sequence contains a unique target to increase target diversity. Result: PlanarTrack is currently the largest, most diverse, and most challenging planar tracking dataset. An evaluation of 10 representative planar trackers shows that existing methods degrade heavily on it. Conclusion: PlanarTrack is an important resource for planar tracking research; it exposes the limits of current methods and shows that the field needs further work. Abstract: Planar tracking has drawn increasing interest owing to its key roles in robotics and augmented reality. Despite recent great advancement, further development of planar tracking, particularly in the deep learning era, is largely limited compared to generic tracking due to the lack of large-scale platforms. To mitigate this, we propose PlanarTrack, a large-scale, high-quality, and challenging benchmark for planar tracking. Specifically, PlanarTrack consists of 1,150 sequences with over 733K frames, including 1,000 short-term and 150 new long-term videos, which enables comprehensive evaluation of short- and long-term tracking performance. All videos in PlanarTrack are recorded in unconstrained conditions from the wild, which makes PlanarTrack challenging but more realistic for real-world applications. To ensure high-quality annotations, each video frame is manually annotated by four corner points with multi-round meticulous inspection and refinement. To enhance target diversity of PlanarTrack, we only capture a unique target in one sequence, which is different from existing benchmarks. To the best of our knowledge, PlanarTrack is by far the largest and most diverse and challenging dataset dedicated to planar tracking. To understand performance of existing methods on PlanarTrack and to provide a comparison for future research, we evaluate 10 representative planar trackers with extensive comparison and in-depth analysis. Our evaluation reveals that, unsurprisingly, the top planar trackers heavily degrade on the challenging PlanarTrack, which indicates more efforts are required for improving planar tracking. 
Our data and results will be released at https://github.com/HengLan/PlanarTrack

[298] An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping

Songxi Yang,Tang Sui,Qunying Huang

Main category: cs.CV

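TL;DR: This study proposes LSSR, an efficient framework for remote sensing image super resolution. Built on a pretrained Stable Diffusion model and combined with multi-modal auxiliary information and SAR guidance, it improves crop boundary delineation and downstream task performance.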

Details Motivation: Existing diffusion models for remote sensing super resolution are costly to train, slow at inference, make little use of real-world constraints, and are rarely evaluated on downstream tasks. Method: LSSR builds on a frozen pretrained Stable Diffusion model and introduces cross-modal attention to fuse auxiliary information (digital elevation model, land cover, month, and SAR). Adapters and a tailored Fourier NDVI loss balance spatial detail against spectral fidelity. Result: On the 30 m to 10 m super resolution task, LSSR reaches state-of-the-art performance, with PSNR/SSIM of 32.63/0.84 on RGB and 23.99/0.78 on IR, an NDVI MSE of 0.042, and an inference time of 0.39 s per image. On NASA HLS data it yields more reliable crop classification (F1: 0.86 vs. 0.85 for Sentinel-2). Conclusion: By fusing multi-source auxiliary information with a pretrained model, LSSR improves remote sensing super resolution quality while keeping inference efficient, with potential applications in precision agriculture. Abstract: Super resolution offers a way to harness medium- and even low-resolution but historically valuable remote sensing image archives. Generative models, especially diffusion models, have recently been applied to remote sensing super resolution (RSSR), yet several challenges exist. First, diffusion models are effective but require expensive from-scratch training and have slow inference speeds. Second, current methods have limited utilization of auxiliary information as real-world constraints to reconstruct scientifically realistic images. Finally, most current methods lack evaluation on downstream tasks. In this study, we present an efficient LSSR framework for RSSR, supported by a new multimodal dataset of paired 30 m Landsat 8 and 10 m Sentinel 2 imagery. Built on frozen pretrained Stable Diffusion, LSSR integrates cross-modal attention with auxiliary knowledge (Digital Elevation Model, land cover, month) and Synthetic Aperture Radar guidance, enhanced by adapters and a tailored Fourier NDVI loss to balance spatial details and spectral fidelity. Extensive experiments demonstrate that LSSR significantly improves crop boundary delineation and recovery, achieving state-of-the-art performance with Peak Signal-to-Noise Ratio/Structural Similarity Index Measure of 32.63/0.84 (RGB) and 23.99/0.78 (IR), and the lowest NDVI Mean Squared Error (0.042), while maintaining efficient inference (0.39 sec/image). 
Moreover, LSSR transfers effectively to NASA Harmonized Landsat and Sentinel (HLS) super resolution, yielding more reliable crop classification (F1: 0.86) than Sentinel-2 (F1: 0.85). These results highlight the potential of RSSR to advance precision agriculture.
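
For readers unfamiliar with the NDVI Mean Squared Error reported above, the sketch below shows how such a metric is commonly computed from near-infrared and red reflectance, using the standard definition NDVI = (NIR - Red) / (NIR + Red). It does not reproduce the paper's Fourier NDVI loss, whose form the abstract does not give; the function names, the `eps` guard, and the sample pixels are assumptions for illustration.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index for one pixel.
    eps is an assumed guard against division by zero."""
    return (nir - red) / (nir + red + eps)

def ndvi_mse(pred, target):
    """Mean squared NDVI error between two images, each given as a list
    of (nir, red) reflectance pairs, one pair per pixel."""
    errors = [(ndvi(*p) - ndvi(*t)) ** 2 for p, t in zip(pred, target)]
    return sum(errors) / len(errors)

# Toy pixels: (nir, red) reflectance values in [0, 1] (made up).
target = [(0.5, 0.1), (0.4, 0.2), (0.3, 0.3)]
pred = [(0.5, 0.1), (0.4, 0.2), (0.3, 0.1)]
error = ndvi_mse(pred, target)
```

Because NDVI is a ratio of bands, this metric penalizes spectral distortion between the red and infrared channels even when each band looks sharp on its own.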

[299] VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations

Lu Dong,Haiyu Zhang,Han Lin,Ziang Yan,Xiangyu Zeng,Hongjie Zhang,Yifei Huang,Yi Wang,Zhen-Hua Ling,Limin Wang,Yali Wang

Main category: cs.CV

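TL;DR: This paper proposes VideoTG-R1, a new framework that combines reflected boundary annotations with curriculum reinforcement learning to handle partially annotated and hard-to-ground samples in video temporal grounding; with only 10% of the data and 21% of the compute, it outperforms methods trained on the full data.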

Details Motivation: Existing video temporal grounding methods based on multimodal large models overlook the quality and difficulty of training samples during reinforcement learning, including the ambiguous supervision introduced by partially annotated samples and the low learning efficiency caused by hard-to-ground samples. Method: A Boundary Reflection Agent identifies and filters out partially annotated samples to reduce ambiguity; a Difficulty Estimation Agent assesses sample difficulty, and a curriculum RL strategy with dynamic masking trains on hard samples progressively. Result: The method is validated on VTG and grounded VideoQA tasks; with only 10% of the training data and 21% of the computational budget, it outperforms full-data baselines (GRPO and SFT). Conclusion: By addressing sample quality and difficulty, VideoTG-R1 achieves efficient and robust video temporal grounding and substantially reduces the dependence on large amounts of annotated data and compute. Abstract: Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges arising from both the quality and difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, exhibiting no clear preference among multiple outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training. Specifically, we propose a Boundary Reflection Agent that utilizes MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples, thereby reducing ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training steps, easing the training difficulty and providing clearer preference. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. 
Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at https://github.com/ldong1111/VideoTG-R1.
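The abstract's point about hard-to-ground samples producing "indistinguishable rewards" can be made concrete with the group-relative advantage used in GRPO-style training. The sketch below is an illustration under assumptions, not the VideoTG-R1 code: the function name, the use of the population standard deviation, the `eps` value, and the reward numbers are all hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    Population std is an assumption here; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A sample where the sampled outputs differ in quality: the advantages
# separate good outputs (positive) from bad ones (negative).
easy = group_relative_advantages([0.9, 0.2, 0.6, 0.1])

# A hard-to-ground sample: every output gets the same low reward,
# so every advantage is zero and the update carries no preference.
hard = group_relative_advantages([0.02, 0.02, 0.02, 0.02])
```

When all rewards in a group are identical, the numerator is zero for every output, which is why such samples contribute little learning signal until the task is made easier, for instance by the masking curriculum the abstract describes.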

[300] Color and Frequency Correction for Image Colorization

Yun Kai Zhuang

Main category: cs.CV

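TL;DR: To address the DDColor image colorization model's limitations in certain frequency bands and the color cast caused by insufficient input dimension, this paper proposes two optimization schemes and combines them, significantly improving image quality metrics such as PSNR and SSIM.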

Details Motivation: The existing DDColor model has limitations in certain frequency bands and produces a color cast because of insufficient input dimension, which degrades colorization quality. Method: Two optimization schemes are constructed and combined to re-optimize the DDColor model and improve colorization performance. Result: With the combined schemes, images colorized by DDColor improve on metrics such as PSNR and SSIM. Conclusion: The proposed joint optimization strategy effectively improves DDColor's colorization quality and mitigates the frequency-band limitations and color cast. Abstract: This project re-optimizes image colorization based on DDColor, an existing automatic colorization model. In experiments with the existing DDColor weights, we found limitations in some frequency bands and a color cast problem caused by insufficient input dimension. We construct two optimization schemes and combine them, which improves metrics such as PSNR and SSIM for images colorized by DDColor.

[301] Symmetria: A Synthetic Dataset for Learning in Point Clouds

Ivan Sipiran,Gustavo Santelices,Lucas Oyarzún,Andrea Ranieri,Chiara Romanengo,Silvia Biasotti,Bianca Falcidieno

Main category: cs.CV

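TL;DR: This paper presents Symmetria, a point cloud dataset built on symmetry principles that can be generated at arbitrary scale, addressing the scarcity of large-scale annotated data for point cloud learning.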

Details Motivation: Point cloud self-supervised learning is held back by the lack of large-scale, high-quality point cloud datasets; existing methods are limited by data volume and by the difficulty of obtaining precise ground truth. Method: Diverse point cloud shapes with known symmetric structure are generated in a formula-driven way, which guarantees precise ground truth and allows flexible extension to new tasks and modalities. Result: Experiments show that the dataset is highly effective for point cloud self-supervised pre-training; pre-trained models perform well on downstream tasks such as classification and segmentation, show good few-shot learning ability, and can be fine-tuned effectively for real-world object classification. A new benchmark task for symmetry detection is also provided. Conclusion: Symmetria offers a scalable, data-efficient solution with readily available ground truth for point cloud learning; the dataset, code, and generation tools are publicly available. Abstract: Unlike image or text domains that benefit from an abundance of large-scale datasets, point cloud learning techniques frequently encounter limitations due to the scarcity of extensive datasets. To overcome this limitation, we present Symmetria, a formula-driven dataset that can be generated at any arbitrary scale. By construction, it ensures the absolute availability of precise ground truth, promotes data-efficient experimentation by requiring fewer samples, enables broad generalization across diverse geometric settings, and offers easy extensibility to new tasks and modalities. Using the concept of symmetry, we create shapes with known structure and high variability, enabling neural networks to learn point cloud features effectively. Our results demonstrate that this dataset is highly effective for point cloud self-supervised pre-training, yielding models with strong performance in downstream tasks such as classification and segmentation, which also show good few-shot learning capabilities. Additionally, our dataset can support fine-tuning models to classify real-world objects, highlighting our approach's practical utility and application. We also introduce a challenging task for symmetry detection and provide a benchmark for baseline comparisons. A significant advantage of our approach is the public availability of the dataset, the accompanying code, and the ability to generate very large collections, promoting further research and innovation in point cloud learning.

[302] Towards Generalisable Foundation Models for 3D Brain MRI

Moona Mazher,Geoff J. M. Parker,Daniel C. Alexander

Main category: cs.CV

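TL;DR: BrainFound is a self-supervised foundation model for brain MRI analysis built on DINO-v2; by integrating 3D volumetric information and multimodal inputs, it performs well on tasks such as disease detection and image segmentation, especially in label-scarce and multi-contrast settings.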

Details Motivation: Existing brain MRI analysis methods mostly rely on single-slice paradigms and large amounts of annotated data, which limits generalization across imaging protocols and clinical scenarios. A 3D self-supervised model that learns general-purpose features from large-scale unlabeled data is needed. Method: The model extends DINO-v2 by incorporating volumetric information from sequential MRI slices to model full 3D brain anatomy, supports both single- and multimodal inputs, and is trained with self-supervised learning. Result: BrainFound consistently outperforms existing self-supervised pretraining methods and supervised baselines on downstream tasks such as disease detection and image segmentation, particularly in label-scarce and multi-contrast settings, and generalizes across imaging protocols. Conclusion: BrainFound is a scalable, practical foundation model for 3D neuroimaging analysis that substantially reduces reliance on expert annotations and has broad potential for clinical use and research. Abstract: Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.

[303] Quality-controlled registration of urban MLS point clouds reducing drift effects by adaptive fragmentation

Marco Antonio Ortiz Rincon,Yihui Yang,Christoph Holst

Main category: cs.CV

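TL;DR: Proposes a workflow for efficient and accurate registration of large-scale mobile laser scanning point clouds in complex urban environments.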

Details Motivation: To address the registration difficulties caused by variations in point cloud density, noise, and occlusion in urban environments. Method: The Semi-sphere Check (SSC) preprocessing technique fragments the trajectory data, and Planar Voxel-based Generalized ICP (PV-GICP) is proposed for fine registration. Result: On real-world data from Munich's inner city, the workflow achieves sub-0.01 m average registration accuracy and reduces computation time by more than 50%. Conclusion: The workflow significantly raises the level of automation in 3D urban modeling and is applicable to urban planning and dynamic monitoring. Abstract: This study presents a novel workflow designed to efficiently and accurately register large-scale mobile laser scanning (MLS) point clouds to a target model point cloud in urban street scenarios. This workflow specifically targets the complexities inherent in urban environments and adeptly addresses the challenges of integrating point clouds that vary in density, noise characteristics, and occlusion scenarios, which are common in bustling city centers. Two methodological advancements are introduced. First, the proposed Semi-sphere Check (SSC) preprocessing technique optimally fragments MLS trajectory data by identifying mutually orthogonal planar surfaces. This step reduces the impact of MLS drift on the accuracy of the entire point cloud registration, while ensuring sufficient geometric features within each fragment to avoid local minima. Second, we propose Planar Voxel-based Generalized Iterative Closest Point (PV-GICP), a fine registration method that selectively utilizes planar surfaces within voxel partitions. This pre-process strategy not only improves registration accuracy but also reduces computation time by more than 50% compared to conventional point-to-plane ICP methods. Experiments on real-world datasets from Munich's inner city demonstrate that our workflow achieves sub-0.01 m average registration accuracy while significantly shortening processing times. The results underscore the potential of the proposed methods to advance automated 3D urban modeling and updating, with direct applications in urban planning, infrastructure management, and dynamic city monitoring.

[304] MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans

Ahmet Serdar Karadeniz,Dimitrios Mallis,Danila Rukhovich,Kseniya Cherenkova,Anis Kacem,Djamila Aouada

Main category: cs.CV

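TL;DR: Proposes a new CAD reverse engineering method inspired by how human designers work; it extracts 2D patterns from multi-plane cross-sections and, for the first time, incorporates sketch constraints into reconstruction, yielding more detailed and editable parametric CAD models.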

Details Motivation: Existing deep learning methods for converting 3D scans into parametric CAD models either fail to produce fully parametric outputs or overlook geometric details, and they neglect sketch-level constraints, a key aspect of CAD modeling. Method: The approach extracts 2D patterns from multi-plane cross-sections, mimicking how human designers model by hand; it combines bottom-up and top-down strategies and explicitly introduces sketch constraints during reconstruction. Result: The method reconstructs more detailed and editable parametric CAD models, outperforms current state-of-the-art methods, and is the first to integrate sketch constraints. Conclusion: The method improves the precision and editability of CAD reverse engineering and advances automated reconstruction of parametric CAD models from 3D scans. Abstract: Computer-Aided Design (CAD) plays a foundational role in modern manufacturing and product development, often requiring designers to modify or build upon existing models. Converting 3D scans into parametric CAD representations--a process known as CAD reverse engineering--remains a significant challenge due to the high precision and structural complexity of CAD models. Existing deep learning-based approaches typically fall into two categories: bottom-up, geometry-driven methods, which often fail to produce fully parametric outputs, and top-down strategies, which tend to overlook fine-grained geometric details. Moreover, current methods neglect an essential aspect of CAD modeling: sketch-level constraints. In this work, we introduce a novel approach to CAD reverse engineering inspired by how human designers manually perform the task. Our method leverages multi-plane cross-sections to extract 2D patterns and capture fine parametric details more effectively. It enables the reconstruction of detailed and editable CAD models, outperforming state-of-the-art methods and, for the first time, incorporating sketch constraints directly into the reconstruction process.

[305] CURVETE: Curriculum Learning and Progressive Self-supervised Training for Medical Image Classification

Asmaa Abbas,Mohamed Gaber,Mohammed M. Abdelsamea

Main category: cs.CV

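TL;DR: This paper proposes CURVETE, a new deep convolutional neural network that combines curriculum learning with progressive self-supervised training to cope with scarce samples and uneven class distributions in medical image analysis, and shows superior classification performance on several medical image datasets.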

Details Motivation: The quality and availability of annotated samples are a challenge in medical image analysis, and conventional fine-tuning becomes less effective when the class distribution is imbalanced. Method: The proposed CURVETE model trains on generic unlabelled samples with a curriculum learning strategy based on the granularity of sample decomposition, and applies a class decomposition approach in the downstream task to handle class imbalance. Result: With ResNet-50, the model reaches accuracies of 96.60%, 75.60%, and 93.35% on the brain tumour, digital knee x-ray, and Mini-DDSM datasets, respectively; it also achieves strong results with DenseNet-121 and outperforms other training strategies. Conclusion: Through curriculum learning and class decomposition, CURVETE improves model generalisability and classification performance, and is particularly suited to medical imaging tasks with few samples and imbalanced classes. Abstract: Identifying high-quality and easily accessible annotated samples poses a notable challenge in medical image analysis. Transfer learning techniques, leveraging pre-training data, offer a flexible solution to this issue. However, the impact of fine-tuning diminishes when the dataset exhibits an irregular distribution between classes. This paper introduces a novel deep convolutional neural network, named Curriculum Learning and Progressive Self-supervised Training (CURVETE). CURVETE addresses challenges related to limited samples, enhances model generalisability, and improves overall classification performance. It achieves this by employing a curriculum learning strategy based on the granularity of sample decomposition during the training of generic unlabelled samples. Moreover, CURVETE addresses the challenge of irregular class distribution by incorporating a class decomposition approach in the downstream task. The proposed method undergoes evaluation on three distinct medical image datasets: brain tumour, digital knee x-ray, and Mini-DDSM datasets. We investigate the classification performance using a generic self-supervised sample decomposition approach with and without the curriculum learning component in training the pretext task. Experimental results demonstrate that the CURVETE model achieves superior performance on test sets with an accuracy of 96.60% on the brain tumour dataset, 75.60% on the digital knee x-ray dataset, and 93.35% on the Mini-DDSM dataset using the baseline ResNet-50. 
Furthermore, with the baseline DenseNet-121, it achieved accuracies of 95.77%, 80.36%, and 93.22% on the brain tumour, digital knee x-ray, and Mini-DDSM datasets, respectively, outperforming other training strategies.

[306] FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network

Fangtong Sun,Congyu Li,Ke Yang,Yuchen Pan,Hanwen Yu,Xichuan Zhang,Yiying Li

Main category: cs.CV

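TL;DR: This paper proposes FRBNet, a new module based on the frequency-domain channel ratio that extracts illumination-invariant features and improves downstream tasks such as object detection and segmentation under low-light conditions.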

Details Motivation: Existing low-light image processing methods model low-light conditions incompletely, which limits performance on downstream tasks such as detection and segmentation; the low-light image formation process needs to be characterized more precisely. Method: The classical Lambertian model is extended and analyzed in the frequency domain; the frequency-domain channel ratio is combined with a learnable filter to extract illumination-invariant features, yielding the end-to-end trainable FRBNet module. Result: FRBNet performs well across several downstream tasks, with gains of +2.2 mAP for low-light object detection and +2.9 mIoU for nighttime segmentation. Conclusion: Through frequency-domain modeling, FRBNet strengthens feature representation in low-light environments and serves as a plug-and-play, general-purpose low-light vision module. Abstract: Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection and segmentation. While recent state-of-the-art methods have improved performance through invariant feature learning modules, they still fall short due to incomplete modeling of low-light conditions. Therefore, we revisit low-light image formation and extend the classical Lambertian model to better characterize low-light conditions. By shifting our analysis to the frequency domain, we theoretically prove that the frequency-domain channel ratio can be leveraged to extract illumination-invariant features via a structured filtering process. We then propose a novel and end-to-end trainable module named Frequency-domain Radial Basis Network (FRBNet), which integrates the frequency-domain channel ratio operation with a learnable frequency domain filter for the overall illumination-invariant feature enhancement. As a plug-and-play module, FRBNet can be integrated into existing networks for low-light downstream tasks without modifying loss functions. Extensive experiments across various downstream tasks demonstrate that FRBNet achieves superior performance, including +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation. Code is available at: https://github.com/Sing-Forevet/FRBNet.

[307] Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Shijian Wang,Jiarui Jin,Xingjian Wang,Linxin Song,Runhao Fu,Hecheng Wang,Zongyuan Ge,Yuan Lu,Xuelian Cheng

Main category: cs.CV

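TL;DR: This paper proposes Video-Thinker, a method that lets multimodal large language models (MLLMs) reason over videos by autonomously using their intrinsic "grounding" and "captioning" capabilities; it builds Video-Thinker-10K, a dataset featuring autonomous tool usage, and combines supervised fine-tuning with reinforcement learning to reach state-of-the-art performance on several video reasoning benchmarks.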

Details Motivation: Although image reasoning methods have succeeded in multimodal large language models, the paradigm has not yet been extended to video reasoning. A method is needed that lets MLLMs reason autonomously over videos without relying on external tools. Method: Video-Thinker uses the MLLM's own grounding and captioning capabilities to generate reasoning clues; the Video-Thinker-10K dataset is constructed, supervised fine-tuning (SFT) teaches the reasoning format, and Group Relative Policy Optimization (GRPO) then strengthens the reasoning capability. Result: Video-Thinker significantly outperforms existing methods on video reasoning benchmarks such as Video-Holmes, CG-Bench-Reasoning, and VRBench, and the 7B model reaches state-of-the-art performance among models of its size. Conclusion: Video-Thinker enables MLLMs to reason autonomously over videos without external tools, driving the reasoning process with intrinsic capabilities and offering an efficient, scalable paradigm for video understanding. Abstract: Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.

[308] UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception

Karthikeyan Chandra Sekaran,Markus Geisler,Dominik Rößle,Adithya Mohan,Daniel Cremers,Wolfgang Utschick,Michael Botsch,Werner Huber,Torsten Schön

Main category: cs.CV

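TL;DR: This paper introduces UrbanIng-V2X, the first large-scale, multi-modal V2X perception dataset covering three urban intersections in Ingolstadt, Germany; it contains cooperative perception data from multiple vehicles and infrastructure sensors, and aims to overcome the limited scene diversity of existing datasets and support more comprehensive algorithm evaluation.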

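Details Motivation: Existing real-world cooperative perception datasets are typically limited to a single intersection or a single vehicle; this lack of diversity invites overfitting and biased performance estimates. A comprehensive dataset spanning multiple intersections, vehicles, and infrastructure sensors is needed to improve generalization across diverse traffic environments. Method: The authors build the UrbanIng-V2X dataset at three urban intersections in Ingolstadt, Germany. It contains 34 temporally synchronized and spatially calibrated sensor sequences of 20 seconds each; every sequence involves two vehicles and up to three infrastructure sensor poles, records multiple sensor types on both the vehicle and infrastructure side, and is annotated at 10 Hz with 3D bounding boxes for 13 object classes. Result: UrbanIng-V2X contains about 712k annotated instances, with data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs, providing high-precision multi-modal, multi-view perception data. A codebase, HD map, and a complete digital twin of the environment are released alongside it. Conclusion: UrbanIng-V2X is the first large-scale dataset supporting cooperative perception across multiple intersections, vehicles, and infrastructure; it substantially strengthens benchmarking for cooperative perception research and helps advance perception algorithms for complex urban environments in smart mobility applications. Abstract: Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. 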
In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment.

[309] MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding

Xin Jin,Siyuan Li,Siyong Jian,Kai Yu,Huan Wang

Main category: cs.CV

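TL;DR: Proposes MergeMix, a training-time augmentation paradigm that combines the strengths of supervised fine-tuning and reinforcement learning; through attention-aware image mixing and preference-driven training, it achieves efficient and scalable vision-language alignment in multi-modal large language models.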

Details Motivation: Existing vision-language alignment methods face a trade-off between supervised fine-tuning (SFT) and reinforcement learning (RL): SFT is stable but depends on large-scale annotations and struggles to capture fine-grained preferences, while RL introduces a reward signal but is costly and unstable. Method: MergeMix first performs attention-aware image mixing via token merge, enriching cluster representation and spatial context; it then builds preference pairs from mixed and raw images and trains with the SimPO loss. Result: Experiments show that MergeMix outperforms other heuristic-based methods in classification, with better attention consistency and efficiency, and is superior in both accuracy and training efficiency. Conclusion: MergeMix bridges the strengths of SFT and RL and offers a scalable, efficient preference alignment approach for multi-modal large language models. Abstract: Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL brings in a reward signal for training, but suffers from overhead and instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies an attention-aware image mixing via token merge with more cluster representation and spatial context, and then presents a preference-driven training paradigm for MLLMs by building preference pairs with mixed images and raw images, and optimizing via SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
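The SimPO loss mentioned in this abstract can be sketched for a single preference pair as follows. This is a minimal scalar illustration of the general SimPO objective (length-normalized log-likelihood rewards with a target margin), not MergeMix's training code; the function and parameter names and the default `beta` and `gamma` values are assumptions, as is treating the response to the raw image as the preferred one.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair.

    logp_* is the summed token log-probability of a response and len_* its
    token count, so beta * logp / len is a length-normalized implicit reward.
    loss = -log(sigmoid(r_chosen - r_rejected - gamma))
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)

# Hypothetical pair: the chosen response (assumed here to come from the raw
# image) is more likely per token than the rejected one (mixed image).
loss_good = simpo_loss(-12.0, 10, -30.0, 10)
loss_bad = simpo_loss(-30.0, 10, -12.0, 10)
```

Unlike DPO, this objective needs no reference model, since the reward is computed from the policy's own average log-probability.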

[310] On the Faithfulness of Visual Thinking: Measurement and Enhancement

Zujing Liu,Junwen Pan,Qi She,Yuan Gao,Guisong Xia

Main category: cs.CV

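TL;DR: This paper studies the inaccuracy of visual information in the multimodal chain-of-thought (MCoT) that large vision-language models (LVLMs) generate after reinforcement fine-tuning, and proposes SCCM, a new annotation-free learning strategy that improves the faithfulness of visual reasoning in MCoT.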

Details Motivation: The visual information in current MCoT traces is used but is unreliable and insufficient; models ignore visual content during reasoning and rely only on textual cues, so the reasoning process lacks faithfulness. Method: Intervention experiments analyze the influence of visual and textual thoughts in MCoT, and an automated LVLM-based metric measures the reliability and sufficiency of visual cues; the SCCM learning strategy is then proposed to encourage the model to generate sufficient yet minimal visual components. Result: Experiments show that SCCM significantly improves the faithfulness of visual information in MCoT on fine-grained perception and reasoning tasks; the method needs no extra annotation and is plug-and-play compatible with existing RFT pipelines. Conclusion: SCCM improves how much and how accurately MCoT reasoning in LVLMs depends on visual information, strengthening the interpretability and reliability of multimodal reasoning. Abstract: Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate, though the traces still yield correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of the visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To further analyze visual information, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. 
We note that the proposed SCCM is annotation-free and compatible with various RFT for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves the visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.

[311] Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap

Elisabeth Jüttner,Leona Krath,Stefan Korfhage,Hannah Dröge,Matthias B. Hullin,Markus Plack

Main category: cs.CV

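TL;DR: Proposes a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically based rendering, achieving stable and scalable volumetric video relighting.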

Details Motivation: Existing methods for volumetric video relighting suffer from temporal instability and insufficient generation quality, in particular the stochastic noise of diffusion models on sequences and their memory limits. Method: A diffusion model extracts per-frame material properties, which are aggregated into consistent shading components through optical-flow-guided temporal regularization; a mesh proxy is generated from Gaussian Opacity Fields and rendered in a standard graphics pipeline to produce indirect lighting effects. Result: Experiments on real and synthetic data show that the method is markedly more temporally stable than diffusion-only baselines and scales to longer video clips. Conclusion: Hybrid methods that fuse learned priors with physical constraints are an effective route to production-grade volumetric video relighting. Abstract: Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.

[312] VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Walid Bousselham,Hilde Kuehne,Cordelia Schmid

Main category: cs.CV

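TL;DR: Proposes the VOLD framework, which uses text-based reasoning resources to improve the complex reasoning ability of vision-language models through reinforcement learning and on-policy distillation.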

Details Motivation: High-quality image-text reasoning data is scarce while text-based reasoning resources are abundant; how to use text reasoning ability to improve the complex reasoning of vision-language models remains an open question. Method: The VOLD framework combines Group Relative Policy Optimization (GRPO) with on-policy distillation, using a text-only teacher model to guide the reasoning of a vision-language student model, and stresses the importance of cold-start alignment. Result: VOLD significantly outperforms the baseline model and existing methods on benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista. Conclusion: By guiding the student with the reasoning traces of a text-only teacher, VOLD markedly improves the complex reasoning of vision-language models without large amounts of annotated image-text data; cold-start alignment is key to successful transfer. Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, among other reasons due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.

[313] iPac: Incorporating Intra-image Patch Context into Graph Neural Networks for Medical Image Classification

Usama Zidan,Mohamed Gaber,Mohammed M. Abdelsamea

Main category: cs.CV

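TL;DR: Proposes iPac, a method that builds a meaningful graph representation of images to improve the performance of graph neural networks in medical image classification.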

Details Motivation: Existing graph neural networks for image classification give limited consideration to the structure and relationships among visual entities, which restricts performance, especially on medical images. Method: iPac integrates patch partitioning, feature extraction, clustering, graph construction, and graph-based learning into a unified network, using clustering to organise features and build a semantically rich graph representation. Result: Experiments on several medical image datasets show an average accuracy improvement of up to 5% over baseline methods. Conclusion: By effectively modeling the structural relationships among visual entities, iPac provides a generic and effective graph neural network solution for medical image classification. Abstract: Graph neural networks have emerged as a promising paradigm for image processing, yet their performance in image classification tasks is hindered by a limited consideration of the underlying structure and relationships among visual entities. This work presents iPac, a novel approach to introduce a new graph representation of images to enhance graph neural network image classification by recognizing the importance of underlying structure and relationships in medical image classification. iPac integrates various stages, including patch partitioning, feature extraction, clustering, graph construction, and graph-based learning, into a unified network to advance graph neural network image classification. By capturing relevant features and organising them into clusters, we construct a meaningful graph representation that effectively encapsulates the semantics of the image. Experimental evaluation on diverse medical image datasets demonstrates the efficacy of iPac, exhibiting an average accuracy improvement of up to 5% over baseline methods. Our approach offers a versatile and generic solution for image classification, particularly in the realm of medical images, by leveraging the graph representation and accounting for the inherent structure and relationships among visual entities.

[314] FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time

Yaoli Liu,Yao-Xiang Ding,Kun Zhou

Main category: cs.CV

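TL;DR: FreeFuse is a training-free method for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs for efficient generation.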

Details Motivation: Existing methods depend on complex pre-inference LoRA weight merging or on auxiliary models (such as segmentation models) to isolate LoRA outputs, which limits practicality and efficiency. Method: Context-aware dynamic subject masks are derived automatically from cross-attention layer weights and applied directly to the LoRA outputs to achieve precise subject fusion. Result: Experiments show that FreeFuse outperforms existing methods in generation quality and usability, while requiring no training, no modification to the LoRAs, no auxiliary models, and no user-defined templates. Conclusion: FreeFuse offers an efficient, practical approach to multi-subject text-to-image generation that simplifies the workflow and improves generation quality. Abstract: This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference well approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Instead, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability under the multi-subject generation tasks. The project page is at https://future-item.github.io/FreeFuse/

[315] DPGLA: Bridging the Gap between Synthetic and Real Data for Unsupervised Domain Adaptation in 3D LiDAR Semantic Segmentation

Wanmeng Li,Simone Mosco,Daniel Fusaro,Alberto Pretto

Main category: cs.CV

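TL;DR: This paper proposes a Dynamic Pseudo-Label Filtering (DPLF) scheme and a Prior-Guided Data Augmentation Pipeline (PG-DAP) to improve the use of real data in unsupervised domain adaptation for point cloud semantic segmentation, and validates the approach experimentally.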

Details Motivation: Existing self-training-based unsupervised domain adaptation methods make poor use of unlabeled real point cloud data; they typically rely on fixed confidence thresholds, which limits performance. Method: Proposes a Dynamic Pseudo-Label Filtering (DPLF) mechanism that adaptively selects high-quality pseudo-labels, designs a Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate the domain shift between synthetic and real point clouds, and adds a data mixing consistency loss that pushes the model to learn context-free representations. Result: On two challenging synthetic-to-real point cloud semantic segmentation tasks, the method outperforms state-of-the-art approaches, and ablation studies confirm the effectiveness of DPLF and PG-DAP. Conclusion: DPLF and PG-DAP substantially improve UDA semantic segmentation on point clouds and make better use of unlabeled real data, offering an efficient solution for practical applications. Abstract: Annotating real-world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self-training-based Unsupervised Domain Adaptation (UDA) has been widely used to improve point cloud semantic segmentation by leveraging synthetic point cloud data. However, we argue that existing methods do not effectively utilize unlabeled data, as they rely on predefined or fixed confidence thresholds, resulting in suboptimal performance. In this paper, we propose a Dynamic Pseudo-Label Filtering (DPLF) scheme to enhance real data utilization in point cloud UDA semantic segmentation. Additionally, we design a simple and efficient Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate domain shift between synthetic and real-world point clouds. Finally, we utilize data mixing consistency loss to push the model to learn context-free representations. We implement and thoroughly evaluate our approach through extensive comparisons with state-of-the-art methods. Experiments on two challenging synthetic-to-real point cloud semantic segmentation tasks demonstrate that our approach achieves superior performance. Ablation studies confirm the effectiveness of the DPLF and PG-DAP modules. We release the code of our method in this paper.
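
To make the contrast with fixed confidence thresholds concrete, here is a minimal sketch of one possible dynamic filter. The abstract does not specify how DPLF computes its thresholds, so the per-class "keep the top fraction by confidence" rule, the function name `dynamic_filter`, and the `keep_ratio` parameter are all assumptions chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical sketch: the per-class top-fraction rule is an assumed stand-in,
# not the DPLF scheme described in the paper.

def dynamic_filter(preds, keep_ratio=0.5):
    """preds is a list of (class_id, confidence) pseudo-labels.
    For each class, the threshold is set to the confidence of the k-th most
    confident prediction, where k = max(1, floor(n * keep_ratio)).
    Returns (keep_flags, thresholds)."""
    by_class = defaultdict(list)
    for cls, conf in preds:
        by_class[cls].append(conf)

    thresholds = {}
    for cls, confs in by_class.items():
        ranked = sorted(confs, reverse=True)
        k = max(1, int(len(ranked) * keep_ratio))
        thresholds[cls] = ranked[k - 1]

    keep = [conf >= thresholds[cls] for cls, conf in preds]
    return keep, thresholds


# Class 0 is predicted confidently, class 1 is not (e.g. a rare class).
preds = [(0, 0.99), (0, 0.95), (0, 0.90), (0, 0.85), (1, 0.60), (1, 0.40)]
keep, thresholds = dynamic_filter(preds, keep_ratio=0.5)

# For comparison: a single fixed threshold of 0.9 discards every class-1 label.
fixed_keep = [conf >= 0.9 for _, conf in preds]
```

With a fixed threshold of 0.9 the low-confidence class contributes no pseudo-labels at all, whereas the per-class threshold still retains its most confident prediction. That failure mode of fixed thresholds is the one the abstract argues against.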

[316] EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Baoqi Pei,Yifei Huang,Jilan Xu,Yuping He,Guo Chen,Fei Wu,Yu Qiao,Jiangmiao Pang

Main category: cs.CV

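TL;DR: This paper proposes EgoThinker, a framework that improves the egocentric video reasoning of multimodal large language models through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum.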

Details Motivation: Existing multimodal large language models perform well at visible event reasoning but lack an embodied, first-person understanding of the hidden intentions and fine-grained interactions of the agent behind the camera. Method: Builds EgoRe-5M, a large-scale egocentric QA dataset, and applies supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT) to strengthen the model's spatio-temporal localization. Result: EgoThinker outperforms existing methods on multiple egocentric benchmarks and substantially improves performance on fine-grained spatio-temporal localization tasks. Conclusion: EgoThinker effectively improves the egocentric video reasoning of multimodal large language models and advances work in this area. Abstract: Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.

[317] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Hongkai Lin,Dingkang Liang,Mingyang Du,Xin Zhou,Xiang Bai

Main category: cs.CV

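TL;DR: MERGE is a unified image generation and depth estimation model built on a pre-trained text-to-image diffusion model. Using a play-and-plug framework and a Group Reuse Mechanism, it achieves state-of-the-art zero-shot depth estimation while preserving the original generation ability.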

Details Motivation: Generative depth estimation methods exploit the visual priors of pre-trained diffusion models and show strong zero-shot capability, but parameter updates during training severely degrade the model's generation ability. A unified model is needed that preserves generation while performing depth estimation effectively. Method: Proposes MERGE, which keeps the pre-trained text-to-image model fixed, uses a play-and-plug framework to switch seamlessly between image generation and depth estimation modes, and introduces a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the learnable parameters. Result: MERGE reaches state-of-the-art performance on multiple depth estimation benchmarks while retaining the original image generation ability, outperforming other unified models. Conclusion: MERGE unlocks the depth estimation potential of pre-trained text-to-image models and efficiently unifies generation and understanding tasks, suggesting a new direction for multi-task model design. Abstract: Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation and can extend to depth estimation effortlessly. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE

[318] Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation

Junyoung Seo,Rodrigo Mira,Alexandros Haliassos,Stella Bounareli,Honglie Chen,Linh Tran,Seungryong Kim,Zoe Landgraf,Jie Shen

Main category: cs.CV

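TL;DR: Proposes Lookahead Anchoring, which uses frames from future timesteps as guidance to address identity drift in audio-driven human animation, without an additional keyframe generation stage.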

Details Motivation: Address the identity drift that audio-driven human animation models exhibit during autoregressive generation, while avoiding the loss of natural motion that inserted keyframes can cause. Method: Introduces Lookahead Anchoring, which uses keyframes from future timesteps as directional targets so that the model keeps pursuing these future anchors while responding to the current audio signal. It also enables self-keyframing, in which the reference image serves as the lookahead target and no keyframes need to be generated. Result: Applied to three recent human animation models, the method improves lip synchronization, identity preservation, and visual quality, demonstrating better temporal conditioning across different architectures. Conclusion: Lookahead Anchoring balances motion expressivity against identity consistency, with the lookahead distance controlling the trade-off, and it is broadly applicable because it needs no extra keyframe generation. Abstract: Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: https://lookahead-anchoring.github.io.

[319] FARMER: Flow AutoRegressive Transformer over Pixels

Guangting Zheng,Qinyu Zhao,Tao Yang,Fei Xiao,Zhijie Lin,Jie Wu,Jiajun Deng,Yanyong Zhang,Rui Zhu

Main category: cs.CV

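TL;DR: Proposes FARMER, an end-to-end generative framework that unifies normalizing flows and autoregressive models, enabling tractable likelihood estimation and high-quality image synthesis from raw pixels.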

Details Motivation: Directly modeling the raw data distribution has scaled successfully in language models, but autoregressive modeling over visual pixels is limited by extremely long sequences and high-dimensional spaces. Method: Uses an invertible autoregressive flow to transform images into latent sequences whose distribution is modeled implicitly by an autoregressive model; introduces a self-supervised dimension reduction scheme that partitions the latent channels; and designs a one-step distillation scheme and a resampling-based classifier-free guidance algorithm. Result: Experiments show that FARMER is competitive on pixel-level generation while providing exact likelihoods and scalable training, with improved inference speed and generation quality. Conclusion: FARMER effectively combines NF and AR models, addresses the efficiency and complexity problems of pixel-level autoregressive modeling, and delivers strong image generation together with accurate likelihood estimation. Abstract: Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of this kind underlies the scaling successes of Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
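
The "exact likelihood" claim rests on the standard change-of-variables identity for normalizing flows, combined with an autoregressive factorization of the latent prior. The formula below is that textbook identity, not an equation quoted from the paper; the symbols are assumed notation, with $f$ the invertible flow, $z = f(x)$ the latent sequence, and $N$ its length.

```latex
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|,
\qquad
\log p_Z(z) = \sum_{i=1}^{N} \log p\bigl(z_i \mid z_{<i}\bigr)
```

Because $f$ is invertible and its Jacobian determinant is tractable, both terms can be evaluated exactly, which is what distinguishes this family from models that only bound the likelihood.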

[320] InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras

Erich Liang,Roma Bhattacharjee,Sreemanti Dey,Rafael Moschopoulos,Caitlin Wang,Michel Liao,Grace Tan,Andrew Wang,Karhan Kayan,Stamatis Alexandropoulos,Jia Deng

Main category: cs.CV

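TL;DR: This paper introduces Intrinsics in Flux (InFlux), a real-world video benchmark with dynamic camera intrinsics. It provides per-frame intrinsics annotations for more than 143K frames from 386 high-resolution indoor and outdoor videos, substantially broadening scene and intrinsics diversity, and extends the Kalibr toolbox to ensure annotation accuracy. Experiments show that existing methods perform poorly at predicting dynamic intrinsics.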

Details Motivation: Most 3D vision algorithms assume that camera intrinsics stay constant throughout a video, but intrinsics often change in real in-the-wild footage. Existing datasets lack scene diversity and per-frame intrinsics annotations for consecutive frames, which holds the field back. Method: Builds InFlux, a large-scale real-world benchmark of 386 high-resolution videos with 143K+ frames and per-frame ground-truth intrinsics annotations, using a comprehensive lookup table of calibration experiments and an extended Kalibr toolbox to improve calibration accuracy and robustness. Result: InFlux covers a wider range of intrinsics variation and scene diversity. Evaluation shows that existing intrinsics prediction methods perform poorly on videos with dynamic intrinsics and struggle to predict them accurately. Conclusion: InFlux provides a high-quality benchmark for modeling dynamic camera intrinsics, exposes the weaknesses of existing methods, and encourages future work on estimating time-varying intrinsics in video. Abstract: Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for many real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks. Existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.
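
Why per-frame intrinsics matter can be seen from the pinhole projection itself. The sketch below is a generic illustration and is not taken from the InFlux code or its annotation format: the dictionary keys, the numeric values, and the distortion-free model are assumptions. It projects the same 3D point under two different focal lengths, as would happen during a zoom.

```python
# Hypothetical sketch: a distortion-free pinhole model with made-up values.
# The dict layout is illustrative, not the InFlux annotation format.

def project(point, fx, fy, cx, cy):
    """Project a 3D point (x, y, z) in camera coordinates to pixel
    coordinates: u = fx * x / z + cx, v = fy * y / z + cy."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Intrinsics for two consecutive frames; the focal length changes (a zoom).
per_frame_intrinsics = [
    {"fx": 1000.0, "fy": 1000.0, "cx": 960.0, "cy": 540.0},
    {"fx": 1500.0, "fy": 1500.0, "cx": 960.0, "cy": 540.0},
]

point = (0.5, 0.25, 2.0)  # a static point, in metres, in the camera frame
pixels = [project(point, **k) for k in per_frame_intrinsics]
```

The point and the camera pose are identical in both frames, yet the pixel location moves. An algorithm that assumes constant intrinsics would have to explain that shift as camera or scene motion, which is the error source the benchmark targets.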

[321] PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

Yusu Qian,Cheng Wan,Chao Jia,Yinfei Yang,Qingyu Zhao,Zhe Gan

Main category: cs.CV

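TL;DR: PRISM-Bench is a benchmark of visual puzzles for evaluating the logical consistency and error-detection ability of multimodal large language models in symbolic, geometric, and analogical reasoning. It provides fine-grained diagnosis by asking models to identify the first incorrect step in a reasoning chain.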

Details Motivation: Existing evaluations measure only final-answer accuracy and do not analyze the reasoning process, so they cannot reveal a model's flaws in logical consistency and visual reasoning. Method: Builds visual puzzles that require multi-step reasoning and adds a diagnostic task: given a step-by-step reasoning chain containing exactly one error, the model must identify the first incorrect step, which measures the quality of its reasoning. Result: Experiments show that current state-of-the-art multimodal large language models can produce plausible-looking reasoning chains but often fail to spot simple logical errors in them, exposing unfaithful reasoning. Conclusion: PRISM-Bench offers a finer-grained way to evaluate multimodal reasoning and underlines that building trustworthy models requires diagnostic protocols that separate answer generation from reasoning verification. Abstract: We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
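
The diagnostic task can be scored in a simple way, sketched below. The abstract does not describe the benchmark's scoring script, so exact-match accuracy on the index of the first incorrect step, the function name, and the 1-based step indices are assumptions made for illustration.

```python
# Hypothetical sketch: exact-match scoring is an assumed protocol, not the
# official PRISM-Bench evaluation code.

def first_error_accuracy(predicted, gold):
    """predicted[i] and gold[i] are the (1-based) indices of the first
    incorrect CoT step for puzzle i. Returns the fraction of puzzles where
    the model names exactly the right step."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one puzzle is required")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


gold = [3, 1, 4, 2]        # where the injected error actually is
predicted = [3, 2, 4, 4]   # what the model flagged
accuracy = first_error_accuracy(predicted, gold)
```

Scoring the step index rather than the final answer is what lets this kind of evaluation separate reasoning verification from answer generation.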

[322] PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Yuqian Yuan,Wenqiao Zhang,Xin Li,Shihao Wang,Kehan Li,Wentong Li,Jun Xiao,Lei Zhang,Beng Chin Ooi

Main category: cs.CV

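TL;DR: This paper proposes PixelRefer, a unified region-level multimodal large language model framework that supports fine-grained understanding of user-specified regions in images and videos.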

Details Motivation: Existing MLLMs focus mainly on scene-level understanding and lack fine-grained, object-centric reasoning. Method: Proposes a Scale-Adaptive Object Tokenizer (SAOT) that produces compact, semantically rich object representations, and designs an Object-Centric Infusion module that pre-fuses global context into object tokens, yielding the lightweight PixelRefer-Lite. Result: Across multiple benchmarks, PixelRefer achieves leading performance with fewer training samples, and PixelRefer-Lite improves efficiency considerably while maintaining high accuracy. Conclusion: PixelRefer delivers efficient region-level, fine-grained visual understanding and provides an effective approach for applying multimodal models to object-level tasks. Abstract: Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.

[323] Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

Shuhong Zheng,Ashkan Mirzaei,Igor Gilitschenski

Main category: cs.CV

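TL;DR: This paper proposes TIRE (Track, Inpaint, REsplat), a method for improving subject identity consistency in personalized 3D/4D generation. It combines video tracking, 2D inpainting, and resplatting to 3D, and significantly improves identity preservation across viewpoints.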

Details Motivation: Existing 3D/4D generation methods struggle to keep the subject's semantic identity consistent in personalized content, especially across viewpoints, so a generation method is needed that preserves the identity features of a specific subject. Method: TIRE takes an initial 3D asset produced by an existing 3D generative model, uses video tracking to identify the regions that need modification, progressively fills those regions with a subject-driven 2D inpainting model, and finally resplats the modified multi-view 2D observations back to 3D while maintaining consistency. Result: Experiments show that TIRE significantly improves subject identity preservation in 3D/4D generation compared with state-of-the-art methods, with particularly strong multi-view consistency. Conclusion: TIRE offers an effective approach to personalized 3D/4D content generation that maintains high visual quality while strengthening cross-view consistency of the subject's semantic identity. Abstract: Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.

[324] Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Yujia Zhang,Xiaoyang Wu,Yixing Lao,Chengyao Wang,Zhuotao Tian,Naiyan Wang,Hengshuang Zhao

Main category: cs.CV

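TL;DR: Concerto is a minimalist model inspired by human multisensory concept learning. It learns spatial representations through 3D self-distillation and 2D-3D cross-modal embedding, and significantly outperforms existing methods in 3D scene understanding.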

Details Motivation: Inspired by how humans form abstract concepts through multisensory synergy, the authors aim to build a model whose cross-modal spatial representations can be recalled from a single modality. Method: Combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding to learn multimodal spatial features, and adds variants for video-lifted point cloud understanding and for projecting representations into CLIP's language space. Result: In linear probing, Concerto outperforms state-of-the-art 2D and 3D self-supervised models by 14.2% and 4.8%, respectively; with full fine-tuning it reaches 80.7% mIoU on ScanNet; and it enables open-world perception. Conclusion: Concerto produces spatial representations with better geometric detail and semantic consistency, providing an efficient framework for multimodal spatial cognition. Abstract: Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto yields spatial representations with superior fine-grained geometric and semantic consistency.
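
For readers unfamiliar with the mIoU figure quoted above, the metric can be computed as in the sketch below. This is the generic definition of mean intersection-over-union and not ScanNet's official evaluation script. The function name is illustrative, and skipping classes that appear in neither prediction nor ground truth is one common convention, assumed here.

```python
# Generic mIoU sketch; skipping absent classes is an assumed convention and
# this is not the official ScanNet evaluation code.

def mean_iou(pred, target, num_classes):
    """pred and target are equal-length sequences of per-point class ids.
    IoU for class c = |pred == c and target == c| / |pred == c or target == c|.
    Returns the mean IoU over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six points, three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is one half. Averaging over classes rather than points is what keeps large classes from dominating the score.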