Table of Contents
cs.CL
[1] RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning
Xiang Gao,Yuguang Yao,Qi Zhang,Kaiwen Dong,Avinash Baidya,Ruocheng Guo,Hilaf Hasson,Kamalika Das
Main category: cs.CL
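TL;DR: Proposes RIMRULE, a neuro-symbolic method based on dynamic rule injection. Concise rules are distilled from failure traces and injected into the prompt to improve LLM accuracy on task-specific tools, and the rules transfer across models.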
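As the title indicates, RIMRULE consolidates rules under a Minimum Description Length (MDL) objective. The digest does not give the objective itself, so the following is only a toy two-part-code sketch of the general idea, not the authors' method: the word-count rule cost, the flat `miss_cost` per unexplained failure, the greedy search, and all rule and failure names are assumptions made for illustration.

```python
def description_length(rules, failures, covers, miss_cost=20):
    """Two-part code: cost of stating the rules plus cost of failures left uncovered.

    Assumption: a rule costs one unit per word, and every failure that no
    selected rule covers costs a flat `miss_cost` units.
    """
    rule_cost = sum(len(rule.split()) for rule in rules)
    uncovered = [f for f in failures if not any(covers(r, f) for r in rules)]
    return rule_cost + miss_cost * len(uncovered)


def consolidate(candidates, failures, covers, miss_cost=20):
    """Greedily add whichever rule lowers the total description length most."""
    selected = []
    best = description_length(selected, failures, covers, miss_cost)
    while True:
        options = [
            (description_length(selected + [r], failures, covers, miss_cost), r)
            for r in candidates
            if r not in selected
        ]
        if not options:
            break
        cost, rule = min(options)
        if cost >= best:  # no remaining rule pays for its own length
            break
        selected.append(rule)
        best = cost
    return selected, best


# Hypothetical failure traces and candidate rules (illustrative only).
failures = [
    {"tool": "get_weather", "error": "missing_unit"},
    {"tool": "get_weather", "error": "missing_unit"},
    {"tool": "get_forecast", "error": "missing_unit"},
    {"tool": "book_flight", "error": "bad_date_format"},
]
candidates = {
    "always pass unit to weather tools":
        lambda f: f["error"] == "missing_unit",
    "pass unit to get_weather":
        lambda f: f["tool"] == "get_weather" and f["error"] == "missing_unit",
    "use ISO dates in book_flight":
        lambda f: f["error"] == "bad_date_format",
}


def covers(rule, failure):
    """Look up the predicate attached to a candidate rule."""
    return candidates[rule](failure)


selected, cost = consolidate(list(candidates), failures, covers)
```

In this toy setup the general weather rule is preferred over the narrower `get_weather` rule because it explains more failures for only slightly more words, which is the kind of preference for generality and conciseness the abstract attributes to the MDL objective.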
Details
Motivation: LLMs perform poorly when using domain-specific tools, especially uncommon or under-documented APIs, so effective methods for adapting to task-specific tools are needed.
Method: RIMRULE uses a Minimum Description Length (MDL) objective to distill compact, interpretable rules from failure traces. The rules, which the LLM itself generates, are stored in both natural language and a structured symbolic form and are injected into the prompt at inference time.
Result: On tool-use benchmarks, RIMRULE improves accuracy on both seen and unseen tools without modifying model weights, outperforms prompting-based adaptation methods, and complements fine-tuning. Rules learned from one LLM can also improve other LLMs, including long-reasoning models.
Conclusion: Dynamic rule injection is an efficient, interpretable, and portable way to adapt LLMs. It improves performance in specialized tool environments without updating model parameters and shows that symbolic knowledge can transfer across architectures.
Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.
[2] Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning
Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Main category: cs.CL
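TL;DR: Proposes MetaJuLS, a universal constraint propagation method based on meta-reinforcement learning that adapts quickly to structured inference across languages and tasks, substantially speeding up inference and reducing its carbon footprint.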
Details
Motivation: LLMs need structured inference that satisfies complex constraints in tasks such as JSON schema enforcement and multilingual parsing, but existing methods depend on task-specific training, which is inefficient and generalizes poorly.
Method: Structured inference is formulated as adaptive constraint propagation. A Graph Attention Network (GAT) is trained with meta-reinforcement learning to produce a universal policy that adapts to new tasks and languages in very few gradient steps.
Result: On Universal Dependencies across 10 languages and on constrained LLM generation (LogicBench, GSM8K-Constrained), MetaJuLS runs 1.5-2.0× faster than GPU-optimized baselines while staying within 0.2% of the accuracy of state-of-the-art parsers. Cross-domain adaptation takes only 5-10 gradient steps (5-15 seconds). Mechanistic analysis shows that the policy learns a human-like easy-first parsing strategy along with novel heuristics.
Conclusion: MetaJuLS achieves universal constraint propagation without task-specific retraining, improving the efficiency and generalization of structured inference, and it supports Green AI by reducing inference steps.
Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5-2.0× speedups over GPU-optimized baselines while maintaining within 0.2% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5-10 gradient steps (5-15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.
[3] Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description
Yongmin Yoo,Kris W Pan
Main category: cs.CL
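TL;DR: Proposes Pat-DEVAL, the first multi-dimensional framework dedicated to evaluating patent descriptions. It combines an LLM judge with legally constrained reasoning (CoLT), and its assessments of both technical soundness and legal compliance correlate strongly with expert judgments.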
Details
Motivation: Existing evaluation methods for automated patent generation cannot effectively measure long-form structural coherence or compliance with statutory requirements such as enablement and written description.
Method: Pat-DEVAL follows the LLM-as-a-judge paradigm and introduces Chain-of-Legal-Thought (CoLT), which forces the model to evaluate step by step in the logical order of patent law. It is validated on the authors' Pap2Pat-EvalGold dataset.
Result: Pat-DEVAL reaches an overall Pearson correlation of 0.69, outperforming baseline metrics and existing LLM evaluators, and 0.73 on the legal compliance dimension, which supports the value of injecting statutory constraints.
Conclusion: By incorporating legally constrained reasoning, Pat-DEVAL provides a reliable evaluation basis, covering both technical soundness and legal compliance, for the practical deployment of automated patent drafting systems.
Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by patent experts on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.
[4] Understanding Emotion in Discourse: Recognition Insights and Linguistic Patterns for Generation
Cheonkam Jeong,Adeline Nyamathi
Main category: cs.CL
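TL;DR: Through a systematic analysis of the IEMOCAP dataset, this paper studies the key architectural choices in emotion recognition in conversation (ERC) and their connection to language generation. For recognition, conversational context is critical, while hierarchical sentence representations and external affective lexicons bring no significant gain. For linguistic analysis, emotion is significantly associated with discourse marker position; in particular, "sad" utterances use fewer left-periphery markers and rely on context for disambiguation.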
Details
Motivation: Current work on emotion recognition in conversation (ERC) offers limited understanding of which architectural choices matter and lacks a linguistic analysis connecting recognition to generation.
Method: Systematic ablation experiments on IEMOCAP, evaluated over 10 random seeds, measure the effect of each architectural component. A statistical analysis of 5,286 discourse marker occurrences examines how emotion categories relate to discourse structure.
Result: About 90% of the recognition gain comes from the most recent 10-30 turns of conversational context; hierarchical representations add nothing once context is provided; external affective lexicons such as SenticNet bring no improvement. A simple causal architecture reaches 82.69% (4-way) and 67.07% (6-way) weighted F1, surpassing prior text-only methods. The linguistic analysis shows a significant association between emotion and discourse marker position (p < .0001); "sad" utterances in particular have a lower rate of left-periphery markers (21.9% vs. 28-32%).
Conclusion: Conversational context is central to emotion recognition and can stand in for hierarchical structure and external lexicons. Emotional expression follows quantifiable linguistic patterns: sadness depends more on context for disambiguation because it lacks explicit pragmatic signals, which offers a basis for modeling recognition and generation jointly.
Abstract: While Emotion Recognition in Conversation (ERC) has achieved high accuracy, two critical gaps remain: a limited understanding of which architectural choices actually matter, and a lack of linguistic analysis connecting recognition to generation. We address both gaps through a systematic analysis of the IEMOCAP dataset. For recognition, we conduct a rigorous ablation study with 10-seed evaluation and report three key findings. First, conversational context is paramount, with performance saturating rapidly: 90% of the total gain is achieved within just the most recent 10-30 preceding turns (depending on the label set). Second, hierarchical sentence representations help at utterance-level, but this benefit disappears once conversational context is provided, suggesting that context subsumes intra-utterance structure. Third, external affective lexicons (SenticNet) provide no gain, indicating that pre-trained encoders already capture necessary emotional semantics. With simple architectures using strictly causal context, we achieve 82.69% (4-way) and 67.07% (6-way) weighted F1, outperforming prior text-only methods including those using bidirectional context. For linguistic analysis, we analyze 5,286 discourse marker occurrences and find a significant association between emotion and marker positioning (p < .0001). Notably, "sad" utterances exhibit reduced left-periphery marker usage (21.9%) compared to other emotions (28-32%), consistent with theories linking left-periphery markers to active discourse management. This connects to our recognition finding that sadness benefits most from context (+22 percentage points): lacking explicit pragmatic signals, sad utterances require conversational history for disambiguation.
[5] Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models
Wang Xing,Wei Song,Siyu Lin,Chen Wu,Zhesi Li,Man Wang
Main category: cs.CL
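TL;DR: Proposes a distillation framework for temporal knowledge graph reasoning that uses large language models as teachers to transfer structural and temporal reasoning ability to lightweight student models, achieving a good balance of accuracy, efficiency, and deployability on multiple benchmark datasets.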
Details
Motivation: Existing temporal knowledge graph (TKG) reasoning models have many parameters and high computational cost, which makes them hard to deploy on resource-constrained devices. Conventional compression methods also struggle to capture temporal dependencies, which degrades performance.
Method: A distillation framework designed for TKG reasoning uses large language models as teachers and combines large-scale public knowledge with task-specific temporal information to guide lightweight student models in learning structural and temporal reasoning.
Result: Experiments on multiple public benchmark datasets show that the method achieves a better balance between reasoning accuracy and computational efficiency than strong baselines.
Conclusion: The proposed framework improves lightweight models on TKG reasoning while keeping them practical to deploy, supporting the use of TKGs in low-power and distributed settings.
Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model's ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
[6] From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark
Jinning Zhang,Jie Song,Wenhui Tu,Zecheng Li,Jingxuan Li,Jin Li,Xuan Liu,Taole Sha,Zichen Wei,Yan Li
Main category: cs.CL
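TL;DR: Proposes a way to build evidence-based medicine (EBM) principles into graph-based retrieval-augmented generation (RAG). PICO alignment and a Bayesian-inspired reranking algorithm based on evidence grade improve the quality and trustworthiness of medical LLM outputs, and the approach is validated in sports rehabilitation with a new knowledge graph and benchmark dataset.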
Details
Motivation: Existing RAG methods for medical applications overlook evidence-based medicine (EBM) principles. They lack PICO alignment and ignore the evidence hierarchy, so retrieved results and generated answers fall short on clinical reliability.
Method: The PICO framework is integrated into knowledge graph construction and retrieval, and a Bayesian-inspired reranking algorithm calibrates ranking scores by evidence grade without introducing predefined weights, yielding a RAG system consistent with EBM principles.
Result: For sports rehabilitation, the authors build a knowledge graph with 357,844 nodes and 371,226 edges and a benchmark of 1,637 QA pairs. The system reaches 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. Five expert clinicians rated it 4.66-4.84 on a 5-point Likert scale.
Conclusion: The proposed EBM adaptation strategy improves retrieval and answer quality in RAG systems and is transferable across clinical domains, and the released resources fill a gap in RAG datasets for sports rehabilitation.
Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.
[7] JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation
Leonard Lin,Adam Lensenmayer
Main category: cs.CL
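TL;DR: JP-TL-Bench is a lightweight open benchmark for evaluating Japanese-English translation systems. It uses pairwise LLM comparisons aggregated with a Bradley-Terry model to produce stable, reliable scores.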
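The Bradley-Terry aggregation this entry refers to can be illustrated with a minimal sketch. It fits strengths with the standard minorization-maximization (Zermelo) update on a made-up win matrix. The exact form of the benchmark's "LT" score is not given in this digest, so `lt_like_score` below is a hypothetical logistic transform (centered on the mean log-strength, unit scale), not the benchmark's own formula.

```python
import math


def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the classic MM (Zermelo) update.

    wins[i][j] is the number of times item i beat item j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Only pairs that were actually compared contribute.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


def lt_like_score(log_strength, center, scale=1.0):
    """Hypothetical 0-10 score: logistic transform of a centered log-strength."""
    return 10.0 / (1.0 + math.exp(-(log_strength - center) / scale))


# Made-up pairwise results for three systems (rows beat columns).
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(wins)
log_strengths = [math.log(s) for s in strengths]
center = sum(log_strengths) / len(log_strengths)
scores = [lt_like_score(ls, center) for ls in log_strengths]
```

Because the transform is monotone, the ranking of `scores` matches the ranking of the fitted strengths; the logistic only compresses them into a bounded, easier-to-read range.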
Details
Motivation: Traditional translation evaluation struggles to capture subtle differences in politeness, implicature, ellipsis, and register in Japanese-English translation, so a finer-grained evaluation method is needed.
Method: The benchmark uses reference-free pairwise LLM judgments that compare a candidate model against a fixed anchor set. Results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 LT score.
Result: The protocol gives reliable, low-cost translation quality assessment with structurally stable scores, which supports fair comparison across models.
Conclusion: JP-TL-Bench offers an efficient, transparent, and reproducible evaluation framework for the iterative development of Japanese-English translation systems.
Abstract: We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often "which of these two good translations is better?" rather than "is this translation acceptable?" This distinction matters for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 "LT" score derived from a logistic transform of fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same base set, judge, and aggregation code.
[8] Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback
Yan Sun,Ming Cai,Stanley Kok
Main category: cs.CL
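TL;DR: Proposes a verification framework for enterprise generative AI systems. Two techniques, Q* and Feedback+, improve the accuracy and reliability of conversational business analytics systems.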
Details
Motivation: Existing conversational business analytics systems lack built-in verification, so users must check results by hand, which hurts efficiency and trust.
Method: Two verification techniques are proposed: Q* (reverse translation and semantic matching) and Feedback+ (code refinement guided by execution feedback). Both are embedded in a generator-discriminator framework to automate verification.
Result: Experiments on three benchmark datasets (Spider, Bird, and GSM8K) show that both Q* and Feedback+ reduce error rates and task completion time, although reverse translation remains a bottleneck.
Conclusion: The work contributes a design-oriented framework for building more reliable and trustworthy enterprise generative AI systems for decision support.
Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
[9] Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation
Qianli Wang,Van Bach Nguyen,Yihong Liu,Fedor Splitt,Nils Feldhus,Christin Seifert,Hinrich Schütze,Sebastian Möller,Vera Schmitt
Main category: cs.CL
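TL;DR: Studies how well LLMs generate multilingual counterfactual examples. Translation-based counterfactuals have higher validity but require many more edits and still fall short of original English counterfactuals in quality. The study also finds shared edit patterns across languages, identifies common error types, and shows that multilingual counterfactual data augmentation beats the cross-lingual variant, although generation quality limits the resulting gains in model performance.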
Details
Motivation: To examine how effective LLMs are at generating counterfactuals in multilingual settings and where they fall short, and to assess their performance across languages and their potential for model explanation and data augmentation.
Method: Automatic evaluation across six languages compares directly generated counterfactuals with those obtained via English translation in terms of validity, amount of modification, and quality. Edit patterns and error types are analyzed, and the effect of multilingual versus cross-lingual counterfactual data augmentation on model performance is measured.
Result: Translation-based counterfactuals are more valid but need more modifications and remain below original English counterfactuals in quality. Edit patterns are similar across high-resource European languages. Four common cross-lingual error types are identified. Multilingual counterfactual data augmentation is more effective than the cross-lingual approach, especially for lower-resource languages, but generation quality limits the overall gain.
Conclusion: Multilingual counterfactual generation still has room to improve; generation strategies should take language-specific properties into account to raise quality and practical usefulness.
Abstract: Counterfactuals refer to minimally edited inputs that cause a model's prediction to change, serving as a promising approach to explaining the model's behavior. Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. However, their effectiveness in generating multilingual counterfactuals remains unclear. To this end, we conduct a comprehensive study on multilingual counterfactuals. We first conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. Although translation-based counterfactuals offer higher validity than their directly generated counterparts, they demand substantially more modifications and still fall short of matching the quality of the original English counterfactuals. Second, we find the patterns of edits applied to high-resource European-language counterfactuals to be remarkably similar, suggesting that cross-lingual perturbations follow common strategic principles. Third, we identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages. Finally, we reveal that multilingual counterfactual data augmentation (CDA) yields larger model performance improvements than cross-lingual CDA, especially for lower-resource languages. Yet, the imperfections of the generated counterfactuals limit gains in model performance and robustness.
[10] Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
Doyoung Kim,Zhiwei Ren,Jie Hao,Zhongkai Sun,Lichao Wang,Xiyao Ma,Zack Ye,Xu Han,Jun Yin,Heng Ji,Wei Shen,Xing Fan,Benjamin Yao,Chenlei Guo
Main category: cs.CL
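TL;DR: WildAGTEval is a new benchmark for evaluating the function-calling ability of LLM agents under realistic API complexity. It accounts for real-world challenges in both API specification and API execution, and it exposes weaknesses of current LLMs in handling noise and preserving user intent.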
Details
Motivation: Existing LLM evaluations usually assume an idealized API environment and ignore noisy API outputs and runtime problems, so they do not reflect how models behave in real applications. A more realistic benchmark is needed.
Method: The authors build an API system with 60 distinct complexity scenarios that can be composed into roughly 32,000 test configurations, and they collect user-agent interactions to evaluate LLM agents along two dimensions: API specification (documentation and usage constraints) and API execution (runtime challenges).
Result: Most scenarios are challenging for current advanced LLMs. Irrelevant-information complexity is the hardest and reduces the performance of strong LLMs by 27.3%. Qualitative analysis shows that LLMs sometimes distort user intent in order to claim task completion, which seriously affects user satisfaction.
Conclusion: WildAGTEval exposes the limits of current LLM agents under real-world API complexity and points to the need for models that better understand realistic API environments and carry out tasks faithfully.
Abstract: We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents' function-calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real-world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real-world complexity: 1. API specification, which includes detailed documentation and usage constraints, and 2. API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant information complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally distort user intent merely to claim task completion, critically affecting user satisfaction.
[11] Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations
Qianli Wang,Nils Feldhus,Pepa Atanasova,Fedor Splitt,Simon Ostermann,Sebastian Möller,Vera Schmitt
Main category: cs.CL
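TL;DR: Investigates how quantization affects the quality and faithfulness of LLM self-explanations (SEs). Quantization typically causes moderate declines, but these do not undermine its effectiveness as a model compression technique.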
Details
Motivation: Self-explanations (SEs) are increasingly relied upon for transparency in high-stakes applications, and quantization may affect a model's decision-making process, so its impact on SE quality and faithfulness needs to be studied.
Method: LLMs are quantized with three common techniques at different bit widths. Two kinds of SEs, natural language explanations (NLEs) and counterfactual examples, are evaluated, and a user study assesses their coherence and trustworthiness.
Result: Quantization typically lowers SE quality by up to 4.4% and faithfulness by up to 2.38%. In the user study, coherence and trustworthiness drop by up to 8.5%. Larger models preserve faithfulness better, but no single quantization technique is best on every metric.
Conclusion: Quantization has some negative effect on self-explanations, with NLEs being more sensitive, but the overall impact is small and does not weaken its standing as an effective compression method. SE quality should be validated for each specific use case.
Abstract: Quantization is widely used to accelerate inference and streamline the deployment of large language models (LLMs), yet its effects on self-explanations (SEs) remain unexplored. SEs, generated by LLMs to justify their own outputs, require reasoning about the model's own decision-making process, a capability that may exhibit particular sensitivity to quantization. As SEs are increasingly relied upon for transparency in high-stakes applications, understanding whether and to what extent quantization degrades SE quality and faithfulness is critical. To address this gap, we examine two types of SEs: natural language explanations (NLEs) and counterfactual examples, generated by LLMs quantized using three common techniques at distinct bit widths. Our findings indicate that quantization typically leads to moderate declines in both SE quality (up to 4.4%) and faithfulness (up to 2.38%). The user study further demonstrates that quantization diminishes both the coherence and trustworthiness of SEs (up to 8.5%). Compared to smaller models, larger models show limited resilience to quantization in terms of SE quality but better maintain faithfulness. Moreover, no quantization technique consistently excels across task accuracy, SE quality, and faithfulness. Given that quantization's impact varies by context, we recommend validating SE quality for specific use cases, especially for NLEs, which show greater sensitivity. Nonetheless, the relatively minor deterioration in SE quality and faithfulness does not undermine quantization's effectiveness as a model compression technique.
[12] DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection
Yuxin Li,Xiangyu Zhang,Yifei Li,Zhiwei Guo,Haoyang Zhang,Eng Siong Chng,Cuntai Guan
Main category: cs.CL
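TL;DR: Proposes DepFlow, a three-stage depression-conditioned text-to-speech framework that mitigates semantic bias in depression detection models, and builds the CDoA dataset to make those models more robust.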
Details
Motivation: In existing depression datasets, linguistic sentiment is strongly coupled with diagnostic labels, so models learn semantic shortcuts and perform poorly in real-world cases such as camouflaged depression, where the language is positive but the speaker is depressed.
Method: DepFlow has three stages: (1) adversarial training learns a depression acoustic embedding that is invariant to speaker and content; (2) a flow-matching TTS model with FiLM modulation injects the depression embedding into synthesis; (3) a prototype-based severity mapping allows continuous, controllable synthesis of depressed speech. The framework is then used to construct the CDoA dataset.
Result: The depression acoustic encoder disentangles depression features while retaining discriminability (ROC-AUC 0.693). The CDoA dataset improves macro-F1 for three depression detection models by 9%, 12%, and 5% respectively, outperforming conventional augmentation.
Conclusion: DepFlow mitigates semantic bias in depression detection and improves recognition of camouflaged depression, and it provides a controllable speech synthesis platform suited to data-limited clinical settings.
Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
[13] Robust Uncertainty Quantification for Factual Generation of Large Language Models
Yuhao Zhang,Zhongliang Yang,Linna Zhou
Main category: cs.CL
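TL;DR: Proposes a new uncertainty quantification method (RU) for detecting and mitigating LLM hallucination in multi-fact generation, with a focus on trap questions that contain fake names. Across four models, the method clearly outperforms baselines, raising ROC-AUC by 0.1-0.2 on average.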
Details
Motivation: Existing uncertainty quantification methods work well on conventional question answering but perform poorly on non-standard or adversarial questions, which limits their reliability in real settings. A more robust method is needed to handle hallucination in complex generation tasks.
Method: The authors construct a dataset of trap questions containing fake names and propose a robust uncertainty quantification method (RU), using a multi-fact generation task to evaluate how sensitive models are to hallucinated content and how well it can be identified.
Result: The trap questions reliably trigger hallucination, and RU outperforms existing baselines on four mainstream LLMs, with an average ROC-AUC gain of 0.1-0.2 over the best baseline.
Conclusion: RU detects LLM hallucination more robustly and accurately, offering a new approach and a practical tool for trustworthy AI in complex scenarios.
Abstract: The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario in the task of generating with multiple facts. We have meticulously constructed a set of trap questions containing fake names. Based on this scenario, we propose a novel and robust uncertainty quantification method (RU). A series of experiments have been conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method has demonstrated great performance, with an average increase of 0.1-0.2 in ROC-AUC values compared to the best performing baseline method, providing new insights and methods for addressing the hallucination issue of LLMs.
[14] The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
Jiandong Shao,Raphael Tang,Crystina Zhang,Karin Sevegnani,Pontus Stenetorp,Jianfei Yang,Yao Lu
Main category: cs.CL
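TL;DR: Studies the role of bilingual data in multilingual LLM pretraining. Although bilingual data makes up only 2% of the corpus, removing it cuts translation performance by 56% in BLEU, while other cross-lingual tasks stay stable. Fine-grained ablations show that parallel text restores almost all translation ability (91% of the baseline) whereas code-switching data contributes very little. Translation therefore depends on token-level alignments in parallel data, while cross-lingual understanding and reasoning do not need bilingual data.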
Details
Motivation: To determine what bilingual data actually contributes to the cross-lingual abilities of multilingual LLMs, and in particular why such a small share of the corpus has such a large effect on translation.
Method: Models are pretrained from scratch under controlled conditions, comparing a standard web corpus with a monolingual-only corpus from which all multilingual documents are removed. Bilingual data is further split into parallel, code-switching, and miscellaneous documents, and fine-grained ablations reintroduce parallel or code-switching data.
Result: Removing bilingual data lowers translation performance by 56% in BLEU, while cross-lingual QA and reasoning stay stable. Reintroducing parallel data restores 91% of translation performance, whereas code-switching data has little effect; other tasks are not significantly affected by either type.
Conclusion: Translation depends heavily on the systematic token-level alignments that parallel data provides, while cross-lingual understanding and reasoning can be achieved without bilingual data, so different cross-lingual tasks depend on bilingual data in fundamentally different ways.
Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.
[15] BERT-JEPA: Reorganizing CLS Embeddings for Language-Invariant Semantics
Taj Gillin,Adam Lalani,Kenneth Zhang,Marcel Mateos Salles
Main category: cs.CL
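TL;DR: BERT-JEPA (BEPA) is a new training paradigm that adds a JEPA objective to BERT models. It counteracts collapse of the [CLS] embedding space, yields language-agnostic representations, and improves performance on multilingual tasks.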
Details
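Motivation: In multilingual settings, the [CLS] embedding space of BERT-style models collapses and remains tied to specific languages; the goal is to strengthen cross-lingual representations.
Method: A JEPA (Joint Embedding Predictive Architecture) self-supervised training objective is added to BERT-style models, using prediction to learn better joint embeddings.
Result: The approach achieves higher performance on several multilingual benchmarks, supporting its effectiveness at building a language-agnostic embedding space.
Conclusion: The JEPA objective improves embedding quality in BERT-style models and offers a new direction for multilingual representation learning.
Abstract: Joint Embedding Predictive Architectures (JEPA) are a novel self-supervised training technique that has shown recent promise across domains. We introduce BERT-JEPA (BEPA), a training paradigm that adds a JEPA training objective to BERT-style models, working to combat a collapsed [CLS] embedding space and turning it into a language-agnostic space. This new structure leads to increased performance across multilingual benchmarks.
[16] Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach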
Biao Wu,Meng Fang,Ling Chen,Ke Xu,Tao Cheng,Jun Wang
Main category: cs.CL
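TL;DR: Proposes Geo-R, a retrieval-free image geolocalization framework. It uses reinforcement learning and rule-based hierarchical reasoning (Chain of Region) to derive interpretable reasoning paths from ground-truth coordinates, improving localization accuracy and generalization.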
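This entry's abstract describes coordinate-aligned rewards based on Haversine distance. The distance itself is a standard formula and is sketched below; how Geo-R turns distance into a reward is not stated in this digest, so the exponential decay in `distance_reward` and its `scale_km` parameter are assumptions chosen purely for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, target, scale_km=1000.0):
    """Hypothetical reward: 1.0 at zero distance, decaying exponentially with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)


paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
tokyo = (35.6762, 139.6503)

near = distance_reward(london, paris)  # prediction a few hundred km away
far = distance_reward(tokyo, paris)    # prediction on another continent
```

A smooth, distance-based signal like this gives partial credit for near misses, which is one plausible reason to prefer it over an exact-match reward when the target is a continuous coordinate.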
Details
Motivation: Existing geolocalization methods built on vision-language models depend on synthetic data or external retrieval, which limits interpretability and generalization.
Method: Chain of Region maps GPS coordinates to geographic entities such as country, province, and city to produce structured reasoning paths. A lightweight reinforcement learning stage then optimizes localization with a coordinate-aligned reward based on Haversine distance.
Result: Experiments on multiple benchmarks confirm the effectiveness of Geo-R, which achieves higher localization accuracy and better generalization while providing a transparent reasoning process.
Conclusion: Geo-R establishes a scalable, interpretable, retrieval-free paradigm for image geolocalization and brings geographic reasoning together with spatial supervision.
Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
[17] Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset
Alistair Plum,Laura Bernardy,Tharindu Ranasinghe
Main category: cs.CL
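TL;DR: Proposes and evaluates a new method that uses Wikipedia and Wikidata as weak supervision sources, with large language models automatically labelling and verifying the data, to build judgeWEL, a Luxembourgish named entity recognition dataset.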
Details
Motivation: Low-resource languages face costly and inconsistent large-scale annotation in natural language processing because of scarce resources and linguistic particularities. Method: Uses internal links in Wikipedia articles to infer entity types from the corresponding Wikidata entries, producing initial annotations, and then compares several large language models to identify and retain high-quality labelled sentences. Result: The resulting corpus is about five times larger than the existing Luxembourgish NER dataset and offers broader, more balanced coverage across entity categories. Conclusion: The method eases the annotation bottleneck for low-resource languages and provides an important new resource for multilingual and low-resource NER research. Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLMs) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.
[18] Toward Better Temporal Structures for Geopolitical Events Forecasting
Kian Ahrabian,Eric Boxer,Jay Pujara
Main category: cs.CL
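TL;DR: This paper proposes a new formalization, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs), to address the limitations of conventional HTKGs in representing complex geopolitical events that involve multiple actors, builds the htkgh-polecat dataset on the POLECAT database, and evaluates large language models on relation prediction for this task.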
Details
Motivation: Existing hyper-relational temporal knowledge graphs (HTKGs) cannot effectively express complex geopolitical events with more than two primary entities and lack the capacity to model complex facts. Method: Gives a formal definition of HTKGHs that supports two common types of complex geopolitical facts; builds the htkgh-polecat dataset from POLECAT; benchmarks popular large language models on the relation prediction task. Result: Produces an HTKGH formalization that is backward compatible with the HTKG structure, together with the htkgh-polecat dataset; experiments indicate that large language models show some promise on complex temporal forecasting but still face challenges. Conclusion: HTKGHs model complex geopolitical events more effectively and provide a new direction and data foundation for future LLM-based complex knowledge reasoning. Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.
[19] Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment
Muhammad Shahmeer Khan
Main category: cs.CL
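TL;DR: This study compares three lightweight Transformer models (DistilBERT, MiniLM, and ALBERT) on three enterprise NLP tasks and finds a trade-off between accuracy and efficiency: ALBERT is the most accurate, MiniLM leads in inference speed, and DistilBERT is the most balanced overall.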
Details
Motivation: As enterprise natural language processing develops rapidly, efficient and lightweight models are needed to support multi-domain text automation, but the performance trade-offs among lightweight models in practice remain unclear, so a systematic comparison is needed to guide deployment. Method: Compares DistilBERT, MiniLM, and ALBERT on customer sentiment classification, news topic classification, and hate speech detection, using the IMDB, AG News, and Measuring Hate Speech datasets. Evaluation covers performance metrics (accuracy, precision, recall, F1-score) and efficiency metrics (model size, inference time, throughput, memory usage), with fine-tuning under fixed enterprise-oriented constraints. Result: ALBERT achieves the highest task-specific accuracy on multiple tasks, MiniLM is best in inference speed and throughput, and DistilBERT has the most consistent accuracy across tasks with good efficiency; no single model dominates on all metrics. Conclusion: Lightweight models show a clear trade-off between accuracy and efficiency; the study recommends MiniLM for latency-sensitive scenarios, DistilBERT for balanced performance, and ALBERT for resource-constrained environments. Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models (DistilBERT, MiniLM, and ALBERT) across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
[20] Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games
Dimitris Vartziotis
Main category: cs.CL
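TL;DR: This paper examines how large language models (LLMs) offer a new empirical perspective on theories of linguistic meaning, contrasts social constructivism with a mathematically oriented Semantic Field Theory, and argues that the two are complementary rather than opposed.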
Details
Motivation: To investigate how the successes and limitations of large language models reflect the relationship between the mathematical structure of language and its social grounding. Method: Building on the author's earlier work, formalizes the notions of lexical fields and linguistic fields and analyzes how properties of the Transformer architecture (distributed representations, attention mechanisms, and geometric regularities in embedding spaces) relate to these concepts. Result: The success of LLMs in capturing semantic regularities supports the view that language has an underlying mathematical structure, while their limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding in language use. Conclusion: Mathematical structure and language games can be seen as complementary perspectives; the framework helps clarify the limits of purely statistical language models and motivates theoretically informed AI architectures. Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures (such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces) relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.
[21] Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
Hyunjun Kim
Main category: cs.CL
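TL;DR: Proposes the Defensive M2S training paradigm, which compresses multi-turn conversations into single-turn form, substantially reducing the training and inference cost of guardrail models while improving attack detection recall.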
Details
Motivation: Processing full multi-turn conversation histories is computationally expensive, which limits the scalability of guardrail models on long conversations. Method: Proposes a training paradigm based on Multi-turn to Single-turn (M2S) compressed conversations and fine-tunes three guardrail model families with three compression templates. Result: M2S reduces training complexity from O(n²) to O(n) and uses 93× fewer training tokens; the best configuration reaches 93.8% detection recall with 94.6% fewer inference tokens. Conclusion: M2S compression is an efficient technique for deploying guardrail models, improving safety detection performance while substantially lowering cost. Abstract: Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline, a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
[22] Noise-Aware Named Entity Recognition for Historical VET Documents
Alexander M. Esser,Jens Dörpinghaus
Main category: cs.CL
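TL;DR: This paper proposes a named entity recognition (NER) method for historical documents in the Vocational Education and Training (VET) domain that uses noise-aware training, transfer learning, and multi-stage fine-tuning to substantially improve robustness and accuracy under OCR noise.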
Details
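Motivation: Digitized historical VET documents contain OCR-induced noise that degrades conventional NER methods, and research on recognizing multiple entity types with noise robustness in this setting is lacking. Method: Proposes an approach based on Noise-Aware Training (NAT) that trains with synthetically injected OCR errors and combines transfer learning with multi-stage fine-tuning; systematically compares three strategies, training on noisy, clean, and artificial data. Result: Experiments show that domain-specific, noise-aware fine-tuning substantially improves robustness and accuracy under noisy conditions; the method is applied to German documents but is transferable to other languages. Conclusion: The method is among the first NER approaches to recognize multiple entity types in VET documents while coping with OCR noise, and it offers good reproducibility and potential for cross-lingual use. Abstract: This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
[23] Rule-Based Approaches to Atomic Sentence Extraction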
Lineesha Kamana,Akshita Ananda Subramanian,Mehuli Ghosh,Suman Saha
Main category: cs.CL
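TL;DR: This study examines how complex sentence structures (such as relative clauses, adverbial clauses, coordination, and passive voice) affect rule-based atomic sentence extraction. It implements dependency-based extraction rules in spaCy on the WikiSplit dataset and evaluates them with ROUGE and BERTScore; the results show that the system still struggles with some complex structures.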
Details
Motivation: Existing LLM-based atomic sentence extraction methods lack interpretability, and no systematic analysis has identified which syntactic structures cause extraction failures, so a principled study is needed to show how different complex structures affect extraction performance. Method: Implements a rule-based extraction system on top of spaCy dependency parsing, creates 100 gold-standard atomic sentence sets from the WikiSplit dataset, and evaluates automatically with ROUGE and BERTScore. Result: The system achieves ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898; relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions are the main difficulties. Conclusion: Rule-based atomic sentence extraction is reasonably accurate but sensitive to syntactic complexity; complex structures remain the main challenge, and future work should target their handling. Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
[24] Retrieval-Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends
Yuelyu Ji,Zhuochun Li,Rui Meng,Daqing He
Main category: cs.CL
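TL;DR: This paper proposes a four-axis framework for analyzing the retrieval-reasoning process in multi-hop question answering systems, covering execution plan, index structure, next-step control, and stopping criteria, and summarizes the trade-offs and trends of existing methods on standard benchmarks.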
Details
Motivation: Existing multi-hop question answering systems perform well, but their retrieval and reasoning processes are often opaque, making procedural choices hard to compare across models. Method: Proposes a four-axis analysis framework centred on the execution procedure (A: execution plan, B: index structure, C: next-step control, D: stop/continue criteria) and uses it to organize representative systems together with their reported ablations and performance trends on standard datasets. Result: Mapping mainstream multi-hop QA systems onto the framework reveals common trade-offs among effectiveness, efficiency, and evidence faithfulness, and summarizes current methodological tendencies. Conclusion: Multi-hop QA systems need clearer modelling of the execution process; open challenges include structure-aware planning, transferable control policies, and robust stopping under distribution shift. Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval-reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval-reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.
[25] ECR: Manifold-Guided Semantic Cues for Compact Language Models
Chung-Wei Victor Yuan
Main category: cs.CL
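TL;DR: Proposes a new framework called Embedding Consistency Regulation (ECR), which preserves the structure of the embedding space by keeping a compact model's geometry consistent around semantic anchors, improving model compression in multilingual and low-capacity settings.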
Details
Motivation: Under tight capacity or in multilingual settings, existing model compression methods tend to collapse the structure of the embedding space, causing semantic drift and hurting downstream task performance. Method: ECR extracts semantic anchors from teacher-model embeddings and trains the compact model to maintain consistent geometry around these anchors; it adds only a small projection step at inference and does not change the decoding architecture. Result: Experiments on a 100K-sample multilingual corpus show that ECR stabilizes training, better preserves semantic structure, and yields a more compact, task-aligned representation space than conventional baselines. Conclusion: ECR helps compact models meet task requirements under strict efficiency or privacy limits; it needs no teacher outputs and can be used independently of knowledge distillation. Abstract: Compact models often lose the structure of their embedding space. The issue shows up when the capacity is tight or the data spans several languages. Such collapse makes it difficult for downstream tasks to build on the resulting representation. Existing compression methods focus on aligning model outputs at a superficial level but fail to preserve the underlying manifold structure. This mismatch often leads to semantic drift in the compact model, causing both task behavior and linguistic properties to deviate from the reference model. To address those issues, we provide a new framework called Embedding Consistency Regulation (ECR). This framework first derives a set of semantic anchors from teacher embeddings (computed once offline). Then, the compact model learns to maintain consistent geometry around these anchors, without relying on matching logits or internal features. ECR adds only a small projection step at inference, without altering the decoding architecture or its runtime behavior. In experiments on a 100K multilingual corpus, ECR consistently stabilizes training and preserves semantic structure across tasks and languages. It also produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines. ECR works without teacher outputs and is compatible with, but independent of, distillation. Taken together, our results show that ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits.
[26] A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR
Yuang Zheng,Yuxiang Mei,Dongxing Xu,Jie Chen,Yanhua Long
Main category: cs.CL
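TL;DR: Proposes HLoRA, a lightweight, language-agnostic multilingual ASR system built on a CTC architecture and a hierarchical LoRA-MoE framework, which decodes in a single pass without language identity labels and substantially improves decoding efficiency in low-resource settings.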
Details
Motivation: Large-scale multilingual ASR models such as Whisper are computationally expensive and hard to deploy on resource-constrained edge devices, and they usually depend on language identity information, which limits practical use. Method: Proposes the Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework, integrated into an mHuBERT-CTC model, with end-to-end decoding through LoRA routing driven by LID posteriors. The design is hierarchical: a shared LoRA learns language-invariant features, language-specific LoRA experts capture language-dependent characteristics, and no language label is needed at inference. Result: On the MSR-86K and MLC-SLM 2025 Challenge datasets, HLoRA matches state-of-the-art two-stage methods while using only single-pass decoding, substantially improving decoding efficiency for low-resource multilingual ASR. Conclusion: HLoRA delivers efficient, truly language-agnostic multilingual ASR suited to resource-constrained environments and offers an effective solution for low-resource multilingual speech recognition. Abstract: Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.
[27] InfoSynth: Information-Guided Benchmark Synthesis for LLMs
Ishir Garg,Neel Kolhe,Xuandong Zhao,Dawn Song
Main category: cs.CL
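TL;DR: This paper proposes InfoSynth, an information-theoretic framework for automatically generating reasoning benchmarks, which uses KL-divergence and entropy to measure benchmark novelty and diversity and generates high-quality Python programming problems through genetic algorithms and iterative code feedback.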
Details
Motivation: Existing LLM benchmarks are mostly built by hand, which is costly and slow, and they suffer from training-data contamination, making it hard to assess a model's true reasoning ability. A method is therefore needed that generates novel and diverse benchmarks automatically and efficiently. Method: Proposes the InfoSynth framework, which quantifies benchmark novelty and diversity with information-theoretic metrics based on KL-divergence and entropy, and builds an end-to-end pipeline that combines genetic algorithms with iterative code feedback to generate Python coding problems, test cases, and solutions from seed datasets, with control over problem difficulty and diversity. Result: The method produces correct test cases and solutions 97% of the time, the synthesized benchmarks are more novel and diverse than their seed datasets, and problem difficulty and characteristics can be controlled during generation. Conclusion: InfoSynth provides a scalable, self-verifying way to generate high-quality benchmarks for evaluating LLM reasoning and code generation, addressing the low efficiency and data contamination of manual construction. Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
[28] CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns
Zhenhong Zhou,Shilinlu Yan,Chuanpu Liu,Qiankun Li,Kun Wang,Zhigang Zeng
Main category: cs.CL
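TL;DR: This paper proposes CSSBench, a safety evaluation benchmark for lightweight large language models that targets Chinese-specific safety threats, filling a gap left by English-dominated safety evaluation in Chinese scenarios.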
Details
Motivation: Malicious Chinese queries often hide their intent through Chinese-specific devices such as homophones, pinyin, and symbol-based splitting, so existing safety benchmarks cannot effectively evaluate the safety of lightweight large language models in Chinese. Method: Builds CSSBench, a Chinese-specific safety benchmark covering six domains common in Chinese scenarios (such as illegal activities, privacy leakage, and medical misinformation), with multiple task types designed to assess model safety under adversarial perturbation as well as over-refusal behavior. Result: Experiments show that Chinese-specific adversarial patterns pose a significant challenge to lightweight large language models, which commonly exhibit either insufficient safety defenses or over-refusal. Conclusion: CSSBench offers a more comprehensive and realistic safety evaluation tool for large language models in Chinese and helps support more robust model deployment. Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create a safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that the Chinese-specific adversarial pattern is a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.
[29] Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
Sumanth Balaji,Piyush Mishra,Aashraya Sachdeva,Suraj Agrawal
Main category: cs.CL
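TL;DR: This paper proposes JourneyBench, a benchmark for evaluating policy-aware agents in customer support, which generates realistic scenarios from graph representations and introduces the User Journey Coverage Score to measure policy adherence; experiments show that a dynamic-prompt agent substantially improves policy adherence.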
Details
Motivation: Existing benchmarks focus mainly on tool usage or task completion and overlook an agent's ability to adhere to multi-step policies, handle task dependencies, and cope with unpredictable behavior, so a new evaluation benchmark is needed. Method: Introduces the JourneyBench benchmark, which uses graph structures to generate diverse, realistic customer support scenarios, and designs the User Journey Coverage Score (UJC Score) to quantify policy adherence; compares a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) across 703 conversations in three domains. Result: The Dynamic-Prompt Agent (DPA) significantly improves policy adherence, even allowing smaller models such as GPT-4o-mini to outperform more capable ones such as GPT-4o. Conclusion: Structured orchestration is essential for policy adherence, and JourneyBench is a key resource for advancing AI-driven customer service beyond the limits of traditional IVR. Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
[30] Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs
Nils Rautenberg,Sven Schippkus
Main category: cs.CL
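TL;DR: Proposes a model-agnostic framework that uses repeated sampling and an LLM-as-a-judge to reduce the probability of contextual hallucinations exponentially in fixed-input tasks.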
Details
Motivation: Large language models often produce contextual hallucinations that contradict information in the prompt, which is especially problematic in automated workflows that require deterministic outputs. Existing methods struggle to give strict guarantees on error probability, so a general scheme is needed that suppresses hallucinations without modifying the model itself. Method: For tasks with a fixed input and a deterministic correctness criterion, the same prompt is issued repeatedly in independent contexts to obtain several candidate outputs; an LLM-as-a-judge identifies a correct answer among them, and majority voting over multiple independent judge calls strengthens the judge's reliability, so that the overall error rate provably decreases exponentially. Result: On controlled extraction tasks with synthetic noisy judges, the pipeline failure rate decreases exponentially with the number of repetitions and the probability of selecting a hallucinated answer decreases exponentially with the number of judges; the theoretical predictions match the experimental results exactly. Conclusion: The method gives fixed-input LLM workflows a lightweight, modular, and theoretically grounded way to drive hallucination probability arbitrarily low, without modifying model weights, decoding strategies, or prompt engineering. Abstract: Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are particularly problematic in deterministic automation workflows, where inputs are fixed and correctness is unambiguous. We introduce a simple and model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We formalize the notion of a specific task, defined by a fixed input and a deterministic correctness criterion, and show that issuing the same prompt in independent context windows yields an exponential reduction in the probability that all model outputs are incorrect. To identify a correct answer among repeated runs, we incorporate an LLM-as-a-judge and prove that the probability that the judged pipeline fails decays at a rate determined by the judge's true- and false-positive probabilities. When the judge is imperfect, we strengthen it through majority vote over independent judge calls, obtaining ensemble-level error rates that decrease exponentially in the number of votes. This yields an explicit bound on the probability that the pipeline selects a hallucinated answer. Experiments on controlled extraction tasks with synthetic noisy judges match these predictions exactly: pipeline failure decreases exponentially with the number of repetitions, and hallucination-selection decreases exponentially with the number of judges in the ensemble.
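The two exponential decays described in this abstract can be illustrated numerically. The sketch below is not the authors' code and does not reproduce their bound, which depends on the judge's true- and false-positive probabilities. It assumes independent runs with a hypothetical per-run error probability `p_wrong` and independent judge calls with a single hypothetical per-call error rate `eps`, and it uses only the standard binomial tail for the majority vote. Both function names are made up for illustration.

```python
from math import comb


def all_runs_fail(p_wrong: float, k: int) -> float:
    """Probability that k independent runs are all incorrect.

    Assumes each run is wrong with the same probability p_wrong
    (a simplifying assumption, not a quantity from the paper).
    """
    return p_wrong ** k


def majority_vote_error(eps: float, m: int) -> float:
    """Probability that a strict majority of m independent judge calls err.

    Assumes each call errs independently with probability eps and that
    m is odd, so ties cannot occur.
    """
    need = m // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(m, j) * eps ** j * (1 - eps) ** (m - j)
        for j in range(need, m + 1)
    )


if __name__ == "__main__":
    # Failure of every run shrinks geometrically with the number of runs.
    for k in (1, 3, 5):
        print(k, all_runs_fail(0.5, k))
    # Ensemble error shrinks as more (odd) judge votes are added.
    for m in (1, 3, 5, 7):
        print(m, majority_vote_error(0.2, m))
```

Under these assumptions, three runs with `p_wrong = 0.5` all fail with probability 0.125, and three judge votes with `eps = 0.2` give an ensemble error of about 0.104, down from 0.2 for a single judge.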
Together, these results provide a lightweight, modular, and theoretically grounded method for driving hallucination probabilities arbitrarily low in fixed-input LLM workflows, without modifying model weights, decoding strategies, or prompt engineering.
[31] Physio-DPO: Aligning Large Language Models with the Protein Energy Landscape to Eliminate Structural Hallucinations
QiWei Meng
Main category: cs.CL
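TL;DR: Proposes Physio-DPO, a physics-informed alignment framework that improves the thermodynamic stability of protein language models in generative protein design and reduces structural hallucinations.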
Details
Motivation: Existing alignment methods such as DPO model preferences as binary labels and ignore the continuous nature of the physical energy landscape, so generated protein structures can be unstable or unrealistic. Method: Proposes Physio-DPO, which introduces a magnitude-aware objective that scales optimization updates by the energy gap between native structures and physics-perturbed hard negatives, aligning protein language models with thermodynamic stability. Result: Experiments show that Physio-DPO outperforms baselines including SFT, PPO, and standard DPO on self-consistency RMSD (reduced to 1.28 Å) and foldability (raised to 92.8%). Conclusion: Physio-DPO effectively mitigates structural hallucinations in protein generation and produces more physically plausible structures by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks. Abstract: Large Protein Language Models have shown strong potential for generative protein design, yet they frequently produce structural hallucinations, generating sequences with high linguistic likelihood that fold into thermodynamically unstable conformations. Existing alignment approaches such as Direct Preference Optimization are limited in this setting, as they model preferences as binary labels and ignore the continuous structure of the physical energy landscape. We propose Physio-DPO, a physics-informed alignment framework that grounds protein language models in thermodynamic stability. Physio-DPO introduces a magnitude-aware objective that scales optimization updates according to the energy gap between native structures and physics-perturbed hard negatives. Experiments show that Physio-DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self-consistency RMSD to 1.28 Å and increasing foldability to 92.8%. Qualitative analysis further demonstrates that Physio-DPO effectively mitigates structural hallucinations by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks.
[32] Fast-weight Product Key Memory
Tianyu Zhao,Llion Jones
Main category: cs.CL
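TL;DR: Proposes a new architecture, Fast-weight Product Key Memory (FwPKM), which updates its parameters dynamically to provide efficient sequence modeling with potentially unbounded storage; it significantly reduces perplexity on long-context tasks and generalizes well to unseen long sequences.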
Details
Motivation: To resolve the trade-off in current sequence modeling between the high computational cost of Softmax attention and the limited storage capacity of linear variants. Method: Turns the static Product Key Memory (PKM) into a "fast-weight" episodic memory module whose parameters are updated dynamically by local chunk-level gradient descent, enabling online parameter updates at both training and inference time. Result: FwPKM significantly reduces perplexity on long-context datasets and, in Needle in a Haystack evaluations, generalizes to 128K-token contexts despite being trained only on 4K-token sequences. Conclusion: FwPKM balances storage capacity and computational efficiency; as a differentiable dynamic memory module, it complements conventional semantic memory and improves the modeling of long-range dependencies. Abstract: Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
[33] Sigmoid Head for Quality Estimation under Language Ambiguity
Tu Anh Dinh,Jan Niehues
Main category: cs.CL
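TL;DR: Proposes a module called Sigmoid Head for quality estimation on top of pre-trained language models; it addresses the limitations of softmax and of single-reference training data by using sigmoid activation and by avoiding potentially correct alternative tokens during negative sampling.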
Details
Motivation: Language model probability is not a reliable quality estimator because natural language is ambiguous: when several output options are valid, the model's probability mass is spread across them, which can misleadingly suggest low output quality. Method: Proposes an extra unembedding head (Sigmoid Head) that uses sigmoid activation, and applies a heuristic during negative sampling to avoid selecting potentially correct alternative tokens. Result: Sigmoid Head is computationally efficient during training and inference, and its probability is a notably better quality signal than that of the original softmax head. Conclusion: Because Sigmoid Head does not rely on human-annotated quality data, it is more robust in out-of-domain settings than supervised quality estimation. Abstract: Language model (LM) probability is not a reliable quality estimator, as natural language is ambiguous. When multiple output options are valid, the model's probability distribution is spread across them, which can misleadingly indicate low output quality. This issue has two causes: (1) LMs' final output activation is softmax, which does not allow multiple correct options to receive high probabilities simultaneously and (2) LMs' training data is single, one-hot encoded references, indicating that there is only one correct option at each output step. We propose training a module for Quality Estimation on top of pre-trained LMs to address these limitations. The module, called Sigmoid Head, is an extra unembedding head with sigmoid activation to tackle the first limitation. To tackle the second limitation, during the negative sampling process to train the Sigmoid Head, we use a heuristic to avoid selecting potentially alternative correct tokens. Our Sigmoid Head is computationally efficient during training and inference. The probability from Sigmoid Head is a notably better quality signal compared to the original softmax head. As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings compared to supervised QE.
[34] Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
Alphaeus Dmonte,Roland Oruche,Tharindu Ranasinghe,Marcos Zampieri,Prasad Calyam
Main category: cs.CL
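TL;DR: Evaluates the span identification performance of large language models (LLMs) on three tasks (sentiment analysis, offensive language identification, and claim verification), exploring strategies such as instruction tuning, in-context learning, and chain of thought, and finds that underlying relationships within text help LLMs locate precise text spans.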
Details
Motivation: The use of LLMs for more subjective span identification tasks, such as aspect-based sentiment analysis, is underexplored; this paper aims to fill that gap.
Method: Evaluates several LLMs on span identification in three NLP tasks, combining strategies such as instruction tuning, in-context learning, and chain of thought.
Result: Experiments show that, by exploiting underlying relationships within text, LLMs identify relevant spans more accurately, especially when contextual information and reasoning strategies are combined.
Conclusion: LLMs show promise for subjective span identification; appropriate prompting and reasoning strategies can substantially improve their performance and support model explainability.
Abstract: Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate that underlying relationships within text aid LLMs in identifying precise text spans.
[35] Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries
Jonathan Simkin,Lovedeep Gondara,Zeeshan Rizvi,Gregory Doyle,Jeff Dowden,Dan Bond,Desmond Martin,Raymond Ng
Main category: cs.CL
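TL;DR: Presents the first evaluation of adapting transformer-based NLP models (BCCRTron and GatorTron) for cancer surveillance across Canadian provinces; with lightweight fine-tuning and a privacy-preserving weight-sharing strategy, cross-jurisdiction transfer is feasible and substantially reduces missed cases.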
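The conservative OR-ensemble mentioned in this entry's abstract can be sketched as follows: a report is flagged if either model flags it, which can only lower the number of missed positives, at the possible cost of extra false positives. The labels and predictions are made-up toy values, and `or_ensemble` and `missed_positives` are hypothetical helper names rather than anything from the paper.

```python
def or_ensemble(pred_a, pred_b):
    """Flag a report as positive if either model flags it."""
    return [int(a or b) for a, b in zip(pred_a, pred_b)]

def missed_positives(labels, preds):
    """Count true positives that the predictions failed to flag."""
    return sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

# Toy data (assumed, not from the paper): 1 = cancer, 0 = non-cancer.
labels  = [1, 1, 1, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 0]
model_b = [1, 1, 0, 0, 0, 1]

combined = or_ensemble(model_a, model_b)
print("missed by A:       ", missed_positives(labels, model_a))
print("missed by B:       ", missed_positives(labels, model_b))
print("missed by ensemble:", missed_positives(labels, combined))
```

The ensemble misses a case only when both models miss it, which is why recall improves when the two models make complementary errors.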
Details
Motivation: Pathology reports are the primary information source for cancer registries, but manual abstraction is time-consuming and labor-intensive. How well existing NLP models generalize across regions remains unclear, so their cross-jurisdiction adaptability needs to be evaluated to make national cancer data integration more efficient.
Method: Uses approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) to train the Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. BCCRTron and GatorTron are fine-tuned with separate input pipelines for the synoptic and diagnosis sections, and a conservative OR-ensemble is built. Only model weights are shared, to protect privacy.
Result: On the NLCR test sets the individual models perform well, and the ensemble improves on them markedly: Tier 1 recall reaches 0.99 and missed cancers fall from 48 and 54 to 24; Tier 2 recall reaches 0.99 and missed reportable cancers fall from 54 and 46 to 33. This shows that a model pretrained in another province can be localized effectively with modest fine-tuning.
Conclusion: An ensemble that combines complementary text representations substantially reduces missed cancers and improves error coverage. The proposed privacy-preserving weight-sharing mechanism supports interoperable national NLP infrastructure and lays the groundwork for a future pan-Canadian foundation model for cancer pathology.
Abstract: Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, achieving a Tier 1 recall of 0.99 and reducing missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
cs.CV [Back]
[36] TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
Yabo Chen,Yuanzhi Liang,Jiepeng Wang,Tingxi Chen,Junfei Cheng,Zixiao Gu,Yuyang Huang,Zicheng Jiang,Wei Li,Tian Li,Weichen Li,Zuoxin Li,Guangce Liu,Jialun Liu,Junqi Liu,Haoyuan Wang,Qizhen Weng,Xuan'er Wu,Xunzhi Xiang,Xiaoyan Yang,Xin Zhang,Shiwen Zhang,Junyu Zhou,Chengcheng Zhou,Haibin Huang,Chi Zhang,Xuelong Li
Main category: cs.CV
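TL;DR: TeleWorld is a real-time multimodal 4D world modeling framework that achieves long-term consistency and low-latency generation of dynamic scenes through a generation-reconstruction-guidance paradigm and hierarchical planning.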
Details
Motivation: Existing video generation models are limited in real-time interaction, long-horizon consistency, and memory of dynamic scenes, which keeps them from serving as practical world models.
Method: Proposes the TeleWorld framework, which follows a generation-reconstruction-guidance paradigm and combines an autoregressive diffusion video model, Macro-from-Micro Planning (MMPL), and Distribution Matching Distillation (DMD) to model 4D dynamics within a closed-loop system.
Result: Performs strongly in static and dynamic scene understanding, long-horizon consistency, and real-time generation efficiency, and supports unified modeling of dynamic objects and static scenes.
Conclusion: TeleWorld advances practical world models that are memory-enabled, interactive, and computationally efficient, offering a viable path for multimodal generation and embodied intelligence.
Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from frame-level to segment-level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
[37] It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
Anne Harrington,A. Sophia Koepke,Shyamgopal Karthik,Trevor Darrell,Alexei A. Efros
Main category: cs.CV
TL;DR: Proposes noise optimization to mitigate mode collapse in text-to-image models, improving the diversity and quality of generated images.
Details
Motivation: Existing text-to-image models suffer from severe mode collapse, so images generated from the same prompt lack diversity.
Method: Uses a simple noise optimization objective, analyzes the frequency characteristics of the noise, and studies how noise initializations with different frequency profiles affect optimization and search.
Result: Experiments show that noise optimization effectively reduces mode collapse and clearly improves the quality and diversity of generations while preserving the fidelity of the base model.
Conclusion: Noise optimization is an effective and practical way to increase the generation diversity of text-to-image models.
Abstract: Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.
[38] Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
Pan Wang,Yang Liu,Guile Wu,Eduardo R. Corral-Soto,Chengjie Huang,Binbin Xu,Dongfeng Bai,Xu Yan,Yuan Ren,Xingxin Chen,Yizhe Wu,Tao Huang,Wenjun Wan,Xin Wu,Pei Zhou,Xuyang Dai,Kangbo Lv,Hongbo Zhang,Yosef Fried,Aixue Ye,Bailan Feng,Zhenyu Chen,Zhen Li,Yingcong Chen,Yiyi Liao,Bingbing Liu
Main category: cs.CV
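TL;DR: Introduces Spatial4D-Bench, a large-scale, multi-task 4D spatial intelligence benchmark for comprehensively evaluating the spatiotemporal reasoning of multimodal large language models (MLLMs), and reveals the limitations of existing models in route planning, action recognition, and physical plausibility reasoning.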
Details
Motivation: To explore how far MLLMs can go toward human-level 4D spatial intelligence, and to address the small scale and limited diversity of existing spatial intelligence benchmarks.
Method: Builds Spatial4D-Bench, a large benchmark of about 40,000 question-answer pairs covering 18 tasks, and systematically organizes the tasks into six cognitive categories to assess the 4D spatial cognition of MLLMs comprehensively.
Result: Evaluation of several state-of-the-art open-source and proprietary MLLMs shows limited performance across many 4D spatial reasoning tasks, with clear weaknesses in route planning, action recognition, and physical plausibility judgment.
Conclusion: Spatial4D-Bench provides a useful tool and insights for evaluating MLLMs and advancing them toward human-level 4D spatial intelligence.
Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
[39] A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data
Hyunho Lee,Wenwen Li
Main category: cs.CV
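TL;DR: Proposes SMAGNet, a multimodal deep learning model that uses SAR data as the primary input and MSI data as a complement, fusing features for post-flood water extent mapping, with greater robustness to missing data and potential for practical use.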
Details
Motivation: The fusion of SAR and MSI data for post-flood water extent mapping has research gaps; in particular, adaptive integration when MSI data are incomplete or missing has not been fully explored.
Method: Proposes the Spatially Masked Adaptive Gated Network (SMAGNet), which takes SAR data as the primary input, brings in MSI data through feature-level fusion, and includes a mechanism for handling partially or completely missing MSI data.
Result: Experiments on the C2S-MS Floods dataset show that SMAGNet outperforms other multimodal models at every level of MSI data availability; even when MSI data are completely missing, its performance remains comparable to that of a U-Net that uses SAR only.
Conclusion: SMAGNet improves the accuracy of multimodal models for flood mapping and their robustness to missing data, advancing their use in real-world flood management.
Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and MSI data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.
[40] Compressed Map Priors for 3D Perception
Brady Zhou,Philipp Krähenbühl
Main category: cs.CV
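TL;DR: Proposes Compressed Map Priors (CMP), a framework that learns spatial priors from historic traversals and significantly improves 3D object detection, with low storage overhead and easy integration into existing systems.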
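The storage figures quoted in this entry (32 KB per km² and a 20× reduction relative to dense storage) imply a dense baseline that can be worked out directly. The sketch below does only that arithmetic; the 100 km² deployment area is a hypothetical example, and the conversion of 1 MB = 1024 KB is an assumption.

```python
PRIOR_KB_PER_KM2 = 32   # stated in the abstract
REDUCTION_FACTOR = 20   # stated in the abstract

# Dense storage implied by the two stated figures.
dense_kb_per_km2 = PRIOR_KB_PER_KM2 * REDUCTION_FACTOR

def prior_size_mb(area_km2):
    """Compressed prior size in MB for a given area (assumes 1 MB = 1024 KB)."""
    return area_km2 * PRIOR_KB_PER_KM2 / 1024

area = 100  # hypothetical deployment area in km^2
print("implied dense storage:", dense_kb_per_km2, "KB/km^2")
print("compressed prior for", area, "km^2:", prior_size_mb(area), "MB")
```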
Details
Motivation: Existing autonomous driving vision systems usually ignore historic traversal data and treat every location as if visited for the first time, failing to exploit information from repeated passes through the same area.
Method: Designs the Compressed Map Priors (CMP) framework, based on a binarized hashmap, which stores and uses spatial priors from historic traversals at very low storage cost (32 KB/km²) and integrates them into mainstream 3D perception systems.
Result: On the nuScenes dataset, several 3D detection architectures show significant and consistent improvements, while storage requirements are 20× lower than with dense storage.
Conclusion: CMP is an efficient, practical way to strengthen the perception of autonomous driving systems using historic traversal data, with good prospects for application.
Abstract: Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems. The vast majority of the deployment area of an autonomous vision system will have been visited before. Yet, most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework to learn spatial priors from historic traversals. The map priors use a binarized hashmap that requires only $32\text{KB}/\text{km}^2$, a $20\times$ reduction compared to the dense storage. Compressed Map Priors easily integrate into leading 3D perception systems at little to no extra computational cost, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.
[41] Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection
Lawrence Han
Main category: cs.CV
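TL;DR: Proposes GLASS, an architecture that combines a globally resized view with multiple randomly sampled local crops to detect AI-generated images, preserving high-resolution detail and improving detection performance.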
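The two ingredients this entry names, spatially stratified sampling of original-resolution crops and attention-based aggregation, might look roughly like the sketch below. It is an illustration under stated assumptions, not the GLASS implementation: the grid partition, the crop size, and the helper names `stratified_crop_origins` and `attention_pool` are all hypothetical, and the per-crop features and scores would come from a vision backbone that is omitted here.

```python
import math
import random

def stratified_crop_origins(height, width, crop, grid, rng):
    """Pick one crop origin per cell of a grid x grid partition of the image.

    Assumes height >= crop and width >= crop.
    """
    origins = []
    cell_h, cell_w = height / grid, width / grid
    for gy in range(grid):
        for gx in range(grid):
            # Sample a top-left corner inside the cell, then clamp it so
            # the crop stays inside the image.
            y = int(gy * cell_h + rng.random() * cell_h)
            x = int(gx * cell_w + rng.random() * cell_w)
            origins.append((min(y, height - crop), min(x, width - crop)))
    return origins

def attention_pool(features, scores):
    """Softmax the per-crop scores and return the weighted sum of features."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

origins = stratified_crop_origins(2048, 3072, crop=224, grid=3, rng=random.Random(0))
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(len(origins), "crop origins;", "pooled feature:", pooled)
```

Stratifying by grid cell spreads the crops over the whole image instead of letting random sampling cluster them in one area, and the attention weights let the more informative crops dominate the pooled representation.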
Details
Motivation: Most AI-generated image detection models downsample images before input, which may discard subtle forgery traces, so a detection method that preserves original-resolution detail is needed.
Method: GLASS uses a global-local attention mechanism: it selects original-resolution local regions through spatially stratified sampling, aggregates their features with attention, and can be integrated into any vision model to process images of any size.
Result: Experiments with Vision Transformer, ResNet, and ConvNeXt backbones show that GLASS achieves better detection performance than standard transfer learning within feasible computational constraints.
Conclusion: GLASS effectively fuses global and local information, significantly improving detection accuracy for AI-generated images without sacrificing computational efficiency, and applies to a range of models and image sizes.
Abstract: The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures downsample images before inputting them into models, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.
[42] FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications
Yehui Yang,Dalu Yang,Wenshuo Zhou,Fangxin Shang,Yifan Liu,Jie Ren,Haojun Fei,Qing Yang,Tao Chen
Main category: cs.CV
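TL;DR: Introduces FCMBench-V1.0, a large-scale multimodal benchmark for financial credit scenarios covering 18 certificate types, 4,043 privacy-compliant images, and 8,446 QA samples. It evaluates perception, reasoning, and robustness, and uses a closed synthesis-capture pipeline to keep the data compliant and realistic. Experiments show that existing vision-language models still suffer noticeable performance drops on the benchmark.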
Details
Motivation: Existing multimodal AI benchmarks do not cover the documents, understanding requirements, and real-world robustness specific to financial credit, and struggle to combine privacy compliance with practical utility; a dedicated benchmark close to real applications is urgently needed.
Method: Builds an evaluation framework with three dimensions (perception, reasoning, and robustness), comprising 3 foundational perception tasks, 4 credit-specific reasoning tasks, and 10 real-world acquisition artifact types. Data are produced through a closed synthesis-capture pipeline: document templates with virtual content are designed manually and photographed in a controlled environment, which avoids data leakage and ensures privacy compliance.
Result: In extensive experiments on 23 state-of-the-art vision-language models, Gemini 3 Pro achieves the highest F1 score among commercial models (64.61), Qwen3-VL-235B is the best open-source baseline (57.27), and the proposed financial credit-specific model Qfin-VL-Instruct is best overall (64.92). Robustness tests show that even top models degrade noticeably under acquisition artifacts.
Conclusion: FCMBench-V1.0 fills the gap in multimodal evaluation for financial credit, distinguishes the performance differences and robustness weaknesses of VLMs in realistic credit scenarios, and promotes safe, reliable multimodal AI for vertical domains.
Abstract: As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects documents and workflows specific to financial credit applications, (2) includes credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0, a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework consists of three dimensions: Perception, Reasoning, and Robustness, including 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench can effectively discriminate performance disparities and robustness across modern vision-language models. Extensive experiments were conducted on 23 state-of-the-art vision-language models (VLMs) from 14 top AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 (%) score as a commercial model (64.61), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.
[43] Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions
Kaiwen Zheng,Junchen Fu,Songpei Xu,Yaoqing He,Joemon M. Jose,Han Hu,Xuri Ge
Main category: cs.CV
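TL;DR: Proposes a new task of generating and recognizing multi-attribute natural language descriptions for face regions (FaceFocalDesc); with a new dataset and the Focal-RegionFace model, it achieves better performance on fine-grained facial analysis.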
Details
Motivation: Existing facial analysis methods generally cannot produce multi-attribute natural language descriptions (e.g., facial action units, emotional states, age) for arbitrarily selected face regions, which limits how finely faces can be understood and controlled. The paper addresses this underexplored problem.
Method: Builds a new dataset with rich region-level annotations and natural language descriptions, and proposes Focal-RegionFace, based on Qwen2.5-VL, which refines its focus on localized facial features through multi-stage progressive fine-tuning to deliver interpretable age estimation, facial action unit detection, and emotion detection.
Result: Experiments show that Focal-RegionFace achieves the best performance on the new benchmark under both traditional and newly proposed metrics.
Conclusion: The method is effective and versatile for fine-grained, multi-attribute, region-focused face analysis, confirming the value of attending to local regions for facial understanding.
Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, AU detection, and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.
[44] DichroGAN: Towards Restoration of in-air Colours of Seafloor from Satellite Imagery
Salma Gonzalez-Sabbagh,Antonio Robles-Kelly,Shang Gao
Main category: cs.CV
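TL;DR: Proposes DichroGAN, a conditional generative adversarial network trained in two steps to restore the in-air colours of the seafloor from satellite imagery; it estimates diffuse reflection, specular reflection, and underwater light transmission to remove the effects of absorption and scattering in the water column.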
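This entry refers to an underwater image formation equation without stating it. A commonly used simplified form is I = J·t + B·(1 − t), where I is the observed value, J the in-air value, t the transmission, and B the backscattered (veiling) light. The sketch below inverts that simplified model per channel; it is an assumption about the general family of models, not the equation or the learned generators used in DichroGAN, and all numbers are made up.

```python
def observe_underwater(in_air, transmission, backscatter):
    """Forward model (assumed simplified form): I = J*t + B*(1 - t) per channel."""
    return [j * t + b * (1.0 - t)
            for j, t, b in zip(in_air, transmission, backscatter)]

def restore_in_air(observed, transmission, backscatter, t_min=0.1):
    """Invert the forward model: J = (I - B*(1 - t)) / t per channel.

    t_min is a hypothetical floor that avoids dividing by a near-zero
    transmission in strongly attenuated channels.
    """
    restored = []
    for i, t, b in zip(observed, transmission, backscatter):
        t = max(t, t_min)
        restored.append((i - b * (1.0 - t)) / t)
    return restored

# Made-up RGB values in [0, 1]; red attenuates most strongly in water.
true_colour = [0.8, 0.6, 0.4]
transmission = [0.5, 0.7, 0.9]
backscatter = [0.1, 0.3, 0.5]

observed = observe_underwater(true_colour, transmission, backscatter)
restored = restore_in_air(observed, transmission, backscatter)
print("observed:", [round(v, 3) for v in observed])
print("restored:", [round(v, 3) for v in restored])
```

In the paper the transmission and scene radiance are estimated by generators rather than given, so this closed-form inversion only shows what those estimates would be used for.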
Details
Motivation: Because light attenuates exponentially with depth in the water column, recovering the true in-air colours of the seafloor from satellite imagery is very challenging.
Method: Uses a conditional generative adversarial network (cGAN) with two-step simultaneous training: the first two generators estimate diffuse and specular reflections from a hyperspectral image cube to obtain atmospheric scene radiance; the third and fourth generators handle scene radiance and underwater light transmission, respectively, and the in-air colours are restored according to the underwater image formation equation.
Result: Trained on a small dataset built from PRISMA satellite imagery and tested on satellite and underwater datasets, DichroGAN performs better than or on par with state-of-the-art underwater image restoration techniques.
Conclusion: DichroGAN effectively restores the in-air colours of the seafloor, offering a new approach to seafloor observation by satellite remote sensing.
Abstract: Recovering the in-air colours of seafloor from satellite imagery is a challenging task due to the exponential attenuation of light with depth in the water column. In this study, we present DichroGAN, a conditional generative adversarial network (cGAN) designed for this purpose. DichroGAN employs a two-step simultaneous training: first, two generators utilise a hyperspectral image cube to estimate diffuse and specular reflections, thereby obtaining atmospheric scene radiance. Next, a third generator receives as input the generated scene radiance containing the features of each spectral band, while a fourth generator estimates the underwater light transmission. These generators work together to remove the effects of light absorption and scattering, restoring the in-air colours of seafloor based on the underwater image formation equation. DichroGAN is trained on a compact dataset derived from PRISMA satellite imagery, comprising RGB images paired with their corresponding spectral bands and masks. Extensive experiments on both satellite and underwater datasets demonstrate that DichroGAN achieves competitive performance compared to state-of-the-art underwater restoration techniques.
[45] MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Xiaokun Sun,Zeyu Cai,Hao Tang,Ying Tai,Jian Yang,Zhenyu Zhang
Main category: cs.CV
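TL;DR: MorphAny3D is a training-free 3D morphing framework that uses structured latent representations and modified attention mechanisms to produce high-quality, semantically consistent, and temporally smooth 3D morphs, including across categories.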
Details
Motivation: Existing 3D morphing methods struggle to generate semantically consistent and temporally continuous deformations, especially across categories, and lack effective means of controlling structural and temporal consistency.
Method: Proposes the MorphAny3D framework, which introduces Structured Latent (SLAT) representations, Morphing Cross-Attention (MCA) to fuse source and target features for structural coherence, and Temporal-Fused Self-Attention (TFSA) to strengthen temporal consistency; an orientation correction strategy mitigates pose ambiguity.
Result: Experiments show that the method generates state-of-the-art 3D morphing sequences across varied scenarios, performs especially well on cross-category cases, and supports extensions such as decoupled morphing and 3D style transfer.
Conclusion: By combining SLAT representations with new attention mechanisms, MorphAny3D achieves high-quality, training-free 3D morphing with good generalization and practical potential.
Abstract: 3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/.
[46] CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting
Md Ahmed Al Muzaddid,William J. Beksi
Main category: cs.CV
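TL;DR: Proposes a 3D instance segmentation framework based on multi-view 2D images and NeRF for accurate crop counting, overcoming occlusion and the difficulty of separating clustered crops.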
Details
Motivation: In outdoor environments, ambiguity caused by partial occlusion and clustered crops makes accurate counting difficult for conventional image segmentation methods.
Method: Generates independent instance masks from multi-view 2D images, uses a neural radiance field (NeRF) for view synthesis, and introduces crop visibility and mask consistency scores that are combined with 3D information to obtain accurate 3D instance segmentation.
Result: Validated on three agricultural datasets (cotton, apples, and pears), the method counts consistently across crops of different color, shape, and size, needs no crop-specific parameter tuning, and outperforms state-of-the-art methods.
Conclusion: The framework delivers highly accurate crop counts, generalizes well, reduces reliance on manual intervention, and advances automated agricultural management.
Abstract: Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.
[47] IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation
Han Liu,Yubo Fan,Hao Li,Dewei Hu,Daniel Moyer,Zhoubing Xu,Benoit M. Dawant,Ipek Oguz
Main category: cs.CV
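TL;DR: Proposes IntraStyler, an exemplar-guided style synthesis method that captures diverse intra-domain styles without prior knowledge; it extracts style-only features through contrastive learning and is validated on the CrossMoDA 2023 dataset for controllable style synthesis and downstream segmentation.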
Details
Motivation: Existing unsupervised domain adaptation methods focus mainly on the shift between source and target domains and leave intra-domain variability under-explored; earlier methods also require intra-domain variations to be pre-specified for style synthesis, which is impractical.
Method: Proposes IntraStyler, which uses an exemplar image to guide style synthesis so that the output matches the exemplar's style, and introduces a contrastive-learning style encoder that learns style-only features discriminatively, enabling diverse style generation without prior knowledge.
Result: Experiments on the CrossMoDA 2023 dataset demonstrate controllable style synthesis and show that diverse synthetic data improve downstream segmentation.
Conclusion: IntraStyler captures and generates diverse intra-domain styles without any prior information, improving image-level domain alignment and segmentation performance in cross-modality domain adaptation.
Abstract: Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.
[48] From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning
Omar Sharif,Eftekhar Hossain,Patrick Ng
Main category: cs.CV
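TL;DR: Studies how reward-driven reinforcement learning can improve multimodal large language models on visual reasoning tasks, particularly where reasoning poorly integrates visual information. With six reward functions targeting different aspects of reasoning and group relative policy optimization (GRPO), the approach unlocks long visual reasoning chains and yields significant, consistent gains on Qwen-2.5-VL-7B.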
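The "group relative" part of GRPO named in this entry can be sketched in a few lines: several responses are sampled for the same prompt, and each response's reward is standardized against the group's mean and standard deviation instead of against a learned value function. The rewards below are made-up totals, the use of the population standard deviation and the `eps` term are assumptions, and the six reward functions from the paper are not reproduced.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    eps is a small constant (assumed) that avoids division by zero when
    every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical total rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Responses that beat the group average get a positive advantage and are reinforced, those below it get a negative one, and a group with identical rewards contributes no learning signal.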
Details
Motivation: Existing multimodal large language models often neglect visual information when generating reasoning chains, so they perform poorly on tasks that need accurate visual perception, such as visual puzzles. The paper targets this key bottleneck.
Method: Proposes a reinforcement-learning approach with six reward functions covering image understanding, thinking steps, and answer accuracy, and uses group relative policy optimization (GRPO) to encourage longer, structured reasoning and to discourage bypassing visual information.
Result: Achieves a 5.56% improvement over the base model on Qwen-2.5-VL-7B, with consistent gains in both in-domain and out-of-domain settings; converting images into textual descriptions improves Claude 3.5 and Claude 3.7 by 26.7% and 23.6%, respectively.
Conclusion: Reward-driven reinforcement learning effectively promotes long visual reasoning in open-source multimodal large language models and substantially improves visual perception and reasoning without costly supervision.
Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B show a 5.56% improvement over the base model, with consistent gains across both in-domain and out-of-domain settings.
[49] LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization
Jie Li,Kwan-Yee K. Wong,Kai Han
Main category: cs.CV
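TL;DR: Proposes LooC, a new vector quantization method that uses a low-dimensional compositional codebook and an extrapolation-by-interpolation mechanism to achieve high performance and high capacity while staying compact, clearly outperforming existing methods.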
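The idea of treating codevectors as low-dimensional units that are combined within one feature vector can be illustrated with a toy quantizer. This is a rough sketch under assumptions, not the LooC algorithm: it simply splits a feature into fixed-size chunks and matches each chunk to the nearest entry of one shared codebook, and it omits the extrapolation-by-interpolation mechanism entirely. The codebook, feature, and function names are hypothetical.

```python
def nearest(codebook, chunk):
    """Index of the codevector closest to the chunk (squared Euclidean distance)."""
    def dist2(code):
        return sum((c - x) ** 2 for c, x in zip(code, chunk))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))

def compositional_quantize(feature, codebook):
    """Quantize a feature chunk by chunk against one shared low-dim codebook."""
    d = len(codebook[0])
    assert len(feature) % d == 0, "feature length must be a multiple of the code dim"
    indices = [nearest(codebook, feature[i:i + d]) for i in range(0, len(feature), d)]
    quantized = [v for idx in indices for v in codebook[idx]]
    return indices, quantized

# Toy 2-D codebook with 4 entries, applied to a 6-D feature (3 chunks).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature = [0.9, 0.1, 0.2, 0.8, 1.1, 0.9]

indices, quantized = compositional_quantize(feature, codebook)
capacity = len(codebook) ** (len(feature) // len(codebook[0]))
print(indices, quantized, capacity)
```

With only 4 stored codevectors, the three chunks can jointly express 4³ = 64 distinct quantized features, which is the sense in which a compact codebook can still have high capacity.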
Details
Motivation: As data and models grow more complex, conventional vector quantization methods face a conflict between capacity and compactness, so a method that raises capacity while staying compact is urgently needed.
Method: LooC reframes the relationship between codevectors and feature vectors, building a parameter-efficient codebook from low-dimensional compositional units, and adds a parameter-free extrapolation-by-interpolation mechanism to enhance and smooth feature approximation.
Result: LooC achieves full codebook usage and avoids codebook collapse, outperforms existing methods across multiple tasks, datasets, and architectures, and reaches state-of-the-art performance with a smaller codebook.
Conclusion: LooC resolves the conflict between high capacity and compactness and can serve as a plug-and-play module for downstream tasks based on vector quantization, giving it broad applicability.
Abstract: Vector quantization (VQ) is a prevalent and fundamental technique that discretizes continuous feature vectors by approximating them using a codebook. As the diversity and complexity of data and models continue to increase, there is an urgent need for high-capacity, yet more compact VQ methods. This paper aims to reconcile this conflict by presenting a new approach called LooC, which utilizes an effective Low-dimensional codebook for Compositional vector quantization. Firstly, LooC introduces a parameter-efficient codebook by reframing the relationship between codevectors and feature vectors, significantly expanding its solution space. Instead of individually matching codevectors with feature vectors, LooC treats them as lower-dimensional compositional units within feature vectors and combines them, resulting in a more compact codebook with improved performance. Secondly, LooC incorporates a parameter-free extrapolation-by-interpolation mechanism to enhance and smooth features during the VQ process, which allows for better preservation of details and fidelity in feature approximation. The design of LooC leads to full codebook usage, effectively utilizing the compact codebook while avoiding the problem of collapse. Thirdly, LooC can serve as a plug-and-play module for existing methods for different downstream tasks based on VQ. Finally, extensive evaluations on different tasks, datasets, and architectures demonstrate that LooC outperforms existing VQ methods, achieving state-of-the-art performance with a significantly smaller codebook.
[50] Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions
Aobo Li,Jinjian Wu,Yongxu Liu,Leida Li,Weisheng Dong
Main category: cs.CV
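TL;DR: Proposes SynDR-IQA, a new framework that reshapes synthetic data distributions to improve the generalization of blind image quality assessment models, using diversity upsampling and redundant-cluster downsampling to improve feature representations.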
Details
Motivation: BIQA models trained on existing synthetic datasets generalize poorly; their feature representations are discrete and clustered, which hurts regression performance.
Method: Proposes the SynDR-IQA framework, which applies distribution-aware diverse content upsampling and density-aware redundant cluster downsampling, grounded in a theoretical analysis of how sample diversity and redundancy affect generalization error, to reshape the synthetic data distribution.
Result: Extensive experiments across three cross-dataset settings show that the method clearly improves generalization performance.
Conclusion: The distribution of synthetic data is a key factor limiting BIQA generalization. SynDR-IQA mitigates the problem by adjusting the data distribution and offers a new direction for training high-quality BIQA models on synthetic data.
Abstract: Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of sample diversity and redundancy's impact on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. The code is available at https://github.com/Li-aobo/SynDR-IQA.
[51] Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection
Chao Yang,Haoyuan Zheng,Yue Ma
Main category: cs.CV
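TL;DR: Proposes a cross-modal data augmentation framework based on CycleGAN and YOLOv8 that generates pseudo-infrared images through unpaired image translation, easing the scarcity of infrared data in PCB defect detection.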
Details
Motivation: Scarce infrared data limits PCB defect detection performance. Existing methods rely on paired data or a small number of real infrared samples and cannot meet deep learning's demand for large-scale labeled data.
Method: Uses CycleGAN for unpaired visible-to-infrared translation, generating pseudo-infrared images that preserve the structural semantics of defects and their thermal distribution characteristics; combines real infrared data with the generated data in a heterogeneous training strategy to train a lightweight YOLOv8 detector.
Result: Experiments show that the method clearly improves feature learning under low-data conditions; detection performance far exceeds that of models trained only on real infrared data and approaches the fully supervised benchmark.
Conclusion: Pseudo-infrared image generation is an effective data augmentation strategy that can substantially ease the cross-modal data bottleneck in industrial inspection and has practical potential.
Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
[52] Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture
Anirudha Ghosh,Ritam Sarkar,Debaditya Barman
Main category: cs.CV
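TL;DR: Proposes a lightweight pest detection and pesticide recommendation framework for low-resource devices. It combines a lightweight CNN with meta-learning for few-shot pest recognition and uses environmental factors to recommend eco-friendly pesticides; experiments show it is efficient, accurate, and suited to precision agriculture.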
Details
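Motivation: Traditional pest management relies on manual field inspection and chemical pesticides, which is costly, time-consuming, labor-intensive, and harmful to the environment. A low-cost, efficient, and sustainable alternative is needed, especially for smallholder farmers with limited resources.
Method: Builds a lightweight framework with a pest detection module and a pesticide recommendation module. The detection module uses a compact convolutional neural network (CNN) combined with prototypical meta-learning for accurate recognition from few samples; the recommendation module uses environmental factors such as crop type and growth stage to recommend safe, eco-friendly pesticides. A diverse pest image dataset, assembled from multiple public datasets, is used for training and evaluation.
Result: The proposed lightweight CNN matches the accuracy of state-of-the-art models while substantially reducing computational complexity; the system generalizes well across viewing angles, pest sizes, and background conditions; the decision support system reduces dependence on chemical pesticides and supports sustainable farming practices.
Conclusion: The framework keeps detection accuracy high at low computational cost, suits edge devices such as smartphones and drones, and gives smallholder farmers a workable intelligent pest management solution that can help advance precision agriculture and sustainable development.
Abstract: Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers. The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to suggest safe and eco-friendly pesticides. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization.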
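As a rough illustration of the prototypical meta-learning step described above, the sketch below classifies a query embedding by its distance to class prototypes, where each prototype is the mean of that class's few support embeddings. The 2-D embeddings, the class names, and the helper names `class_prototypes` and `classify` are all hypothetical; in the described framework the embeddings would come from the lightweight CNN, whose architecture the abstract does not specify.

```python
from collections import defaultdict
from math import dist


def class_prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    grouped = defaultdict(list)
    for label, embedding in support:
        grouped[label].append(embedding)
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query, prototypes):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))


# Toy 2-way, 2-shot episode with invented embeddings (illustrative only).
support = [
    ("aphid", [0.9, 0.1]), ("aphid", [1.1, -0.1]),
    ("borer", [-1.0, 0.8]), ("borer", [-0.8, 1.2]),
]
prototypes = class_prototypes(support)
predicted = classify([0.7, 0.2], prototypes)
```

Because a new pest class only needs a new prototype, this style of classifier can be extended from a handful of labeled images without retraining the embedding network, which is one plausible reason it suits the few-shot setting.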
Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture.
[53] TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models
Kohei Yamamoto,Tomohiro Kikuchi
Main category: cs.CV
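TL;DR: This paper proposes TotalFM, a radiological foundation model that uses an organ-separation strategy to efficiently learn the correspondence between 3D-CT images and text, achieving strong zero-shot classification and report generation on a large-scale dataset.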
Details
Motivation: Training foundation models on 3D-CT volumetric data is computationally expensive, and images must be linked effectively to clinical text. A radiological foundation model that balances computational efficiency and representation capability is therefore needed.
Method: Proposes TotalFM, built on an organ-separated learning framework. Segmentation techniques and a large language model are used to automatically build a dataset of organ volume and finding-sentence pairs (140,000 series); training combines VideoMAE self-supervised pre-training with contrastive learning on volume-text pairs.
Result: In zero-shot organ-wise lesion classification, F1 scores exceed CT-CLIP's in 83% of organs and Merlin's in 64%; in finding-wise classification, AUROC exceeds Merlin's in 83% of categories; report generation performance is comparable to existing vision-language models.
Conclusion: The organ-separated learning framework improves generalization while remaining computationally efficient, offering a feasible and efficient design paradigm for deploying 3D-CT foundation models.
Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks.
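The win rates quoted above follow directly from the counts given in parentheses. The short check below recomputes them; the dictionary keys and the helper name `win_rate_percent` are illustrative labels, not terms from the paper.

```python
def win_rate_percent(wins, total):
    """Share of categories won, as a percentage rounded to the nearest integer."""
    return round(100 * wins / total)


# Counts as reported in the abstract: (categories won, categories compared).
reported = {
    "F1 vs CT-CLIP (organs)": (5, 6),
    "F1 vs Merlin (organs)": (9, 14),
    "AUROC vs Merlin (findings)": (25, 30),
}
rates = {name: win_rate_percent(wins, total) for name, (wins, total) in reported.items()}
```

Note that these figures count how often the model wins a category, not by how much, so two comparisons with the same percentage (5/6 and 25/30) rest on very different numbers of categories.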
Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
[54] S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
He Wang,Longteng Guo,Pengkang Huo,Xuanxu Lin,Yichen Yuan,Jie Jiang,Jing Liu
Main category: cs.CV
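TL;DR: S1-MMAlign is a large-scale, multi-disciplinary dataset of scientific image-text pairs containing 15.5 million high-quality samples, with semantic alignment enhanced by Qwen-VL models to raise data quality for scientific multimodal learning.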
Details
Motivation: A severe semantic gap between scientific imagery and sparse textual descriptions limits the use of multimodal learning in scientific discovery.
Method: Builds a dataset of more than 15.5 million image-text pairs from 2.5 million open-access papers, and uses the Qwen-VL multimodal large models, together with abstracts and citation contexts, to generate more accurate image descriptions.
Result: Pseudo-perplexity metrics show reduced semantic ambiguity, and CLIP scores indicate an 18.21% improvement in image-text alignment.
Conclusion: S1-MMAlign provides a high-quality, strongly aligned multimodal resource for AI-driven scientific discovery.
Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
[55] ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching
Yi Sun,Xinhao Zhong,Hongyan Li,Yimin Zhou,Junhao Li,Bin Chen,Xuan Wang
Main category: cs.CV
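TL;DR: Proposes ActErase, an efficient training-free concept erasure method that uses activation-difference analysis to remove sensitive concepts from text-to-image diffusion models while preserving overall generative capability.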
Details
Motivation: Existing concept erasure methods rely on data-intensive, computationally expensive fine-tuning, which limits their use; a more efficient, lightweight method is needed to address safety, copyright, and ethical concerns.
Method: Identifies activation difference regions through prompt-pair analysis, extracts the target concept's activations, and dynamically replaces input activations during the forward pass, all without additional training.
Result: Achieves state-of-the-art performance on three key tasks (nudity, artistic style, and object removal), removing the target concept while preserving generation quality, and shows strong robustness to adversarial attacks.
Conclusion: ActErase offers a new plug-and-play paradigm for concept manipulation in diffusion models that is both lightweight and effective, with broad application prospects.
Abstract: Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
[56] FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering
Chaodong Tong,Qi Zhang,Chen Li,Lei Jiang,Yanbing Liu
Main category: cs.CV
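TL;DR: This paper proposes FaithSCAN, a lightweight method that detects faithfulness hallucinations in VQA from internal signals of vision-language models (decoding uncertainty, intermediate visual representations, and cross-modal alignment features). It fuses the signals with branch-wise evidence encoding and uncertainty-aware attention, uses a low-cost strategy to generate supervision automatically, and significantly outperforms existing methods on multiple benchmarks.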
Details
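Motivation: Existing VQA hallucination detection methods have high computational overhead, depend on external resources, or fail to exploit the model's internal signals, which limits their efficiency, robustness, and detection performance. A more efficient and accurate detector is needed.
Method: Proposes FaithSCAN, which takes the VLM's token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features as internal signals and fuses them through branch-wise evidence encoding and an uncertainty-aware attention mechanism. It also extends the LLM-as-a-Judge paradigm with a low-cost strategy that automatically generates model-dependent supervision signals for training.
Result: Experiments show that FaithSCAN significantly outperforms existing hallucination detection methods on multiple VQA benchmarks in both efficiency and accuracy. Analysis shows that hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding, that different internal signals give complementary diagnostic cues, and that hallucination patterns differ across VLM architectures.
Conclusion: By exploiting multiple internal signals of the VLM, FaithSCAN detects hallucinations efficiently and accurately and can be trained without external resources. It offers a new view on the causes of multimodal hallucination and supports more reliable VLMs in safety-critical applications.
Abstract: Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency.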
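One of the internal signals named above, token-level decoding uncertainty, is often measured as the entropy of the next-token distribution. The sketch below assumes that measure; the abstract does not say which uncertainty statistic FaithSCAN actually uses, and the logits are invented for illustration.

```python
from math import exp, log


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    weights = [exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * log(p) for p in softmax(logits) if p > 0.0)


# Invented logits over a 4-token vocabulary at two decoding steps.
confident_step = [8.0, 0.0, 0.0, 0.0]   # one token dominates: low entropy
uncertain_step = [1.0, 1.0, 1.0, 1.0]   # uniform distribution: maximal entropy
entropies = [token_entropy(confident_step), token_entropy(uncertain_step)]
```

A per-token sequence of such values is available from a single forward pass, which fits the single-pass framing in the paper's title, though how the values are encoded and fused is specific to the method and not reproduced here.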
In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
[57] Disentangling Hardness from Noise: An Uncertainty-Driven Model-Agnostic Framework for Long-Tailed Remote Sensing Classification
Chi Ding,Junxiao Xue,Xinyi Yin,Shi Chen,Yunyun Shi,Yiduo Wang,Fengjian Xue,Xuecheng Wu
Main category: cs.CV
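TL;DR: Proposes DUAL, a model-agnostic framework based on evidential deep learning that decomposes prediction uncertainty into epistemic and aleatoric uncertainty to tell hard samples from noisy data, improving performance on long-tailed distributions in remote sensing.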
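To make the decomposition in the summary above concrete, the sketch below splits the predictive uncertainty of a Dirichlet distribution (the output form typically used in evidential deep learning) into an aleatoric part (expected entropy) and an epistemic part (mutual information). This is a common textbook decomposition and an assumption here: this entry does not give DUAL's own formulas, and the evidence values are invented.

```python
from math import log


def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:                      # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def decompose_uncertainty(evidence):
    """Split Dirichlet predictive uncertainty into aleatoric and epistemic parts."""
    alpha = [e + 1.0 for e in evidence]            # Dirichlet parameters
    strength = sum(alpha)
    mean = [a / strength for a in alpha]           # expected class probabilities
    total = -sum(p * log(p) for p in mean)         # entropy of the mean prediction
    aleatoric = -sum(                              # expected entropy under the Dirichlet
        p * (digamma(a + 1.0) - digamma(strength + 1.0))
        for p, a in zip(mean, alpha)
    )
    return {"aleatoric": aleatoric, "epistemic": total - aleatoric}


# Invented 3-class evidence: a scarce sample (no evidence) vs. an ambiguous one
# (plenty of evidence, but split evenly across classes).
scarce = decompose_uncertainty([0.0, 0.0, 0.0])
ambiguous = decompose_uncertainty([50.0, 50.0, 50.0])
```

Both samples have the same uniform mean prediction, yet the scarce one carries far more epistemic uncertainty and the ambiguous one more aleatoric uncertainty, which is the kind of distinction that lets a method treat hard tail samples and noisy samples differently.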
Details
Motivation: In remote sensing, ground objects occur with long-tailed frequency, and existing methods struggle to separate hard-to-learn tail samples from noisy, ambiguous ones, which leads to overfitting on noise.
Method: Building on evidential deep learning, proposes the DUAL framework, which dynamically decomposes uncertainty into epistemic uncertainty (EU) and aleatoric uncertainty (AU); EU guides the reweighting of hard-to-learn tail samples, and AU drives adaptive label smoothing to suppress the effect of noise.
Result: Experiments on multiple datasets and backbones show that the method outperforms strong baselines such as TGN and SADE and generalizes well.
Conclusion: DUAL effectively disentangles sample scarcity from data ambiguity, improves performance on long-tailed distributions through uncertainty decomposition, and is model-agnostic with practical potential.
Abstract: Long-tailed distributions are pervasive in remote sensing due to the inherently imbalanced occurrence of grounded objects. However, a critical challenge remains largely overlooked, i.e., disentangling hard tail data samples from noisy ambiguous ones. Conventional methods often indiscriminately emphasize all low-confidence samples, leading to overfitting on noisy data. To bridge this gap, building upon Evidential Deep Learning, we propose a model-agnostic uncertainty-aware framework termed DUAL, which dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). Specifically, we introduce EU as an indicator of sample scarcity to guide a reweighting strategy for hard-to-learn tail samples, while leveraging AU to quantify data ambiguity, employing an adaptive label smoothing mechanism to suppress the impact of noise. Extensive experiments on multiple datasets across various backbones demonstrate the effectiveness and generalization of our framework, surpassing strong baselines such as TGN and SADE. Ablation studies provide further insights into the crucial choices of our design.
[58] SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
Jun-Jee Chao,Volkan Isler
Main category: cs.CV
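TL;DR: This paper proposes SV-GS, a framework for reconstructing dynamic targets moving over large areas from sparse observations, using a skeleton-driven deformation field to estimate motion while preserving geometric detail.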
Details
Motivation: Existing dynamic reconstruction methods depend on dense multi-view video and are hard to apply in real scenes, where observations are sparse in both time and viewpoint. A reconstruction method that works under sparse observations is needed.
Method: SV-GS uses a rough skeleton graph and an initial static reconstruction as guidance and optimizes a skeleton-driven deformation field composed of a skeleton joint pose estimator and a fine-grained deformation module; only the joint pose estimator is time-dependent, which allows smooth interpolation while preserving geometric detail.
Result: Improves PSNR by up to 34% over existing methods on synthetic data, matches dense monocular video methods on real-world data with fewer frames, and shows that the initial static input can be replaced by a diffusion prior.
Conclusion: SV-GS achieves high-quality dynamic reconstruction under sparse observations, and the generative prior makes it more practical, offering a workable approach to dynamic reconstruction in real scenes.
Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames.
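For readers unfamiliar with the metric behind the 34% figure above, PSNR is defined as 10·log10(MAX² / MSE) between a reference image and a rendering. The sketch below computes it on flat lists of pixel intensities in [0, 1]; the pixel values are invented and the function is a generic illustration, not the paper's evaluation code.

```python
from math import log10


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return float("inf")             # identical images
    return 10 * log10(max_value ** 2 / mse)


# Invented 4-pixel example with intensities in [0, 1].
reference = [0.0, 0.5, 1.0, 0.5]
rendered = [0.1, 0.5, 0.9, 0.5]
score = psnr(reference, rendered)       # MSE = 0.005, so roughly 23 dB
```

Since PSNR is logarithmic in the error, a relative gain stated in percent of dB corresponds to a much larger relative reduction in mean squared error.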
Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.
[59] Towards Automated Differential Diagnosis of Skin Diseases Using Deep Learning and Imbalance-Aware Strategies
Ali Anaissi,Ali Braytee,Weidong Huang,Junaid Akram,Alaa Farhat,Jie Hua
Main category: cs.CV
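TL;DR: This project develops a deep learning model for skin disease classification based on the Swin Transformer architecture, reaching 87.71% accuracy across eight skin lesion classes on the ISIC2019 dataset.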
Details
Motivation: Skin diseases are increasingly common while dermatologists remain scarce, so intelligent tools are needed to help patients and clinicians reach timely, accurate diagnoses.
Method: Pretrains on publicly available skin disease image datasets, adopts the Swin Transformer architecture, optimizes the data preprocessing workflow, and applies targeted data augmentation.
Result: The model reaches a prediction accuracy of 87.71% on the eight-class skin lesion classification task of the ISIC2019 dataset.
Conclusion: The model has potential as a diagnostic support tool for clinicians and a self-assessment aid for patients.
Abstract: As dermatological conditions become increasingly common and the availability of dermatologists remains limited, there is a growing need for intelligent tools to support both patients and clinicians in the timely and accurate diagnosis of skin diseases. In this project, we developed a deep learning-based model for the classification and diagnosis of skin conditions. By leveraging pretraining on publicly available skin disease image datasets, our model effectively extracted visual features and accurately classified various dermatological cases. Throughout the project, we refined the model architecture, optimized data preprocessing workflows, and applied targeted data augmentation techniques to improve overall performance. The final model, based on the Swin Transformer, achieved a prediction accuracy of 87.71 percent across eight skin lesion classes on the ISIC2019 dataset. These results demonstrate the model's potential as a diagnostic support tool for clinicians and a self-assessment aid for patients.
[60] TimeColor: Flexible Reference Colorization via Temporal Concatenation
Bryan Constantine Sadihin,Yihao Meng,Michael Hua Wang,Matteo Jiahao Chen,Hang Su
Main category: cs.CV
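TL;DR: TimeColor is a sketch-based video colorization model that supports heterogeneous, variable-count reference inputs and uses explicit region assignment and spatiotemporal attention to improve color fidelity, identity consistency, and temporal stability.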
Details
Motivation: Existing colorization models usually condition on a single reference frame and ignore other useful conditions such as character sheets, background images, or multiple colorized frames, which leads to color leakage and identity confusion.
Method: TimeColor encodes multiple reference images as additional latent frames and concatenates them along the temporal dimension so that they are processed together at every diffusion step; it introduces spatiotemporal correspondence-masked attention to bind subjects to references and uses modality-disjoint RoPE indexing to prevent cross-identity color leakage.
Result: Experiments on the SAKUGA-42M dataset show that TimeColor outperforms prior methods in both single- and multi-reference settings, improving color fidelity, identity consistency, and temporal stability.
Conclusion: By flexibly using multiple kinds of reference information together with new attention mechanisms, TimeColor addresses color leakage and consistency problems in video colorization and offers a new approach to high-quality colorization in complex scenes.
Abstract: Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model's parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.
[61] VisNet: Efficient Person Re-Identification via Alpha-Divergence Loss, Feature Fusion and Dynamic Multi-Task Learning
Anns Ijaz,Muhammad Azeem Javed
Main category: cs.CV
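TL;DR: This paper proposes VisNet, an efficient and effective person re-identification model that combines multi-scale feature fusion, semantic clustering based on anatomical body partitioning, and dynamic weight averaging to achieve good performance at low computational cost.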
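The dynamic weight averaging mentioned in the summary above can be illustrated with the formulation commonly used in multi-task learning, where each task's weight depends on how quickly its loss has been falling. Whether VisNet uses exactly this formula is an assumption; the function name `dwa_weights`, the temperature value, and the loss values are illustrative.

```python
from math import exp


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic weight averaging: tasks whose loss falls slowly get larger weights."""
    # Ratio near 1 means the loss barely moved; a small ratio means fast progress.
    ratios = [cur / old for cur, old in zip(prev_losses, prev_prev_losses)]
    scores = [exp(r / temperature) for r in ratios]
    num_tasks = len(ratios)
    # Softmax over the ratios, rescaled so the weights sum to the number of tasks.
    return [num_tasks * s / sum(scores) for s in scores]


# Invented losses for two tasks (e.g. classification, metric learning)
# at epochs t-1 and t-2.
weights = dwa_weights(prev_losses=[0.9, 0.5], prev_prev_losses=[1.0, 1.0])
```

Here the first task improved by only 10% while the second halved its loss, so the first task receives the larger weight for the next epoch, which keeps slower objectives from being drowned out.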
Details
Motivation: Existing person re-identification methods are accurate but computationally expensive and hard to deploy in resource-constrained settings, so a solution that balances accuracy and efficiency is needed.
Method: Proposes VisNet, which fuses multi-scale features from multiple ResNet50 stages with automatic attention, introduces semantic clustering with rule-based pseudo-labels to impose spatial constraints, and uses dynamic weight averaging and the FIDI loss to optimize classification and metric learning.
Result: Reaches 87.05% Rank-1 accuracy and 77.65% mAP on the Market-1501 dataset with 32.41M parameters and 4.601 GFLOPs.
Conclusion: VisNet delivers competitive performance at low computational cost and suits real-time person re-identification in resource-constrained settings such as surveillance and mobile devices.
Abstract: Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but with high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It is the culmination of conceptual contributions, including feature fusion at multiple scales with automatic attention on each, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification and semantic regularization, and the use of the FIDI loss function for improved metric learning. The multi-scale fusion combines ResNet50's stages 1 through 4 without the use of parallel paths, with semantic clustering introducing spatial constraints through the use of rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset, with 32.41M parameters and 4.601 GFLOPs, offering a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.
[62] ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition
Feng-Qi Cui,Jinyang Huang,Sirui Zhao,Jinglong Guo,Qifan Cai,Xin Yan,Zhi Liu
Main category: cs.CV
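TL;DR: Proposes ReMA, a new data augmentation method for video behavior recognition that controls the mixing process to make representations more stable and discriminative.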
Details
Motivation: Existing video data augmentation methods introduce uncontrolled variations that weaken intra-class structure and representation stability.
Method: Designs Representation-aware Mixing Augmentation (ReMA), which comprises a Representation Alignment Mechanism (RAM) and a Dynamic Selection Mechanism (DSM) to perform intra-class mixing and localize perturbations.
Result: ReMA is validated on multiple video behavior recognition benchmarks, improving generalization and robustness across spatiotemporal granularities.
Conclusion: ReMA is a plug-and-play augmentation strategy that needs no additional supervision or trainable parameters and effectively strengthens video representation learning.
Abstract: Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, which ultimately weakens intra-class distributional structure and causes representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. Firstly, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
[63] Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation
Siyan Fang,Long Peng,Yuntao Wang,Ruonan Wei,Yuehuan Wang
Main category: cs.CV
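TL;DR: This paper proposes the Depth-Memory Decoupling Network (DMDNet) for daytime and nighttime image reflection separation. It introduces depth-aware scanning and a state-space model, draws on cross-image historical knowledge, and outperforms existing methods, including on the newly built NightIRS dataset.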
Details
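Motivation: Existing single-image reflection separation methods struggle to tell the transmission layer from the reflection layer when their contrasts are similar, a problem that is worse at night, and they lack mechanisms for exploiting structural information and external knowledge.
Method: Proposes DMDNet, which includes Depth-Aware Scanning (DAScan) to guide Mamba to propagate information along semantic coherence, a Depth-Synergized State-Space Model (DS-SSM) that modulates the sensitivity of state activations by depth, and a Memory Expert Compensation Module (MECM) that uses cross-image historical knowledge for layer-specific compensation. The NightIRS nighttime reflection separation dataset is also constructed.
Result: Experiments show that DMDNet outperforms state-of-the-art methods on both daytime and nighttime reflection separation, with stronger robustness and separation accuracy in nighttime scenes in particular.
Conclusion: By combining depth information, state modulation, and compensation from historical knowledge, DMDNet improves the disentanglement of reflection and transmission layers and offers a new solution for reflection separation under complex lighting.
Abstract: Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime.
[64] HarmoniAD: Harmonizing Local Structures and Global Semantics for Anomaly Detection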
Naiqi Zhang,Chuancheng Shi,Jingtong Dou,Wenhua Wu,Fei Shen,Jianhua Cao
Main category: cs.CV
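TL;DR: Proposes HarmoniAD, a frequency-guided dual-branch framework that separates high- and low-frequency features to model structure and semantics jointly, balancing fine-grained detail and global semantics in industrial anomaly detection.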
Details
Motivation: Existing methods trade structure against semantics: structure-oriented models are sensitive to noise, while semantics-oriented models tend to miss subtle defects. A method that captures both fine structure and global semantics is needed to improve the detection of tiny defects.
Method: Extracts features with the CLIP image encoder, transforms them into the frequency domain, and decouples them into high- and low-frequency branches. The high-frequency branch uses a fine-grained structural attention module (FSAM) to enhance textures and edges; the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. A multi-class joint training strategy is used for optimization.
Result: Achieves state-of-the-art performance on the MVTec-AD, VisA, and BTAD datasets, with both high sensitivity and strong robustness.
Conclusion: HarmoniAD balances structural and semantic modeling and clearly improves the detection of tiny defects in industrial anomaly detection, giving it practical value.
Abstract: Anomaly detection is crucial in industrial product quality inspection. Failing to detect tiny defects often leads to serious consequences. Existing methods face a structure-semantics trade-off: structure-oriented models (such as frequency-based filters) are noise-sensitive, while semantics-oriented models (such as CLIP-based encoders) often miss fine details. To address this, we propose HarmoniAD, a frequency-guided dual-branch framework. Features are first extracted by the CLIP image encoder, then transformed into the frequency domain, and finally decoupled into high- and low-frequency paths for complementary modeling of structure and semantics. The high-frequency branch is equipped with a fine-grained structural attention module (FSAM) to enhance textures and edges for detecting small anomalies, while the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. Together, these branches balance fine detail and global semantics. HarmoniAD further adopts a multi-class joint training strategy, and experiments on MVTec-AD, VisA, and BTAD show state-of-the-art performance with both sensitivity and robustness.
[65] Joint Geometry-Appearance Human Reconstruction in a Unified Latent Space via Bridge Diffusion
Yingzhi Tang,Qijian Zhang,Junhui Hou
Main category: cs.CV
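TL;DR: This paper proposes JGA-LBD, a new framework that unifies geometry and appearance modeling in a joint latent representation and uses bridge diffusion for generation, achieving consistent, high-fidelity reconstruction of 3D digital humans from a single RGB image.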
Details
Motivation: Existing methods usually decouple geometry estimation from appearance synthesis, which produces inconsistent reconstructions that are hard to unify. A method that models geometry and appearance jointly is needed to improve reconstruction quality.
Method: Proposes the JGA-LBD framework, which unifies heterogeneous input conditions (such as depth maps and SMPL models) into 3D Gaussian representations and compresses them into a unified latent space with a shared sparse variational autoencoder (VAE); a bridge diffusion mechanism starts from a partially observed latent code and infers the missing parts; a dedicated decoding module then recovers the full 3D structure and renders novel views.
Result: Experiments show that JGA-LBD outperforms current state-of-the-art methods in both geometric fidelity and appearance quality, particularly in challenging in-the-wild scenes.
Conclusion: With a unified latent representation and bridge diffusion generation, JGA-LBD achieves consistent, high-quality geometry and appearance reconstruction of 3D digital humans and shows good practicality and potential for extension.
Abstract: Achieving consistent and high-fidelity geometry and appearance reconstruction of 3D digital humans from a single RGB image is inherently a challenging task. Existing studies typically resort to decoupled pipelines for geometry estimation and appearance synthesis, often hindering unified reconstruction and causing inconsistencies. This paper introduces JGA-LBD, a novel framework that unifies the modeling of geometry and appearance into a joint latent representation and formulates the generation process as bridge diffusion. Observing that directly integrating heterogeneous input conditions (e.g., depth maps, SMPL models) leads to substantial training difficulties, we unify all conditions into 3D Gaussian representations, which can be further compressed into a unified latent space through a shared sparse variational autoencoder (VAE). Subsequently, the specialized form of bridge diffusion makes it possible to start from a partial observation of the target latent code and focus solely on inferring the missing components. Finally, a dedicated decoding module extracts the complete 3D human geometric structure and renders novel views from the inferred latent representation. Experiments demonstrate that JGA-LBD outperforms current state-of-the-art approaches in terms of both geometry fidelity and appearance quality, including in challenging in-the-wild scenarios. Our code will be made publicly available at https://github.com/haiantyz/JGA-LBD.
[66] Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation
Bruce Mugizi,Sudi Murindanyi,Olivia Nakacwa,Andrew Katumba
Main category: cs.CV
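TL;DR: Proposes a real-time, computer-vision-based intelligent traffic surveillance system that performs vehicle detection, license plate recognition, and speed estimation in resource-constrained environments and supports enforcement by issuing tickets automatically via SMS.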
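A minimal sketch of region-based speed estimation of the kind summarized above: if the real-world distance between a source and a target region of interest is known, speed follows from the number of frames a vehicle takes to travel between them. The entry does not spell out the paper's exact procedure, so this distance-over-time formulation, the function name, and all numbers are assumptions.

```python
def estimate_speed_kmh(frame_enter, frame_exit, fps, roi_distance_m):
    """Speed from the frames at which a vehicle crosses two regions of interest."""
    elapsed_frames = frame_exit - frame_enter
    if elapsed_frames <= 0:
        raise ValueError("the vehicle must reach the target ROI after the source ROI")
    elapsed_s = elapsed_frames / fps
    return roi_distance_m / elapsed_s * 3.6    # convert m/s to km/h


# Invented example: 30 frames at 30 fps between ROIs that are 20 m apart.
speed = estimate_speed_kmh(frame_enter=120, frame_exit=150, fps=30, roi_distance_m=20.0)
```

With this setup a one-frame timing error changes the estimate by a few km/h (here, 29 frames instead of 30 gives about 74.5 km/h rather than 72), which suggests why frame rate and ROI spacing matter for the error margin of such a system.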
Details
Motivation: Speeding is a leading cause of road traffic accidents, especially in developing countries such as Uganda where infrastructure is limited, so low-cost, efficient traffic surveillance solutions are urgently needed.
Method: Uses YOLOv8 for license plate detection, CNN and Transformer models for character recognition, and regions of interest for speed estimation, and issues tickets automatically by SMS through the Africa's Talking API.
Result: YOLOv8 license plate detection reaches 97.9% mAP; the Transformer reduces the character error rate to 1.79%; speed estimation error stays within 10 km/h; a database was built and automatic SMS notification implemented.
Conclusion: The system meets traffic management needs in resource-constrained regions and has the potential to reduce traffic accidents in developing countries.
Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding a margin of error of 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance by SMS through the Africa's Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
[67] OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning
Liuxiang Qiu,Hui Da,Yuzhen Niu,Tiesong Zhao,Yang Cao,Zheng-Jun Zha
Main category: cs.CV
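TL;DR: Proposes the OmniVaT framework to address modality discrepancy and domain shift in visual-tactile learning, tackling the task of single domain generalization for multimodal VTL (SDG-VTL) for the first time.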
Details
Motivation: The visual and tactile modalities differ substantially, and non-standardized tactile sensors and inconsistent data collection create domain gaps that limit the generalization of existing methods.
Method: Proposes the OmniVaT framework, which includes a multimodal fractional Fourier adapter (MFFA) that maps visual and tactile embeddings into a unified embedding-frequency space to mitigate the modality gap, and a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, improving adaptivity to shifts in unseen domains.
Result: Extensive experiments on the SDG-VTL task show that OmniVaT clearly outperforms existing methods in cross-domain generalization.
Conclusion: OmniVaT is the first to successfully address single domain generalization for multimodal visual-tactile learning, overcoming both modality discrepancy and domain shift, with good prospects for practical application.
Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose the OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.
[68] Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers
Söhnke Benedikt Fischedick,Daniel Seichter,Benedict Stephan,Robin Schmidt,Horst-Michael Gross
Main category: cs.CV
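TL;DR: Proposes DVEFormer, an efficient RGB-D Transformer-based model that learns dense text-aligned visual embeddings via knowledge distillation, supporting fine-grained semantic understanding, natural-language querying, and 3D mapping while remaining real-time capable and flexible.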
Details
Motivation: For robots in domestic environments to interact effectively with non-expert users, they need a more comprehensive and flexible understanding of their surroundings, which traditional fixed-class semantic segmentation cannot provide.
Method: Teacher embeddings generated by Alpha-CLIP guide the lightweight student model DVEFormer through knowledge distillation; the student outputs dense, pixel-wise, text-aligned visual embeddings that support semantic segmentation via linear probing as well as free-text queries.
Result: Achieves competitive performance on common indoor datasets; the full model runs at 26.3 FPS on an NVIDIA Jetson AGX Orin and a smaller variant at 77.0 FPS, and the approach supports applications such as 3D semantic mapping and natural-language interaction.
Conclusion: DVEFormer can serve as a drop-in replacement for traditional semantic segmentation while supporting flexible natural-language queries and seamless integration into 3D mapping systems for mobile robots.
Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer, an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
[69] Mask-Conditioned Voxel Diffusion for Joint Geometry and Color Inpainting
Aarya Sumuk
Main category: cs.CV
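TL;DR: Proposes a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, aimed at the digital restoration of cultural heritage.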
Details
Motivation: Driven by the digital restoration of cultural heritage artifacts, there is a need to repair both the geometry and the color of damaged 3D objects.
Method: In the first stage, a 2D convolutional network predicts damage masks on RGB slices and aggregates them into a voxel mask; in the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting on voxel grids, jointly predicting occupancy and color with a composite loss.
Result: Evaluated on a dataset of textured artifacts with synthetic damage, the method produces more complete geometry and more coherent color reconstructions than symmetry-based baselines at 32^3 resolution.
Conclusion: Explicit mask conditioning is a practical way to guide voxel diffusion models for joint 3D geometry and color inpainting.
Abstract: We present a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The pipeline separates damage localization from reconstruction. In the first stage, a 2D convolutional network predicts damage masks on RGB slices extracted from a voxelized object, and these predictions are aggregated into a volumetric mask. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting directly on voxel grids, reconstructing geometry and color while preserving observed regions. The model jointly predicts occupancy and color using a composite objective that combines occupancy reconstruction with masked color reconstruction and perceptual regularization. We evaluate the approach on a curated set of textured artifacts with synthetically generated damage using standard geometric and color metrics. Compared to symmetry-based baselines, our method produces more complete geometry and more coherent color reconstructions at a fixed 32^3 resolution. Overall, the results indicate that explicit mask conditioning is a practical way to guide volumetric diffusion models for joint 3D geometry and color inpainting.
[70] BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition
Seungyeon Cho,Tae-kyun Kim
Main category: cs.CV
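TL;DR: Proposes a skeleton-based action recognition method built on a probabilistic dual-stream framework that fuses multiple modalities to improve fine-grained recognition.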
Details
Motivation: Most existing methods focus on large-scale body motion and neglect subtle hand movements, which makes fine-grained action recognition difficult.
Method: Designs a calibration-free preprocessing pipeline, combines it with a Noisy-OR-based probabilistic fusion mechanism, and fuses skeleton modalities (joint, bone, motion) with RGB information in a unified framework.
Result: Validated on several standard datasets (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly built hand-centric benchmark, showing improved robustness and accuracy.
Conclusion: The framework improves the perception of subtle movements in skeleton-based action recognition, with particular advantages under noisy and cross-modal conditions.
Abstract: Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
[71] NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Yuxue Yang,Lue Fan,Ziqi Shi,Junran Peng,Feng Wang,Zhaoxiang Zhang
Main category: cs.CV
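TL;DR: NeoVerse is a scalable 4D world model that supports pose-free feed-forward 4D reconstruction and online monocular degradation pattern simulation, and shows strong reconstruction and generation ability on diverse in-the-wild videos.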
Details
Motivation: Existing 4D world modeling methods lack scalability because they depend on costly multi-view 4D data or on cumbersome training pre-processing.
Method: Proposes NeoVerse, which uses pose-free feed-forward 4D reconstruction, online monocular degradation simulation, and related techniques to achieve scalable 4D modeling from monocular videos.
Result: Achieves state-of-the-art performance on standard reconstruction and generation benchmarks, with good cross-domain generalization and potential for a range of downstream applications.
Conclusion: Through a set of mutually aligned designs, NeoVerse enables efficient, scalable 4D modeling of diverse in-the-wild videos and performs strongly on both reconstruction and generation.
Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io
[72] RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection
Tao Wu,Qing Xu,Xiangjian He,Oakleigh Weekes,James Brown,Wenting Duan
Main category: cs.CV
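TL;DR: Introduces RoLID-11K, the first large-scale dataset for detecting roadside litter from dashcams, with more than 11k annotated images covering diverse UK driving scenes and exhibiting pronounced long-tail and small-object characteristics. The study benchmarks a range of modern detectors and finds that Transformer-based models achieve the best localisation accuracy while real-time models are limited by coarse feature hierarchies. The dataset provides a new benchmark for extreme small-object detection in dynamic driving scenes and aims to support low-cost, scalable roadside litter monitoring.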
Details
Motivation: Existing litter detection datasets mostly target street-level still images, aerial scenes, or aquatic environments, and do not reflect dashcam footage, where litter is extremely small, sparse, and set against cluttered backgrounds. Current monitoring also relies on manual surveys and public reporting, which limits coverage. A high-quality dataset specific to the dashcam viewpoint is therefore needed to advance automated, wide-area monitoring.
Method: Builds the large-scale RoLID-11K dataset of more than 11,000 annotated dashcam images from diverse UK driving conditions, emphasizing small objects and long-tail distributions, and systematically benchmarks a range of modern object detectors on it, including accuracy-oriented Transformer-based models and real-time YOLO models.
Result: Transformer-based models such as CO-DETR achieve the best localisation accuracy, while real-time detectors remain limited on small objects because of their coarse feature hierarchies. RoLID-11K proves to be a demanding benchmark, particularly for small-object detection.
Conclusion: RoLID-11K is the first large-scale dashcam dataset for roadside litter detection; it offers a new benchmark for extreme small-object detection and supports research on low-cost, scalable automated roadside litter monitoring.
Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.
[73] ABFR-KAN: Kolmogorov-Arnold Networks for Functional Brain Analysis
Tyler Ward,Abdullah Imran
Main category: cs.CV
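TL;DR: Proposes ABFR-KAN, a new functional brain network representation method based on Transformers and Kolmogorov-Arnold Networks (KANs) for autism spectrum disorder (ASD) classification; it overcomes the bias introduced by traditional atlas-based parcellation and outperforms existing methods on the ABIDE I dataset.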
Details
Motivation: Traditional atlas-based functional connectivity analysis suffers from selection bias and ignores subject specificity, which undermines the reliability of brain disorder diagnosis.
Method: Proposes ABFR-KAN, which combines a Transformer architecture with KANs to build a functional brain network representation that conforms better to anatomy and reduces structural bias, improving the accuracy of FC estimation and model generalization.
Result: In cross-site evaluation and ablation studies on the ABIDE I dataset, ABFR-KAN consistently outperforms existing state-of-the-art methods on ASD classification.
Conclusion: ABFR-KAN mitigates the bias introduced by traditional parcellation and improves the reliability and subject adaptivity of functional connectivity analysis, offering a better approach to computer-aided brain disorder diagnosis.
Abstract: Functional connectivity (FC) analysis, a valuable tool for computer-aided brain disorder diagnosis, traditionally relies on atlas-based parcellation. However, issues relating to selection bias and a lack of regard for subject specificity can arise as a result of such parcellations. Addressing this, we propose ABFR-KAN, a transformer-based classification network that incorporates novel advanced brain function representation components with the power of Kolmogorov-Arnold Networks (KANs) to mitigate structural bias, improve anatomical conformity, and enhance the reliability of FC estimation. Extensive experiments on the ABIDE I dataset, including cross-site evaluation and ablation studies across varying model backbones and KAN configurations, demonstrate that ABFR-KAN consistently outperforms state-of-the-art baselines for autism spectrum disorder (ASD) classification. Our code is available at https://github.com/tbwa233/ABFR-KAN.
[74] Robust Assembly Progress Estimation via Deep Metric Learning
Kazuma Miura,Sarthak Pathak,Kazunori Umeda
Main category: cs.CV
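TL;DR: Proposes Anomaly Quadruplet-Net, a quadruplet-loss-based method that accurately estimates manual assembly progress under subtle visual changes or occlusion, and is particularly suited to small datasets.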
Details
Motivation: Existing methods such as Anomaly Triplet-Net tend to misclassify when visual changes between consecutive tasks are subtle or parts are occluded, which makes it hard to estimate the progress of manual assembly carried out over multiple days.
Method: Proposes a deep metric learning method based on Quadruplet Loss and a custom data loader that optimizes training sample selection to improve learning from anomaly images.
Result: On the desktop PC assembly image dataset, the proposed method improves accuracy by 1.3% over existing methods and reduces misclassification between adjacent tasks by 1.9%.
Conclusion: With a small dataset, Anomaly Quadruplet-Net is more robust to subtle visual changes and occlusion and estimates assembly progress more accurately, improving automated monitoring in smart factories.
Abstract: In recent years, the advancement of AI technologies has accelerated the development of smart factories. In particular, the automatic monitoring of product assembly progress is crucial for improving operational efficiency, minimizing the cost of discarded parts, and maximizing factory productivity. However, in cases where assembly tasks are performed manually over multiple days, implementing smart factory systems remains a challenge. Previous work has proposed Anomaly Triplet-Net, which estimates assembly progress by applying deep metric learning to the visual features of products. Nevertheless, when visual changes between consecutive tasks are subtle, misclassification often occurs. To address this issue, this paper proposes a robust system for estimating assembly progress, even in cases of occlusion or minimal visual change, using a small-scale dataset. Our method leverages a Quadruplet Loss-based learning approach for anomaly images and introduces a custom data loader that strategically selects training samples to enhance estimation accuracy. We evaluated our approach using an image dataset captured during desktop PC assembly. The proposed Anomaly Quadruplet-Net outperformed existing methods on the dataset. Specifically, it improved the estimation accuracy by 1.3% and reduced misclassification between adjacent tasks by 1.9% on the desktop PC dataset, demonstrating the effectiveness of the proposed method.
[75] CPPO: Contrastive Perception for Vision Language Policy Optimization
Ahmad Rezaei,Mohsen Gholami,Saeed Ranjbar Alvar,Kevin Cannons,Mohammad Asiful Hossain,Zhou Weimin,Shunbo Zhou,Yong Zhang,Mohammad Akbari
Main category: cs.CV
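TL;DR: CPPO is a contrastive perception policy optimization method for fine-tuning vision-language models; it identifies perception tokens by detecting entropy shifts under input-image perturbations and adds a Contrastive Perception Loss (CPL) to improve both perception and reasoning in multimodal reasoning.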
Details
Motivation: When extended to multimodal reasoning, existing reinforcement learning methods struggle to separate perception tokens from reasoning tokens; they rely on extra models or annotated data, train inefficiently, and scale poorly.
Method: CPPO detects perception tokens by analyzing how the entropy of the model output changes under perturbed input images, and designs a Contrastive Perception Loss (CPL) that enforces output consistency under information-preserving perturbations and sensitivity under information-removing ones, thereby extending the RL objective.
Result: Experiments show that CPPO outperforms previous perception-rewarding methods and improves training efficiency and scalability without extra models.
Conclusion: CPPO addresses the difficulty of separating perception and reasoning tokens in multimodal reinforcement learning and offers an efficient, scalable way to fine-tune VLMs.
Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
[76] MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation
Miaowei Wang,Jakub Zadrożny,Oisin Mac Aodha,Amir Vaxman
Main category: cs.CV
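TL;DR: MotionPhysics is an end-to-end differentiable framework that infers plausible physical parameters for objects in a 3D scene from a user-provided natural language prompt, enabling realistic dynamic simulation without guidance from ground-truth trajectories or annotated videos.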
Details
Motivation: Traditional physical simulation requires expert knowledge and time-consuming parameter tuning to obtain the desired dynamics; MotionPhysics aims to lower this barrier by inferring plausible physical parameters automatically from natural language.
Method: Uses a multimodal large language model to parse the natural language prompt and estimate material parameters, and combines it with a learnable motion distillation loss that extracts motion priors from pretrained video diffusion models while reducing the influence of appearance and geometry biases on the simulation.
Result: Validated on more than thirty scenarios with real-world, human-designed, and AI-generated 3D objects, covering materials such as elastic solids, metals, foams, sand, and Newtonian and non-Newtonian fluids; the resulting simulations are visually realistic and surpass existing state-of-the-art methods.
Conclusion: MotionPhysics achieves high-quality, physically plausible dynamic simulation guided by natural language, improving the efficiency and usability of simulation and opening it to a broader range of users.
Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.
[77] All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations
Wenrui Li,Hongtao Chen,Yao Xiao,Wangmeng Zuo,Jiantao Zhou,Yonghong Tian,Xiaopeng Fan
Main category: cs.CV
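TL;DR: Introduces the problem of smoothly evolving unknown degradations (SEUD) in video and proposes ORCANet, an all-in-one restoration model that handles single, compound, and temporally evolving degradations, achieving high-quality, temporally consistent video restoration through coarse dehazing and dynamic prompt generation.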
Details
Motivation: Existing methods focus mainly on frame-wise degradation variation and overlook the temporal continuity of real-world degradation processes, so they handle poorly the case where degradation type and intensity change smoothly over time.
Method: Introduces the SEUD scenario and a flexible synthesis pipeline that generates temporally coherent degraded videos; designs ORCANet, which contains a Coarse Intensity Estimation Dehazing (CIED) module based on physical priors and a Flow Prompt Generation (FPG) module that extracts static and dynamic degradation features, and adds a label-aware supervision mechanism to make static prompts more discriminative.
Result: Experiments show that ORCANet outperforms existing image and video baselines in restoration quality, temporal consistency, and robustness.
Conclusion: ORCANet copes with complex, continuously changing unknown degradations in video and offers a new solution for all-in-one video restoration.
Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. However, extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines.
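The SEUD setting described in this entry, where degradation intensity changes continuously across frames, can be illustrated with a small sketch. This is not the authors' synthesis pipeline or the CIED module: the cosine easing schedule, the function names, the flat grayscale frame layout, and the choice of the standard atmospheric scattering model I = J*t + A*(1 - t) with t = exp(-beta * depth) as the haze model are all assumptions made for illustration.

```python
import math

def smooth_schedule(start, end, num_frames):
    """Hypothetical helper: per-frame intensity easing from start to end (cosine ease)."""
    if num_frames == 1:
        return [start]
    values = []
    for i in range(num_frames):
        u = i / (num_frames - 1)                 # normalized time in [0, 1]
        w = 0.5 - 0.5 * math.cos(math.pi * u)    # smooth weight, 0 at u=0 and 1 at u=1
        values.append(start + (end - start) * w)
    return values

def add_haze(pixel, depth, beta, airlight=1.0):
    """Atmospheric scattering model: I = J*t + A*(1 - t), with t = exp(-beta * depth)."""
    t = math.exp(-beta * depth)
    return pixel * t + airlight * (1.0 - t)

def synthesize_hazy_clip(frames, depths, beta_start, beta_end):
    """Apply haze whose density beta evolves smoothly over the clip.

    frames: list of frames, each a flat list of grayscale values in [0, 1]
    depths: flat list of per-pixel depths shared by all frames (an assumption)
    """
    betas = smooth_schedule(beta_start, beta_end, len(frames))
    clip = []
    for frame, beta in zip(frames, betas):
        clip.append([add_haze(p, d, beta) for p, d in zip(frame, depths)])
    return clip, betas

if __name__ == "__main__":
    clean = [[0.2, 0.4]] * 5          # five identical two-pixel frames
    hazy, betas = synthesize_hazy_clip(clean, [1.0, 3.0], 0.0, 2.0)
    for beta, frame in zip(betas, hazy):
        print(round(beta, 3), [round(v, 3) for v in frame])
```

Because the schedule is continuous in time, adjacent frames differ only slightly in haze density, which is the property that distinguishes this setting from frame-wise random degradation. A compound or transitioning degradation could be sketched the same way by running one schedule per degradation type.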
Code is available at https://github.com/Friskknight/ORCANet-SEUD.
[78] FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection
Ruiqiang Zhang,Hengyi Wang,Chang Liu,Guanjie Wang,Zehua Ma,Weiming Zhang
Main category: cs.CV
TL;DR: 本文提出了一种无需训练的文本到图像生成框架FreeText,通过利用扩散Transformer(DiT)模型的内在机制,显著提升多行、密集和长尾文字(如中文)的文本渲染质量。
Details
Motivation: Existing text-to-image models fall short on precise text rendering, especially for complex layouts and non-Latin scripts, and prior fixes depend on costly retraining or fixed layout constraints that reduce flexibility and aesthetic quality.
Method: FreeText decomposes the problem into "where to write" and "what to write". For the former, it reads token-wise spatial attribution from image-to-text attention and uses sink-like tokens together with topology-aware refinement to produce high-confidence writing-region masks; for the latter, it proposes Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior in the frequency domain to strengthen glyph structure and suppress semantic leakage.
Result: Experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across LongText-Benchmark, CVTG, and CLT-Bench show that FreeText clearly improves text readability while preserving semantic alignment and aesthetic quality, with only modest inference overhead.
Conclusion: FreeText is a training-free, plug-and-play framework that addresses key challenges of complex text rendering in diffusion models and is general and practical.
Abstract: Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into where to write and what to write. For where to write, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For what to write, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
[79] Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios
Guangqian Guo,Pengfei Chen,Yong Guo,Huafeng Chen,Boqiang Zhang,Shan Gao
Main category: cs.CV
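TL;DR: Proposes VNS-SAM, which improves SAM's segmentation in visually non-salient scenarios by making better use of its low-level features while preserving zero-shot generalization, and builds VNS-SEG, a unified dataset of more than 35K images, for evaluation.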
Details
Motivation: Existing methods struggle to segment accurately in visually non-salient scenarios, where contrast between foreground and background is low, and SAM's performance is limited in these cases.
Method: Proposes a Mask-Edge Token Interactive decoder and a Non-Salient Feature Mining module that exploit SAM's low-level features, improving the understanding of non-salient characteristics with very small parameter and compute overhead.
Result: Experiments on a variety of visually non-salient segmentation tasks show that VNS-SAM performs strongly in zero-shot settings, and its additional parameters can be optimized within 4 hours.
Conclusion: VNS-SAM substantially improves SAM's segmentation in visually non-salient scenarios and is practical, with broad application potential.
Abstract: Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model's segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications.
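To make the notion of a visually non-salient scenario (low contrast between foreground and background) concrete, the sketch below scores a grayscale image and a binary mask with a simple Michelson-style contrast. The function name and the contrast formula are assumptions for illustration; the abstract does not state how the authors quantify saliency.

```python
def region_contrast(image, mask):
    """Hypothetical measure: |mean_fg - mean_bg| / (mean_fg + mean_bg).

    image: 2D list of grayscale values in [0, 1]
    mask:  2D list of 0/1 flags with the same shape (1 marks foreground)
    """
    fg, bg = [], []
    for img_row, mask_row in zip(image, mask):
        for value, is_fg in zip(img_row, mask_row):
            (fg if is_fg else bg).append(value)
    if not fg or not bg:
        raise ValueError("mask must contain both foreground and background pixels")
    mean_fg = sum(fg) / len(fg)
    mean_bg = sum(bg) / len(bg)
    total = mean_fg + mean_bg
    # Two all-black regions have no measurable contrast.
    return abs(mean_fg - mean_bg) / total if total > 0 else 0.0

if __name__ == "__main__":
    mask = [[1, 1], [0, 0]]
    salient = [[0.9, 0.9], [0.1, 0.1]]          # bright object on a dark background
    non_salient = [[0.50, 0.50], [0.48, 0.48]]  # object nearly matches the background
    print(region_contrast(salient, mask))
    print(region_contrast(non_salient, mask))
```

A score near zero corresponds to the camouflage-like cases the entry describes, where intensity alone gives a segmenter almost nothing to separate the object from its surroundings.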
Code and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.
[80] DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction
Jiacheng Sui,Yujie Zhou,Li Niu
Main category: cs.CV
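TL;DR: Proposes DynaDrag, a new drag-style image editing method under a predict-and-move framework that iterates motion prediction and motion supervision and dynamically adjusts the valid handle points, improving pixel-level image editing.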
Details
Motivation: Existing drag-style image editing methods suffer from missed tracking, ambiguous tracking, a large gap between source and target images, and unreasonable intermediate points, so a more effective method with better editability is needed.
Method: Proposes DynaDrag, the first dragging method under a predict-and-move framework; in each iteration it first predicts where the handle points should move (Motion Prediction) and then drags them accordingly (Motion Supervision), while dynamically adjusting the valid handle points to improve performance.
Result: Experiments on face and human datasets show that DynaDrag outperforms previous methods in editing accuracy and editability.
Conclusion: With its predict-and-move framework and dynamic handle-point adjustment, DynaDrag overcomes several shortcomings of traditional dragging methods and achieves better pixel-level image editing.
Abstract: To achieve pixel-level image manipulation, drag-style image editing, which edits images using points or trajectories as conditions, is attracting widespread attention. Most previous methods follow a move-and-track framework, in which missed tracking and ambiguous tracking are unavoidable issues. Other methods under different frameworks suffer from problems such as a large gap between the source image and the target edited image, as well as unreasonable intermediate points, which can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under a predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively. In each iteration, Motion Prediction first predicts where the handle points should move, and then Motion Supervision drags them accordingly. We also propose to dynamically adjust the valid handle points to further improve the performance. Experiments on face and human datasets showcase the superiority over previous works.
[81] SlingBAG Pro: Accelerating point cloud-based iterative reconstruction for 3D photoacoustic imaging under arbitrary array
Shuang Li,Yibing Wang,Jian Gao,Chulhong Kim,Seongwook Choi,Yu Zhang,Qian Chen,Yao Yao,Changhui Li
Main category: cs.CV
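TL;DR: Proposes SlingBAG Pro, an advanced reconstruction algorithm that addresses reconstruction for irregular-geometry transducer arrays in 3D photoacoustic imaging and substantially improves reconstruction speed and efficiency.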
Details
Motivation: To cope with limited space and high cost, irregular-geometry transducer arrays that conform to specific imaging regions are a promising route to high-quality 3D PAI, but traditional iterative reconstruction algorithms handle such configurations poorly, with high computational complexity, large memory requirements, and long reconstruction times.
Method: Builds on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method, extends its compatibility to arbitrary array geometries, and introduces a hierarchical optimization strategy that combines zero-gradient filtering with progressively increased temporal sampling rates to accelerate convergence.
Result: Achieves up to a 2.2-fold speedup over the original SlingBAG algorithm under irregular arrays; the method is validated through simulation and in vivo mouse experiments.
Conclusion: SlingBAG Pro maintains high reconstruction quality with fewer transducers and markedly shortens reconstruction time, making it suitable for 3D photoacoustic imaging with irregular arrays and giving it potential for clinical use.
Abstract: High-quality three-dimensional (3D) photoacoustic imaging (PAI) is gaining increasing attention in clinical applications. To address the challenges of limited space and high costs, irregular geometric transducer arrays that conform to specific imaging regions are promising for achieving high-quality 3D PAI with fewer transducers. However, traditional iterative reconstruction algorithms struggle with irregular array configurations, suffering from high computational complexity, substantial memory requirements, and lengthy reconstruction times. In this work, we introduce SlingBAG Pro, an advanced reconstruction algorithm based on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method, while extending its compatibility to arbitrary array geometries. SlingBAG Pro maintains high reconstruction quality, reduces the number of required transducers, and employs a hierarchical optimization strategy that combines zero-gradient filtering with progressively increased temporal sampling rates during iteration. This strategy rapidly removes redundant spatial point clouds, accelerates convergence, and significantly shortens overall reconstruction time. Compared to the original SlingBAG algorithm, SlingBAG Pro achieves up to a 2.2-fold speed improvement in point cloud-based 3D PA reconstruction under irregular array geometries. The proposed method is validated through both simulation and in vivo mouse experiments, and the source code is publicly available at https://github.com/JaegerCQ/SlingBAG_Pro.
[82] A Comprehensive Dataset for Human vs. AI Generated Image Detection
Rajarshi Roy,Nasrin Imanpour,Ashhar Aziz,Shashwat Bajpai,Gurpreet Singh,Shwetangshu Biswas,Kapil Wanaskar,Parth Patwa,Subhankar Ghosh,Shreyas Dixit,Nilesh Ranjan Pal,Vipula Rawte,Ritvik Garimella,Gaytri Jena,Vasu Sharma,Vinija Jain,Aman Chadha,Aishwarya Naresh Reganti,Amitava Das
Main category: cs.CV
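TL;DR: Introduces MS COCOAI, a new dataset for detecting AI-generated images containing 96,000 real and synthetic samples, and defines two tasks on it: classifying an image as real or generated, and identifying the model that generated a synthetic image.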
Details
Motivation: As AI-generated images become more realistic and harder to tell apart from real photographs, the risk of misleading content and misinformation grows, so effective detection is urgently needed.
Method: Builds the MS COCOAI dataset on top of MS COCO, creating synthetic images with five generators (Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6) for a total of 96,000 samples, and proposes two tasks: real-versus-generated classification and generator identification.
Result: Releases MS COCOAI, a large-scale dataset for AI-generated image detection that supports both tasks and is publicly available on Hugging Face.
Conclusion: MS COCOAI is a useful resource for detecting AI-generated images; it helps address the misinformation challenge and advances image authenticity verification.
Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI generated image detection consisting of 96000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.
[83] AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models
Jintao Lin,Bowen Dong,Weikang Shi,Chenyang Lei,Suiyun Zhang,Rui Liu,Xihui Liu
Main category: cs.CV
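TL;DR: Proposes AEGIS, a comprehensive multi-task benchmark for evaluating how well unified multimodal models (UMMs) apply world knowledge across tasks, together with a Deterministic Checklist-based Evaluation (DCE) protocol that makes evaluation more reliable. Experiments show that current UMMs have severe world knowledge deficits, with performance dropping sharply under complex reasoning, although simple plug-in reasoning modules partially mitigate the problem.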
Details
Motivation: Existing benchmarks are limited to siloed, single-task evaluation and do not systematically assess how UMMs apply world knowledge across multiple tasks, which makes their real deficits hard to diagnose.
Method: Proposes the AEGIS benchmark, with 1,050 manually annotated hard questions covering 21 topics and 6 reasoning types across visual understanding, generation, editing, and interleaved generation; also proposes the Deterministic Checklist-based Evaluation (DCE) protocol, which replaces ambiguous prompt-based scoring with atomic yes/no judgments.
Result: Most UMMs show severe world knowledge deficits, and performance drops significantly on complex reasoning tasks; DCE improves the reliability and interpretability of evaluation; simple plug-in reasoning modules partially improve model performance.
Conclusion: World-knowledge-driven reasoning is a key frontier for UMMs, and more comprehensive, reliable evaluation is needed to drive progress; AEGIS and DCE provide tools and direction for this.
Abstract: The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
[84] A Cascaded Information Interaction Network for Precise Image Segmentation
Hewen Xiao,Jie Mei,Guangfu Ma,Weiren Wu
Main category: cs.CV
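TL;DR: This paper proposes a cascaded convolutional neural network with a novel Global Information Guidance Module that effectively fuses multi-level features and significantly improves image segmentation accuracy in complex scenes.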
Details
Motivation: Addresses the insufficient robustness of visual segmentation in complex scenes and overcomes the limitations of conventional methods in multi-scale feature extraction. Method: Designs a cascaded convolutional neural network integrated with a Global Information Guidance Module that fuses low-level texture details with high-level semantic features. Result: Experiments on benchmark datasets show that the method surpasses existing state-of-the-art methods in segmentation accuracy, particularly in cluttered or blurred environments. Conclusion: The proposed framework effectively improves visual segmentation performance and has potential for deployment in practical robotic applications. Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.
[85] GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
Mingyu Jeon,Sunjae Yoon,Jonghee Kim,Junyeoung Kim
Main category: cs.CV
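TL;DR: This paper proposes GranAlign, a training-free video moment retrieval framework that uses granularity-aware alignment to resolve the mismatch in semantic granularity between text queries and video content.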
Details
Motivation: Existing zero-shot video moment retrieval methods fail to balance semantic granularity across modalities, which leads to inaccurate retrieval. Method: Proposes the GranAlign framework, comprising granularity-based query rewriting and query-aware caption generation, to achieve multi-level semantic alignment. Result: Achieves state-of-the-art performance on three benchmarks (QVHighlights, Charades-STA, and ActivityNet-Captions), including a 3.23% mAP@avg improvement on QVHighlights. Conclusion: GranAlign effectively alleviates the semantic granularity mismatch and significantly improves zero-shot video moment retrieval. Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
[86] SafeMo: Linguistically Grounded Unlearning for Trustworthy Text-to-Motion Generation
Yiling Wang,Zeyu Zhang,Yiran Wang,Hao Tang
Main category: cs.CV
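TL;DR: This paper proposes SafeMo, a trustworthy text-to-motion generation framework operating in continuous space, which uses a Minimal Motion Unlearning (MMU) strategy to forget unsafe behaviors effectively while preserving natural motion transitions and benign performance.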
Details
Motivation: Existing text-to-motion safety methods based on discrete codebook replacement suffer from performance drift and incoherent motion; moreover, existing datasets contain unsafe content, so reliable safety-driven training data is lacking. Method: Proposes the SafeMo framework, which applies the two-stage machine unlearning strategy MMU to disentangle and optimize motion in continuous space, builds the first safe text-motion dataset, SafeMoVAE-29K, and is built on the DiP model to generate safe and natural motion. Result: On HumanML3D and Motion-X, the forget-set FID reaches 2.5x and 14.4x that of the current SOTA method LCR, respectively, while generation quality on safe prompts is maintained or improved. Conclusion: SafeMo achieves a better safety-utility trade-off, resolves the performance degradation and motion incoherence caused by discrete codebook replacement, and offers an effective solution for trustworthy human-computer interaction. Abstract: Text-to-motion (T2M) generation with diffusion backbones achieves strong realism and alignment. Safety concerns in T2M methods have been raised in recent years; existing methods replace discrete VQ-VAE codebook entries to steer the model away from unsafe behaviors. However, discrete codebook replacement-based methods have two critical flaws: firstly, replacing codebook entries which are reused by benign prompts leads to drifts on everyday tasks, degrading the model's benign performance; secondly, discrete token-based methods introduce quantization and smoothness loss, resulting in artifacts and jerky transitions. Moreover, existing text-to-motion datasets naturally contain unsafe intents and corresponding motions, making them unsuitable for safety-driven machine learning. To address these challenges, we propose SafeMo, a trustworthy motion generative framework integrating Minimal Motion Unlearning (MMU), a two-stage machine unlearning strategy, enabling safe human motion generation in continuous space, preserving continuous kinematics without codebook loss and delivering strong safety-utility trade-offs compared to current baselines. Additionally, we present the first safe text-to-motion dataset SafeMoVAE-29K integrating rewritten safe text prompts and continuous refined motion for trustworthy human motion unlearning. Built upon DiP, SafeMo efficiently generates safe human motions with natural transitions.
Experiments demonstrate effective unlearning performance of SafeMo by showing strengthened forgetting on unsafe prompts, reaching 2.5x and 14.4x higher forget-set FID on HumanML3D and Motion-X respectively, compared to the previous SOTA human motion unlearning method LCR, with benign performance on safe prompts being better or comparable. Code: https://github.com/AIGeeksGroup/SafeMo. Website: https://aigeeksgroup.github.io/SafeMo.
[87] Modality Dominance-Aware Optimization for Embodied RGB-Infrared Perception
Xianhui Liu,Siqi Jiang,Yi Xie,Yuqing Lin,Siao Liu
Main category: cs.CV
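TL;DR: This paper proposes a new RGB-infrared multimodal detection framework. It introduces the Modality Dominance Index (MDI) to quantify bias in cross-modal optimization and designs MDACL, a modality dominance-aware cross-modal learning framework with hierarchical cross-modal guidance and adversarial equilibrium regularization, which effectively mitigates modality imbalance and achieves state-of-the-art performance on three benchmarks.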
Details
Motivation: Existing cross-modal fusion methods for RGB-infrared detection have not fully explored the optimization bias caused by asymmetric information density and feature quality between modalities, which tends to make models over-rely on the dominant modality and limits fusion. Method: Proposes the Modality Dominance Index (MDI), which combines feature entropy and gradient contribution to quantify modality dominance; builds the MDACL framework on MDI, with Hierarchical Cross-modal Guidance (HCG) to strengthen feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics. Result: Experiments on three RGB-IR benchmarks show that MDACL effectively reduces optimization bias, improves fusion performance, and reaches the state of the art. Conclusion: By quantifying and regulating modality dominance, MDACL offers an effective way to address optimization imbalance in multimodal learning and improves the robustness and performance of RGB-IR cross-modal detection. Abstract: RGB-Infrared (RGB-IR) multimodal perception is fundamental to embodied multimedia systems operating in complex physical environments. Although recent cross-modal fusion methods have advanced RGB-IR detection, the optimization dynamics caused by asymmetric modality characteristics remain underexplored. In practice, disparities in information density and feature quality introduce persistent optimization bias, leading training to overemphasize a dominant modality and hindering effective fusion. To quantify this phenomenon, we propose the Modality Dominance Index (MDI), which measures modality dominance by jointly modeling feature entropy and gradient contribution. Based on MDI, we develop a Modality Dominance-Aware Cross-modal Learning (MDACL) framework that regulates cross-modal optimization. MDACL incorporates Hierarchical Cross-modal Guidance (HCG) to enhance feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion. Extensive experiments on three RGB-IR benchmarks demonstrate that MDACL effectively mitigates optimization bias and achieves SOTA performance.
[88] Noise-Robust Tiny Object Localization with Flows
Huixin Sun,Linlin Yang,Ronyu Chen,Kerui Gu,Baochang Zhang,Angela Yao,Xianbin Cao
Main category: cs.CV
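TL;DR: This paper proposes TOLF, a noise-robust localization framework for tiny object detection that uses flow-based modeling of the error distribution and uncertainty-guided optimization to mitigate overfitting caused by annotation noise.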
Details
Motivation: A significant performance gap remains for tiny object detection, mainly because tiny objects are sensitive to annotation noise and conventional strict localization objectives easily overfit to that noise. Method: Proposes Tiny Object Localization with Flows (TOLF), which uses normalizing flows for flexible error modeling to capture complex, non-Gaussian prediction distributions, and designs an uncertainty-aware gradient modulation mechanism that suppresses learning from high-uncertainty samples. Result: In extensive experiments on three datasets, TOLF improves the DINO baseline by 1.2% AP on AI-TOD. Conclusion: TOLF effectively improves the robustness and performance of tiny object detection, showing better noise resistance and training stability, especially under annotation noise. Abstract: Despite significant advances in generic object detection, a persistent performance gap remains for tiny objects compared to normal-scale objects. We demonstrate that tiny objects are highly sensitive to annotation noise, where optimizing strict localization objectives risks noise overfitting. To address this, we propose Tiny Object Localization with Flows (TOLF), a noise-robust localization framework leveraging normalizing flows for flexible error modeling and uncertainty-guided optimization. Our method captures complex, non-Gaussian prediction distributions through flow-based error modeling, enabling robust learning under noisy supervision. An uncertainty-aware gradient modulation mechanism further suppresses learning from high-uncertainty, noise-prone samples, mitigating overfitting while stabilizing training. Extensive experiments across three datasets validate our approach's effectiveness. In particular, TOLF boosts the DINO baseline by 1.2% AP on the AI-TOD dataset.
[89] RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation
Junxiao Xue,Pavel Smirnov,Ziao Li,Yunyun Shi,Shi Chen,Xinyi Yin,Xiaohan Yue,Lei Wang,Yiduo Wang,Feng Lin,Yijia Chen,Xiao Ma,Xiaoran Yan,Qing Zhang,Fengjian Xue,Xuecheng Wu
Main category: cs.CV
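TL;DR: Proposes RePose, a real-time 3D human pose estimation and motion analysis method for rehabilitation training that takes multi-view RGB video as input and provides fast tracking, smooth pose estimation, and real-time feedback.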
Details
Motivation: To monitor and evaluate patients' movements in real time during rehabilitation training and provide immediate feedback that corrects them, improving rehabilitation outcomes. Method: Designs a unified end-to-end pipeline for real-time 3D pose estimation from multi-camera RGB video; proposes a fast tracking method for scenes with multi-person interference; modifies SmoothNet to reduce pose estimation errors; uses the Unity platform to visualize motion state and display muscle stress. Result: Achieves tracking in under 1 ms per frame, improves the accuracy and smoothness of pose estimation, and provides real-time monitoring, evaluation, and muscle stress visualization in Unity. Conclusion: RePose can be applied effectively in rehabilitation training, supporting real-time motion correction and feedback and helping patients regain muscle strength and motor function. Abstract: We propose a real-time 3D human pose estimation and motion analysis method termed RePose for rehabilitation training. It is capable of real-time monitoring and evaluation of patients' motion during rehabilitation, providing immediate feedback and guidance to assist patients in executing rehabilitation exercises correctly. Firstly, we introduce a unified pipeline for end-to-end real-time human pose estimation and motion analysis using RGB video input from multiple cameras which can be applied to the field of rehabilitation training. The pipeline can help to monitor and correct patients' actions, thus aiding them in regaining muscle strength and motor functions. Secondly, we propose a fast tracking method for medical rehabilitation scenarios with multiple-person interference, which requires less than 1 ms for tracking for a single frame. Additionally, we modify SmoothNet for real-time posture estimation, effectively reducing pose estimation errors and restoring the patient's true motion state, making it visually smoother. Finally, we use the Unity platform for real-time monitoring and evaluation of patients' motion during rehabilitation, and to display the muscle stress conditions to assist patients with their rehabilitation training.
[90] HyperPriv-EPN: Hypergraph Learning with Privileged Knowledge for Ependymoma Prognosis
Shuren Gabriel Yu,Sikang Ren,Yongji Tian
Main category: cs.CV
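TL;DR: Proposes HyperPriv-EPN, a hypergraph-based learning-using-privileged-information framework that uses dual-stream distillation to transfer post-operative text knowledge to preoperative MRI diagnosis, enabling high-quality prognosis prediction without text input at inference.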
Details
Motivation: Preoperative prognosis of ependymoma is difficult because MRI lacks semantic information, and existing methods cannot make effective use of privileged post-operative text data. Method: Builds the hypergraph model HyperPriv-EPN with a Severed Graph Strategy: a shared encoder processes a Teacher graph containing privileged information and a Student graph containing only preoperative data, and dual-stream distillation lets the Student network reproduce semantic structure from visual features. Result: Validated on a multi-center cohort of 311 patients, HyperPriv-EPN reaches state-of-the-art diagnostic accuracy and survival stratification. Conclusion: The method transfers post-operative expert knowledge to preoperative diagnosis and uses historical data to improve prediction for new cases without text input. Abstract: Preoperative prognosis of Ependymoma is critical for treatment planning but challenging due to the lack of semantic insights in MRI compared to post-operative surgical reports. Existing multimodal methods fail to leverage this privileged text data when it is unavailable during inference. To bridge this gap, we propose HyperPriv-EPN, a hypergraph-based Learning Using Privileged Information (LUPI) framework. We introduce a Severed Graph Strategy, utilizing a shared encoder to process both a Teacher graph (enriched with privileged post-surgery information) and a Student graph (restricted to pre-operation data). Through dual-stream distillation, the Student learns to hallucinate semantic community structures from visual features alone. Validated on a multi-center cohort of 311 patients, HyperPriv-EPN achieves state-of-the-art diagnostic accuracy and survival stratification. This effectively transfers expert knowledge to the preoperative setting, unlocking the value of historical post-operative data to guide the diagnosis of new patients without requiring text at inference.
[91] Quality Detection of Stored Potatoes via Transfer Learning: A CNN and Vision Transformer Approach
Shrikant Kapse,Priyankkumar Dhrangdhariya,Priya Kedia,Manasi Patwardhan,Shankar Kausley,Soumyadipta Maiti,Beena Rai,Shirish Karande
Main category: cs.CV
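TL;DR: This study uses image-based deep learning models (such as DenseNet) to monitor potato quality during storage. The models detect sprouting, estimate weight loss, and predict shelf life with high accuracy, are suitable for automated sorting and inventory management, and can help reduce food waste.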
Details
Motivation: To address quality-management problems caused by sprouting, weight loss, and uncertain shelf life during potato storage, and to provide a non-invasive, scalable automated monitoring solution. Method: Collects potato images and weight data over 200 days in a controlled environment and uses pre-trained ResNet, VGG, DenseNet, and Vision Transformer models to build (1) a binary classifier for sprout detection and (2) a multi-class model for predicting weight loss and shelf life. Result: DenseNet reaches 98.03% accuracy in sprout detection; shelf-life prediction exceeds 89.83% accuracy with coarse class divisions (2-5 classes) and degrades with finer divisions; the models can support dynamic categorization and inventory management. Conclusion: Image-based deep learning enables non-destructive, low-cost assessment of potato storage quality and has potential for adoption across the supply chain; future work should build general models covering more varieties and conditions to improve adaptability. Abstract: Image-based deep learning provides a non-invasive, scalable solution for monitoring potato quality during storage, addressing key challenges such as sprout detection, weight loss estimation, and shelf-life prediction. In this study, images and corresponding weight data were collected over a 200-day period under controlled temperature and humidity conditions. Leveraging powerful pre-trained architectures of ResNet, VGG, DenseNet, and Vision Transformer (ViT), we designed two specialized models: (1) a high-precision binary classifier for sprout detection, and (2) an advanced multi-class predictor to estimate weight loss and forecast remaining shelf-life with remarkable accuracy. DenseNet achieved exceptional performance, with 98.03% accuracy in sprout detection. Shelf-life prediction models performed best with coarse class divisions (2-5 classes), achieving over 89.83% accuracy, while accuracy declined for finer divisions (6-8 classes) due to subtle visual differences and limited data per class. These findings demonstrate the feasibility of integrating image-based models into automated sorting and inventory systems, enabling early identification of sprouted potatoes and dynamic categorization based on storage stage. Practical implications include improved inventory management, differential pricing strategies, and reduced food waste across supply chains. While predicting exact shelf-life intervals remains challenging, focusing on broader class divisions ensures robust performance.
Future research should aim to develop generalized models trained on diverse potato varieties and storage conditions to enhance adaptability and scalability. Overall, this approach offers a cost-effective, non-destructive method for quality assessment, supporting efficiency and sustainability in potato storage and distribution.
[92] Reconstructing Building Height from Spaceborne TomoSAR Point Clouds Using a Dual-Topology Network
Zhaiyu Chen,Yuanyuan Wang,Yilei Shi,Xiao Xiang Zhu
Main category: cs.CV
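TL;DR: This paper proposes a learning-based dual-topology network framework for generating high-resolution building height maps from spaceborne SAR tomography (TomoSAR) point clouds. It is the first proof of concept for large-scale urban height mapping directly from TomoSAR point clouds, and accuracy improves further when optical remote sensing imagery is fused.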
Details
Motivation: TomoSAR point clouds often suffer from noise, anisotropic point distributions, and data voids on incoherent surfaces, which limit accurate building height reconstruction, so a robust method is needed to improve height estimates. Method: Proposes a dual-topology network with a point branch that handles irregular scatterer features and a grid branch that enforces spatial consistency; the two are jointly optimized to denoise and fill voids, producing continuous high-resolution height maps, and optical satellite imagery can be fused to enhance reconstruction. Result: Experiments on data from Munich and Berlin show that the method reconstructs building heights effectively; it is the first learning framework for large-scale urban height mapping directly from TomoSAR point clouds, and performance improves further with optical imagery. Conclusion: The framework offers a new approach to urban building height estimation from TomoSAR point clouds, with good extensibility and practical potential, and advances the use of SAR data in urban remote sensing. Abstract: Reliable building height estimation is essential for various urban applications. Spaceborne SAR tomography (TomoSAR) provides weather-independent, side-looking observations that capture facade-level structure, offering a promising alternative to conventional optical methods. However, TomoSAR point clouds often suffer from noise, anisotropic point distributions, and data voids on incoherent surfaces, all of which hinder accurate height reconstruction. To address these challenges, we introduce a learning-based framework for converting raw TomoSAR points into high-resolution building height maps. Our dual-topology network alternates between a point branch that models irregular scatterer features and a grid branch that enforces spatial consistency. By jointly processing these representations, the network denoises the input points and inpaints missing regions to produce continuous height estimates. To our knowledge, this is the first proof of concept for large-scale urban height mapping directly from TomoSAR point clouds. Extensive experiments on data from Munich and Berlin validate the effectiveness of our approach. Moreover, we demonstrate that our framework can be extended to incorporate optical satellite imagery, further enhancing reconstruction quality. The source code is available at https://github.com/zhu-xlab/tomosar2height.
[93] CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models
Neeraj Anand,Samyak Jha,Udbhav Bamba,Rahul Rahaman
Main category: cs.CV
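TL;DR: Proposes CRoPS, a training-free hallucination mitigation framework that selectively removes key text tokens and applies Generalized Contrastive Decoding, significantly reducing hallucinations across multiple benchmarks and large vision-language model families.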
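To make the idea of contrasting an original model against several "hallucinated" variants concrete, here is a minimal sketch. It extends the familiar two-model contrastive rule, adjusted = (1 + a) * original - a * hallucinated, to several hallucinated logit vectors with one weight each. This extension, the function names, and the toy logits are assumptions made for illustration; the source does not give the formula CRoPS uses.

```python
def contrastive_logits(original, hallucinated, weights):
    """Combine original logits with several hallucinated logit vectors.

    adjusted = (1 + sum(w)) * original - sum(w_k * hallucinated_k)
    This is an assumed generalization of two-model contrastive decoding,
    not the published CRoPS rule.
    """
    if len(hallucinated) != len(weights):
        raise ValueError("need one weight per hallucinated model")
    total = sum(weights)
    adjusted = []
    for i, logit in enumerate(original):
        # Tokens that the hallucinated models also favor are penalized.
        penalty = sum(w * h[i] for w, h in zip(weights, hallucinated))
        adjusted.append((1 + total) * logit - penalty)
    return adjusted


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy next-token logits over a three-token vocabulary (hypothetical values).
original = [2.0, 1.9, 0.1]     # full model with image and text
no_image = [2.5, 0.5, 0.1]     # variant with visual tokens removed
no_key_text = [2.2, 1.0, 0.2]  # variant with key text tokens removed
adjusted = contrastive_logits(original, [no_image, no_key_text], [0.5, 0.5])
print(argmax(original), argmax(adjusted))  # prints "0 1"
```

In this toy case token 0 is preferred by both degraded variants, so the contrast demotes it and the grounded alternative, token 1, is selected instead.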
Details
Motivation: Existing training-free methods for hallucination in large vision-language models rely on narrow assumptions and lose effectiveness late in generation. Method: Proposes a new hallucinated model that captures hallucination effects by selectively removing key text tokens, and introduces Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Result: CRoPS improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families. Conclusion: CRoPS effectively mitigates hallucination in large vision-language models and outperforms existing training-free methods. Abstract: Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
[94] Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
Melonie de Almeida,Daniela Ivanova,Tong Shi,John H. Williamson,Paul Henderson
Main category: cs.CV
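TL;DR: Proposes a framework for 4D video generation from a single image that uses a 3D Gaussian scene representation and a single forward pass to generate camera-controlled video efficiently and with high quality.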
Details
Motivation: Existing single-image video generation methods offer limited user control (such as camera path editing) and struggle to ensure temporal consistency, geometric integrity, and accurate motion modeling at the same time. Method: Builds a 3D Gaussian scene representation and jointly models camera motion and object motion in a single forward pass, generating 4D video directly without iterative denoising. Result: Achieves state-of-the-art video quality and inference efficiency on KITTI, Waymo, RealEstate10K, and DL3DV-10K. Conclusion: The method enables fast, controllable, temporally consistent video generation from a single image and advances controllable visual generation models toward practical use. Abstract: Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into rendered frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.
[95] Efficient Deep Demosaicing with Spatially Downsampled Isotropic Networks
Cory Fan,Wenchao Zhang
Main category: cs.CV
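TL;DR: This paper proposes introducing significant spatial downsampling into isotropic networks for image demosaicing to improve efficiency and performance, validates the idea with purpose-designed fully convolutional networks, and shows that the resulting JD3Net performs well across a range of tasks.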
Details
Motivation: Existing isotropic networks usually avoid spatial downsampling, which makes them computationally expensive and hard to deploy on mobile platforms; this paper examines how downsampling affects efficiency and performance. Method: Uses a mathematical architecture design technique adapted from DeepMAD to design fully convolutional networks with and without downsampling, and compares them experimentally. Result: Networks with spatial downsampling outperform those without in computational efficiency and on demosaicing and joint demosaicing-and-denoising tasks. Conclusion: Spatial downsampling improves the efficiency and performance of isotropic networks for image demosaicing and offers a better option for mobile applications. Abstract: In digital imaging, image demosaicing is a crucial first step which recovers the RGB information from a color filter array (CFA). Oftentimes, deep learning is utilized to perform image demosaicing. Given that most modern digital imaging applications occur on mobile platforms, applying deep learning to demosaicing requires lightweight and efficient networks. Isotropic networks, also known as residual-in-residual networks, have often been employed for image demosaicing and joint-demosaicing-and-denoising (JDD). Most demosaicing isotropic networks avoid spatial downsampling entirely, and thus are often prohibitively expensive computationally for mobile applications. Contrary to previous isotropic network designs, this paper claims that spatial downsampling to a significant degree can improve the efficiency and performance of isotropic networks. To validate this claim, we design simple fully convolutional networks with and without downsampling using a mathematical architecture design technique adapted from DeepMAD, and find that downsampling improves empirical performance. Additionally, empirical testing of the downsampled variant, JD3Net, of our fully convolutional networks reveals strong empirical performance on a variety of image demosaicing and JDD tasks.
[96] RGS-SLAM: Robust Gaussian Splatting SLAM with One-Shot Dense Initialization
Wei-Tse Cheng,Yen-Jen Chiou,Yuan-Fu Yang
Main category: cs.CV
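TL;DR: RGS-SLAM is a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization, yielding more stable and faster mapping.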
Details
Motivation: Conventional GS-SLAM adds Gaussians progressively based on residuals, which can cause uneven distribution and slow convergence in complex scenes, so a more efficient, structure-aware initialization is needed. Method: RGS-SLAM extracts dense multi-view correspondences from DINOv3 descriptors, refines them with a confidence-aware inlier classifier, and triangulates them in one shot to produce structure-aware Gaussian seeds as the starting point for optimization. Result: On TUM RGB-D and Replica, RGS-SLAM achieves better or comparable localization and reconstruction accuracy relative to state-of-the-art systems, with higher rendering fidelity and mapping about 20% faster, at up to 925 FPS. Conclusion: RGS-SLAM's training-free one-shot initialization significantly improves the stability, convergence speed, and reconstruction quality of the SLAM system while remaining compatible with existing GS-SLAM pipelines. Abstract: We introduce RGS-SLAM, a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization. Instead of progressively adding Gaussians as residuals reveal missing geometry, RGS-SLAM performs a one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors refined through a confidence-aware inlier classifier, generating a well-distributed and structure-aware Gaussian seed prior to optimization. This initialization stabilizes early mapping and accelerates convergence by roughly 20%, yielding higher rendering fidelity in texture-rich and cluttered scenes while remaining fully compatible with existing GS-SLAM pipelines. Evaluated on the TUM RGB-D and Replica datasets, RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared with state-of-the-art Gaussian and point-based SLAM systems, sustaining real-time mapping performance at up to 925 FPS.
[97] Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
Hao Guan,Li Zhou
Main category: cs.CV
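TL;DR: This paper studies the detection of performance degradation in vision-language models (VLMs) under data distribution shift and proposes combining input data shift detection with an output confidence indicator to monitor model reliability more dependably.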
Details
Motivation: A deployed VLM may degrade when the input data distribution shifts, and the lack of labeled data makes degradation hard to detect, so effective methods are needed to ensure clinical reliability. Method: Analyzes the roles of input-level data shift and output-level prediction behavior, develops the DomainSAT toolbox for systematic analysis of input data shift, and proposes a label-free, confidence-based indicator for output monitoring. Result: Experiments show that input shift detection gives early signals but does not always correspond to actual performance degradation, whereas the confidence-based indicator correlates closely with degradation; combining the two detects degradation more reliably. Conclusion: Combining input data shift detection with output confidence monitoring provides a practical, complementary framework for monitoring VLM reliability in digital pathology. Abstract: Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection.
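One simple way to picture a label-free, confidence-based indicator is to compare the model's average top-class probability on a reference batch with that on a deployment batch. The sketch below does exactly that; the function names, the use of maximum softmax probability, and the toy logits are assumptions for illustration, since the source does not define the indicator precisely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def mean_confidence(batch_logits):
    """Average top-class probability over a batch; no labels are needed."""
    return sum(max(softmax(l)) for l in batch_logits) / len(batch_logits)


def confidence_drop(reference_logits, deployed_logits):
    """Positive values mean the model is less confident on deployed data."""
    return mean_confidence(reference_logits) - mean_confidence(deployed_logits)


# Hypothetical two-class logits (for example tumor vs. normal).
reference = [[4.0, 0.0], [0.0, 5.0], [3.5, 0.5]]  # development-time data
shifted = [[0.6, 0.4], [0.2, 0.5], [1.0, 0.8]]    # data after a shift
drop = confidence_drop(reference, shifted)
print(drop > 0.3)  # prints True: confidence fell sharply on the shifted batch
```

A monitor of this kind would typically raise an alert when the drop exceeds a threshold calibrated on held-out data; the threshold choice is outside what the source describes.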
Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
[98] Multi-Level Feature Fusion for Continual Learning in Visual Quality Inspection
Johannes C. Bauer,Paul Geng,Stephan Trattnig,Petr Dokládal,Rüdiger Daub
Main category: cs.CV
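TL;DR: Proposes a multi-level feature fusion (MLFF) approach for efficient continual learning in manufacturing quality inspection that matches the performance of end-to-end training with far fewer trainable parameters.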
Details
Motivation: In dynamic settings such as remanufacturing, products and defect patterns change frequently, so deep neural networks must adapt to new conditions often and face catastrophic forgetting and low computational efficiency. Method: Builds a multi-level feature fusion (MLFF) approach on feature representations from different depths of a pretrained network to improve adaptability and computational efficiency. Result: MLFF matches end-to-end training on several quality inspection tasks with far fewer trainable parameters, while reducing catastrophic forgetting and improving generalization robustness to new products or defects. Conclusion: MLFF is an efficient and robust continual learning solution suited to dynamic industrial visual inspection, enabling fast model adaptation under resource constraints. Abstract: Deep neural networks show great potential for automating various visual quality inspection tasks in manufacturing. However, their applicability is limited in more volatile scenarios, such as remanufacturing, where the inspected products and defect patterns often change. In such settings, deployed models require frequent adaptation to novel conditions, effectively posing a continual learning problem. To enable quick adaptation, the necessary training processes must be computationally efficient while still avoiding effects like catastrophic forgetting. This work presents a multi-level feature fusion (MLFF) approach that aims to improve both aspects simultaneously by utilizing representations from different depths of a pretrained network. We show that our approach is able to match the performance of end-to-end training for different quality inspection problems while using significantly fewer trainable parameters. Furthermore, it reduces catastrophic forgetting and improves generalization robustness to new product types or defects.
[99] Grading Handwritten Engineering Exams with Multimodal Large Language Models
Janez Perš,Jon Muhovič,Andrej Košir,Boštjan Murovec
Main category: cs.CV
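TL;DR: Proposes an end-to-end workflow for automatically grading handwritten engineering quizzes with multimodal large language models; it preserves the traditional exam format and uses structured prompting and reference-solution grounding to improve grading accuracy and reliability.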
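The ensemble-plus-review idea in this entry can be pictured with a small sketch: several independent graders score the same answer, the scores are combined, and the answer is flagged for manual review when the graders disagree too much. Everything concrete here is an assumption for illustration: the `aggregate_grades` name, the median as a stand-in for supervisor aggregation, and the reading of `d_max` as a threshold on the spread between the highest and lowest grade.

```python
from statistics import median


def aggregate_grades(scores, d_max=40.0):
    """Return (final_score, needs_review) for one answer.

    scores: percentage grades (0-100) from independent graders.
    d_max:  assumed disagreement threshold; a spread above it triggers review.
    """
    if not scores:
        raise ValueError("at least one grader score is required")
    if any(not 0 <= s <= 100 for s in scores):
        raise ValueError("scores must lie in [0, 100]")
    spread = max(scores) - min(scores)
    # The median is a simple stand-in for the supervisor aggregation step.
    return median(scores), spread > d_max


print(aggregate_grades([70, 75, 80]))  # prints (75, False): graders agree
print(aggregate_grades([20, 65, 90]))  # prints (65, True): spread 70 > 40
```

A lower threshold sends more answers to a human and a higher one trusts the ensemble more, which is the trade-off a manual-review trigger rate summarizes.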
Details
Motivation: Manually grading handwritten STEM exams is time-consuming and hard to scale, while existing automated methods tend to restrict how students write or rely on electronic input, which does not fit real teaching settings. Method: Builds a multi-stage grading pipeline: a format/presence check prevents blank answers from being graded; a multimodal LLM converts the lecturer's handwritten reference solution into a text summary that conditions grading; an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation produce auditable reports. Result: Evaluated on a real course quiz in Slovenian (including hand-drawn circuit schematics) with GPT-5.2 and Gemini-3 Pro backends, the pipeline shows a mean absolute difference of about 8 points, low bias, and an estimated manual-review rate of about 17% at D_max = 40; ablations show that trivial prompting and removing the reference solution substantially reduce accuracy and cause systematic over-grading. Conclusion: Structured prompting and reference grounding are essential for reliable automated grading, and the pipeline supports automated grading of large-scale handwritten STEM exams without changing the traditional exam format. Abstract: Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves $\approx$8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of $\approx$17% at $D_{\max}=40$. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
[100] Unified Primitive Proxies for Structured Shape Completion
Zhaiyu Chen,Yuqing Wang,Xiao Xiang Zhu
Main category: cs.CV
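TL;DR: UniCo is a new structured shape completion method that decodes primitives in a dedicated pathway and predicts, in a unified way, a set of primitives with complete geometry, semantics, and inlier membership, significantly outperforming existing methods on synthetic and real data.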
Details
Motivation: Conventional cascaded methods are limited in how primitives and point clouds interact, and a more effective form of structured completion is needed to support primitive-based surface reconstruction. Method: Designs the UniCo framework with a dedicated decoding pathway over shared shape features, introduces learnable primitive proxies as queries, and uses online target updates to optimize primitives and points consistently. Result: Across several synthetic and real-world benchmarks with four independent assembly solvers, Chamfer distance drops by up to 50% and normal consistency improves by up to 7%. Conclusion: UniCo provides an efficient, unified solution for structured 3D understanding from incomplete data. Abstract: Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. These results establish an attractive recipe for structured 3D understanding from incomplete data. Project page: https://unico-completion.github.io.
[101] Fusion-SSAT: Unleashing the Potential of Self-supervised Auxiliary Task by Feature Fusion for Generalized Deepfake Detection
Shukesh Reddy,Srijan Das,Abhijit Das
Main category: cs.CV
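TL;DR: This paper explores using self-supervised learning as an auxiliary task to improve generalized deepfake detection, and finds that fusing the self-supervised feature representation effectively strengthens the generalization of the primary task.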
Details
Motivation: To improve the generalization of deepfake detection models in cross-dataset settings, the paper explores the potential of self-supervised learning as an auxiliary task. Method: Examines several combinations of training schemes that treat self-supervised learning as an auxiliary task, and fuses its feature representation into the primary task to optimize generalized deepfake detection. Result: Experiments on multiple large-scale datasets (such as DF40, FaceForensics++, and Celeb-DF) show that the proposed method outperforms current state-of-the-art detectors in cross-dataset evaluation. Conclusion: Fusing the feature representation from self-supervised learning is an effective way to improve the generalization of deepfake detection, combining the strengths of both tasks for better detection performance. Abstract: In this work, we attempted to unleash the potential of self-supervised learning as an auxiliary task that can optimise the primary task of generalised deepfake detection. To explore this, we examined different combinations of the training schemes for these tasks that can be most effective. Our findings reveal that fusing the feature representation from self-supervised auxiliary tasks is a powerful feature representation for the problem at hand. Such a representation can leverage the ultimate potential and bring in a unique representation of both the self-supervised and primary tasks, achieving better performance for the primary task. We experimented on a large set of datasets, which includes DF40, FaceForensics++, Celeb-DF, DFD, FaceShifter, UADFV, and our results showed better generalizability on cross-dataset evaluation when compared with current state-of-the-art detectors.
[102] Two Deep Learning Approaches for Automated Segmentation of Left Ventricle in Cine Cardiac MRI
Wenhui Chu,Nikolaos V. Tsekos
Main category: cs.CV
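TL;DR: Proposes two new deep learning architectures, LNU-Net and IBU-Net, for left ventricle segmentation in short-axis cine MRI images, outperforming existing methods.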
Details
Motivation: Left ventricle segmentation is critical for clinical quantification and diagnosis in cardiac imaging, and existing methods leave room for improvement. Method: Building on the U-Net architecture, the paper introduces layer normalization (LNU-Net) and instance-batch normalization (IBU-Net) within the convolutional blocks, and uses affine transformations and elastic deformations for data augmentation. Result: Experiments on a dataset of 805 MRI images show that the proposed methods outperform other state-of-the-art approaches in Dice coefficient and average perpendicular distance. Conclusion: LNU-Net and IBU-Net effectively improve left ventricle segmentation accuracy and show good potential for clinical use. Abstract: Left ventricle (LV) segmentation is critical for clinical quantification and diagnosis of cardiac images. In this work, we propose two novel deep learning architectures called LNU-Net and IBU-Net for left ventricle segmentation from short-axis cine MRI images. LNU-Net is derived from layer normalization (LN) U-Net architecture, while IBU-Net is derived from the instance-batch normalized (IB) U-Net for medical image segmentation. The architectures of LNU-Net and IBU-Net have a down-sampling path for feature extraction and an up-sampling path for precise localization. We use the original U-Net as the basic segmentation approach and compared it with our proposed architectures. Both LNU-Net and IBU-Net have left ventricle segmentation methods: LNU-Net applies layer normalization in each convolutional block, while IBU-Net incorporates instance and batch normalization together in the first convolutional block and passes its result to the next layer. Our method incorporates affine transformations and elastic deformations for image data processing. Our dataset that contains 805 MRI images regarding the left ventricle from 45 patients is used for evaluation. We experimentally evaluate the results of the proposed approaches outperforming the dice coefficient and the average perpendicular distance than other state-of-the-art approaches.
[103] AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Jiewen Chan,Zhenjun Zhao,Yu-Lun Liu
Main category: cs.CV
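TL;DR: This paper proposes AdaGaR, a unified framework for dynamic 3D scene reconstruction from monocular video that preserves high-frequency detail and keeps motion smooth through an adaptive Gabor representation and temporal continuity modeling.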