Table of Contents
cs.CL
[1] Quantum NLP models on Natural Language Inference
Ling Sun,Peter Sullivan,Michael Martin,Yun Zhou
Main category: cs.CL
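TL;DR: This paper studies the application of quantum natural language processing (QNLP) to natural language inference (NLI), proposes a new parameter-efficiency metric, IGPP, and shows that in a few-shot setting quantum models match or even exceed classical models while using fewer parameters.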
Details
Motivation: To explore the potential of quantum computing for natural language processing, particularly in low-resource, structure-sensitive tasks, by exploiting the compositionality of quantum circuits to improve model efficiency.
Method: Parameterized quantum circuits are built with the lambeq library and the DisCoCat framework for semantic relatedness and inference classification on sentence pairs; Information Gain per Parameter (IGPP) is introduced as a new metric of learning efficiency; and a new cluster-based architecture built on word clusters is proposed to promote parameter sharing.
Result: Quantum models outperform randomly initialized transformers on the inference task and achieve lower test error on the relatedness task; their per-parameter learning efficiency is up to five orders of magnitude higher than that of classical models; and the proposed cluster-based architecture improves generalization.
Conclusion: Quantum NLP models offer clear advantages in few-shot, low-resource settings, showing higher parameter efficiency and stronger structural modeling, and point to a promising direction for future lightweight semantic models.
Abstract: Quantum natural language processing (QNLP) offers a novel approach to semantic modeling by embedding compositional structure directly into quantum circuits. This paper investigates the application of QNLP models to the task of Natural Language Inference (NLI), comparing quantum, hybrid, and classical transformer-based models under a constrained few-shot setting. Using the lambeq library and the DisCoCat framework, we construct parameterized quantum circuits for sentence pairs and train them for both semantic relatedness and inference classification. To assess efficiency, we introduce a novel information-theoretic metric, Information Gain per Parameter (IGPP), which quantifies learning dynamics independent of model size. Our results demonstrate that quantum models achieve performance comparable to classical baselines while operating with dramatically fewer parameters. The quantum models outperform randomly initialized transformers in inference and achieve lower test error on relatedness tasks. Moreover, quantum models exhibit significantly higher per-parameter learning efficiency (up to five orders of magnitude more than classical counterparts), highlighting the promise of QNLP in low-resource, structure-sensitive settings. To address circuit-level isolation and promote parameter sharing, we also propose a novel cluster-based architecture that improves generalization by tying gate parameters to learned word clusters rather than individual tokens.
[2] Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus
Md Kamrul Siam,Md Jobair Hossain Faruk,Jerry Q. Cheng,Huanying Gu
Main category: cs.CL
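TL;DR: This study proposes a multi-model fusion framework based on ChatGPT and Claude that uses similarity-based consensus and multimodal inputs to make chest X-ray diagnosis more reliable, substantially improving accuracy on the CheXpert dataset.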
Details
Motivation: To improve the trustworthiness and clinical utility of AI in radiological diagnosis and reduce diagnostic errors by exploring the synergy between multi-model fusion and multimodal inputs.
Method: Unimodal performance is evaluated with image-only prompts, and a similarity-based consensus mechanism is introduced; synthetic clinical notes following the MIMIC-CXR template are generated to build multimodal inputs, on which both models are then evaluated.
Result: In the unimodal setting, Claude reaches 76.9% accuracy and ChatGPT 62.8%, and consensus raises accuracy to 77.6%; in the multimodal setting, ChatGPT reaches 84% and Claude 76%, and consensus accuracy reaches 91.3%.
Conclusion: Consensus-based fusion combined with multimodal inputs improves diagnostic accuracy and reliability, offering a practical, low-overhead way to reduce errors in AI-assisted diagnosis.
Abstract: This study presents a novel multi-model fusion framework leveraging two state-of-the-art large language models (LLMs), ChatGPT and Claude, to enhance the reliability of chest X-ray interpretation on the CheXpert dataset. From the full CheXpert corpus of 224,316 chest radiographs, we randomly selected 234 radiologist-annotated studies to evaluate unimodal performance using image-only prompts. In this setting, ChatGPT and Claude achieved diagnostic accuracies of 62.8% and 76.9%, respectively. A similarity-based consensus approach, using a 95% output similarity threshold, improved accuracy to 77.6%. To assess the impact of multimodal inputs, we then generated synthetic clinical notes following the MIMIC-CXR template and evaluated a separate subset of 50 randomly selected cases paired with both images and synthetic text. On this multimodal cohort, performance improved to 84% for ChatGPT and 76% for Claude, while consensus accuracy reached 91.3%. Across both experimental conditions, agreement-based fusion consistently outperformed individual models. These findings highlight the utility of integrating complementary modalities and using output-level consensus to improve the trustworthiness and clinical utility of AI-assisted radiological diagnosis, offering a practical path to reduce diagnostic errors with minimal computational overhead.
[3] Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
Guiyao Tie,Zenghui Yuan,Zeli Zhao,Chaoran Hu,Tianhe Gu,Ruihang Zhang,Sizhe Zhang,Junran Wu,Xiaoyue Tu,Ming Jin,Qingsong Wen,Lixing Chen,Pan Zhou,Lichao Sun
Main category: cs.CL
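TL;DR: This study introduces CorrectBench, a benchmark for evaluating how effective LLM self-correction methods (intrinsic, external, and fine-tuned) are on commonsense reasoning, mathematical reasoning, and code generation. Self-correction improves accuracy on complex reasoning tasks; mixed strategies perform better but are less efficient; reasoning-specialized models such as DeepSeek-R1 gain little from additional self-correction and incur high time costs; by contrast, a simple chain-of-thought (CoT) baseline is competitive in both accuracy and efficiency. The study highlights the potential of self-correction for improving LLM reasoning, notes that efficiency remains a challenge, and calls for further research on balancing reasoning capability and runtime efficiency.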
Details
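Motivation: Although many LLM self-correction methods have been proposed, they have not been evaluated systematically, and whether LLMs can truly correct themselves remains an open question. A comprehensive benchmark is therefore needed to measure the effectiveness of different self-correction strategies and to examine their performance-efficiency trade-offs across reasoning tasks.
Method: The authors build CorrectBench, an evaluation benchmark covering three self-correction paradigms (intrinsic self-correction, self-correction driven by external feedback, and fine-tuning-based self-correction), and run experiments on three representative reasoning tasks (commonsense reasoning, mathematical reasoning, and code generation), comparing the performance of multiple self-correction methods, the effect of combining strategies, and time costs.
Result: 1) Self-correction methods improve LLM accuracy on complex reasoning tasks; 2) mixing several self-correction strategies improves performance further but sacrifices inference efficiency; 3) models designed for reasoning (e.g., DeepSeek-R1) improve only slightly with additional self-correction and incur high time costs; 4) a simple chain-of-thought (CoT) baseline performs well in both accuracy and efficiency and is highly competitive.
Conclusion: Self-correction does help LLM reasoning performance, especially on complex tasks, but current methods face significant efficiency challenges. Advanced models gain little from self-correction, while simple methods such as CoT already perform well. Future work should focus on balancing the efficiency and performance of self-correction to enable more efficient, practical reasoning enhancement.
Abstract: Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLMs' reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency.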
Project Page: https://correctbench.github.io/
[4] EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
Rong Wu,Xiaoman Wang,Jianbiao Mei,Pinlong Cai,Daocheng Fu,Cheng Yang,Licheng Wen,Xuemeng Yang,Yufan Shen,Yuxin Wang,Botian Shi
Main category: cs.CL
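TL;DR: This paper proposes EvolveR, a framework whose closed-loop experience lifecycle of offline self-distillation and online interaction lets LLM agents keep improving from their own experience; it outperforms existing methods on multi-hop question answering.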
Details
Motivation: Existing LLM agents perform well at tool use but cannot systematically learn from their own experience or iteratively refine their strategies; they rely mainly on external knowledge and neglect mechanisms for self-improvement.
Method: EvolveR has two stages: 1) offline self-distillation, which distills interaction trajectories into reusable strategic principles; and 2) online interaction, which retrieves the distilled principles to guide decisions and accumulates new behavioral trajectories. A policy reinforcement mechanism closes the loop by updating the agent.
Result: On complex multi-hop question-answering benchmarks, EvolveR clearly outperforms strong agentic baselines, supporting the effectiveness of its self-improvement loop.
Conclusion: EvolveR offers a workable blueprint for autonomous, continuously evolving agents that learn from the consequences of their own actions.
Abstract: Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agents to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.
[5] Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification
Binglan Han,Anuradha Mathrani,Teo Susnjak
Main category: cs.CL
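TL;DR: This study systematically evaluates six large language models under five prompting strategies for automating the screening stage of systematic literature reviews. Models and prompting strategies interact strongly, and a staged workflow is recommended to balance cost and performance.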
Details
Motivation: To quantify how prompting strategies and LLMs interact in automated literature screening, and to improve the efficiency and accuracy of systematic literature reviews.
Method: Six LLMs (including GPT-4o and DeepSeek) are evaluated under five prompt types (including zero-shot, few-shot, and chain-of-thought) on relevance classification and Level-2 tasks, using accuracy, precision, recall, and F1, together with a cost-performance analysis.
Result: CoT-few-shot prompting gives the best balance between precision and recall; zero-shot prompting gives the highest recall and suits high-sensitivity first-pass screening; self-reflection prompting performs poorly; GPT-4o and DeepSeek are robust overall, and GPT-4o-mini performs well at low cost. Structured prompts on GPT-4o-mini deliver good F1 scores for a small increase in cost.
Conclusion: LLMs show uneven but considerable potential for literature screening. By analyzing model-prompt interactions, the study provides a comparative benchmark and practical advice for task-adaptive deployment: use low-budget models with structured prompts for first-pass screening and pass only ambiguous cases to higher-capacity models.
Abstract: This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews (SLRs). We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types (zero-shot, few-shot, chain-of-thought (CoT), CoT-few-shot, self-reflection) across relevance classification and six Level-2 tasks, using accuracy, precision, recall, and F1. Results show pronounced model-prompt interaction effects: CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models. GPT-4o and DeepSeek provide robust overall performance, while GPT-4o-mini performs competitively at a substantially lower dollar cost. A cost-performance analysis for relevance classification (per 1,000 abstracts) reveals large absolute differences among model-prompt pairings; GPT-4o-mini remains low-cost across prompts, and structured prompts (CoT/CoT-few-shot) on GPT-4o-mini offer attractive F1 at a small incremental cost. We recommend a staged workflow that (1) deploys low-cost models with structured prompts for first-pass screening and (2) escalates only borderline cases to higher-capacity models. These findings highlight LLMs' uneven but promising potential to automate literature screening.
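The staged workflow recommended in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `cheap_model` and `strong_model` are hypothetical callables that return a relevance probability in [0, 1], and the 0.3/0.7 borderline band is an assumed pair of thresholds that does not come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: takes an abstract, returns P(relevant) in [0, 1].
Scorer = Callable[[str], float]


def staged_screen(
    abstracts: List[str],
    cheap_model: Scorer,
    strong_model: Scorer,
    low: float = 0.3,   # assumed lower edge of the borderline band
    high: float = 0.7,  # assumed upper edge of the borderline band
) -> List[Tuple[str, bool, str]]:
    """Return (abstract, include?, stage) for each abstract."""
    decisions = []
    for text in abstracts:
        score = cheap_model(text)  # stage 1: low-cost first pass
        if low <= score <= high:
            # Stage 2: only borderline cases reach the stronger model.
            decisions.append((text, strong_model(text) >= 0.5, "escalated"))
        else:
            decisions.append((text, score > high, "first-pass"))
    return decisions


if __name__ == "__main__":
    # Toy stand-ins for real LLM calls, for illustration only.
    cheap = lambda t: 0.9 if "LLM" in t else (0.5 if "model" in t else 0.1)
    strong = lambda t: 0.8 if "language" in t else 0.2
    docs = ["LLM screening study", "A language model survey", "Soil chemistry"]
    for _, include, stage in staged_screen(docs, cheap, strong):
        print(include, stage)
```

Under these assumptions, the cost saving comes from how narrow the borderline band is: the wider the band, the more abstracts are sent to the expensive model.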
By systematically analyzing prompt-model interactions, we provide a comparative benchmark and practical guidance for task-adaptive LLM deployment.
[6] Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization
Tina Behnia,Puneesh Deora,Christos Thrampoulidis
Main category: cs.CL
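TL;DR: This paper introduces a flexible synthetic testbed for studying the interplay between statistical regularities and factual associations in language models. It finds that contextual diversity and structure strongly affect in-distribution and out-of-distribution factual generalization, and shows that the embedding and unembedding layers play a key role in optimization bottlenecks.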
Details
Motivation: Prior work lacks a systematic analysis of how statistical regularities and factual associations interact in language models, in particular how different levels of contextual diversity affect generalization.
Method: The authors build a synthetic testbed that combines a statistical stream of generic tokens with a factual stream of source-target token pairs, control the interaction by adjusting stream composition and diversity level, and run controlled experiments and interventions on model components.
Result: Higher contextual diversity delays in-distribution accuracy, but its effect on out-of-distribution generalization depends on contextual structure: in some cases diversity does not help generalization, and in others it is essential. The optimal diversity level depends on training duration; some structures cause statistical or factual generalization to fail; and OOD failures trace back to optimization bottlenecks in the embedding and unembedding layers.
Conclusion: The interplay between contextual design and diversity level substantially affects different aspects of language model generalization, and the synthetic framework provides a controllable experimental setting for future research.
Abstract: Language models are pretrained on sequences that blend statistical regularities (making text fluent) with factual associations between specific tokens (knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. The design enables independent control of the nature of diversity, by manipulating stream composition (contextual structure), and of the diversity level, by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual recall. Even when low diversity prohibits factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade. This shows how the interplay between contextual design and diversity level impacts different generalization aspects.
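To make the two-stream idea in this abstract concrete, here is a toy sketch of how such a corpus might be generated. It is only an illustration under loud assumptions: the function name `make_corpus`, the per-stream vocabulary, the sequence layout (generic context followed by the source-target pair), and the example facts are all hypothetical and are not taken from the paper, whose actual construction is likely more elaborate.

```python
import random
from typing import List, Tuple


def make_corpus(
    facts: List[Tuple[str, str]],
    num_streams: int,
    diversity: int,
    context_len: int = 5,
    seed: int = 0,
) -> List[List[str]]:
    """Build toy sequences: generic context tokens followed by a fact.

    diversity = number of distinct statistical streams each fact appears in
    (an assumed reading of "diversity level").
    """
    if not 1 <= diversity <= num_streams:
        raise ValueError("diversity must be between 1 and num_streams")
    rng = random.Random(seed)
    corpus = []
    for source, target in facts:
        # Diversity level: pick which streams carry this fact.
        for stream_id in rng.sample(range(num_streams), diversity):
            # Statistical stream: generic tokens from a per-stream vocabulary.
            context = [
                f"s{stream_id}_w{rng.randrange(3)}" for _ in range(context_len)
            ]
            corpus.append(context + [source, target])
    return corpus


if __name__ == "__main__":
    # Hypothetical facts; the paper uses abstract source-target tokens.
    facts = [("Paris", "France"), ("Cairo", "Egypt")]
    for seq in make_corpus(facts, num_streams=4, diversity=2):
        print(" ".join(seq))
```

In a setup like this, an out-of-distribution probe would present a fact's source token inside a stream it never co-occurred with during training.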
Further, through a series of controlled interventions on the model components, we trace the OOD failures to distinct optimization bottlenecks, highlighting the importance of the embedding and unembedding layers. Our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, offering a controlled testbed for future investigations.
[7] In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions
Aria Pessianzadeh,Naima Sultana,Hildegarde Van den Bulck,David Gefen,Shahin Jabari,Rezvaneh Rezapour
Main category: cs.CL
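TL;DR: This study is the first to use computational methods to analyze trust and distrust in generative AI on Reddit from 2022 to 2025, using large-scale data and classification models to reveal how public attitudes change over time.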
Details
Motivation: Existing research on trust in AI relies largely on psychology and human-computer interaction perspectives and lacks scalable, longitudinal computational methods for measuring public trust and distrust in generative AI and LLMs.
Method: A large-scale longitudinal analysis of a multi-year dataset of 197,618 posts from 39 subreddits (2022-2025), combining crowd-sourced annotation with classification models.
Result: Trust and distrust are roughly balanced overall, with fluctuations around major model releases; technical performance and usability are the dominant dimensions, and personal experience is the most common reason behind attitudes; different groups (e.g., experts, ethicists, general users) show distinct trust patterns.
Conclusion: The study provides a framework for large-scale analysis of trust in AI and traces how public perceptions of generative AI evolve, offering an empirical basis for responsible AI governance.
Abstract: The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them also becomes essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of Trust and Distrust in GenAI, using a multi-year Reddit dataset (2022-2025) spanning 39 subreddits and 197,618 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that Trust and Distrust are nearly balanced over time, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, general users). Our results provide a methodological framework for large-scale Trust analysis and insights into evolving public perceptions of GenAI.
[8] EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture
Mohamed Gamil,Abdelrahman Elsayed,Abdelrahman Lila,Ahmed Gad,Hesham Abdelgawad,Mohamed Aref,Ahmed Fares
Main category: cs.CL
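TL;DR: This paper presents EgMM-Corpus, a multimodal dataset focused on Egyptian culture with more than 3,000 images covering 313 concepts across landmarks, food, and folklore, intended for evaluating and training vision-language models and for exposing cultural bias in existing models.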
Details
Motivation: Because multimodal, culturally diverse datasets are scarce for the Middle East and Africa, the authors set out to build a high-quality, culturally authentic multimodal dataset focused on Egyptian culture to support the development of culturally aware models.
Method: A new data collection pipeline is designed and run to gather more than 3,000 images covering 313 cultural concepts; every entry is manually validated for cultural authenticity and multimodal coherence; and CLIP is evaluated on zero-shot classification.
Result: On EgMM-Corpus, CLIP achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy, indicating substantial cultural bias in current vision-language models.
Conclusion: EgMM-Corpus is a reliable multimodal resource on Egyptian culture for training and evaluating culturally aware vision-language models, and can help mitigate cultural bias in mainstream AI models.
Abstract: Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images, covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
[9] What Can String Probability Tell Us About Grammaticality?
Jennifer Hu,Ethan Gotlieb Wilcox,Siyuan Song,Kyle Mahowald,Roger P. Levy
Main category: cs.CL
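TL;DR: This paper examines what language models (LMs) learn about grammar, proposes a theoretical framework based on assumptions about how corpus data are generated, and validates three predictions on 280K minimal pairs in English and Chinese, giving a theoretical basis for using probability to study LMs' structural knowledge.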
Details
Motivation: Because probability and grammaticality are distinct notions in linguistics, it is disputed whether an LM's string probabilities reflect its grammatical knowledge; this paper aims to clarify the issue.
Method: Starting from simple assumptions about the generative process of corpus data, the authors build a theoretical framework relating grammar, meaning, and string probability, and validate it empirically on 280K English and Chinese minimal pairs.
Result: The framework makes three testable predictions: (1) the probabilities of the strings within a minimal pair are correlated; (2) models' and humans' deltas within minimal pairs are correlated; and (3) unpaired grammatical and ungrammatical strings are poorly separated in probability space. The experiments support all three.
Conclusion: The study gives a theoretical foundation for using probability to probe LMs' structural knowledge and points to directions for future work on grammatical evaluation of LMs.
Abstract: What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM's underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models' and humans' deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs' structural knowledge, and suggest directions for future work in LM grammatical evaluation.
[10] Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback
Chu Fei Luo,Samuel Dahan,Xiaodan Zhu
Main category: cs.CL
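TL;DR: This paper proposes two methods, pluralistic decoding and model steering, to improve language models' alignment to diverse perspectives in a low-resource setting; they reduce false positives in high-stakes tasks such as hate speech and misinformation detection and improve distributional alignment to human values.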
Details
Motivation: Current training paradigms for language models usually assume one optimal answer per query, which produces generic responses that fail to reflect the diversity and nuance of human values; models therefore need better alignment to diverse perspectives.
Method: Pluralistic decoding and model steering are proposed and combined, with experiments in a low-resource setting of only 50 annotated samples, to improve the diversity and value alignment of model outputs.
Result: Model steering consistently improves over zero-shot and few-shot baselines; the methods lower false-positive rates on several high-stakes tasks and improve distributional alignment to human values on GlobalOpinionQA.
Conclusion: Language models should reflect diverse and nuanced human values; the proposed methods show that pluralistic alignment can be improved even in low-resource settings, underscoring the importance of diversity in model design.
Abstract: As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improve the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.
[11] Instant Personalized Large Language Model Adaptation via Hypernetwork
Zhaoxuan Tan,Zixuan Zhang,Haoyang Wen,Zheng Li,Rongzhi Zhang,Pei Chen,Fengran Mo,Zheyuan Liu,Qingkai Zeng,Qingyu Yin,Meng Jiang
Main category: cs.CL
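TL;DR: Proposes Profile-to-PEFT, a framework that uses a hypernetwork to map user profiles directly to adapter parameters, enabling efficient LLM personalization without per-user training.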
Details
Motivation: Existing LLM personalization methods such as OPPU train a separate adapter for each user, which is computationally expensive, hard to update in real time, and does not scale.
Method: A hypernetwork trained end-to-end maps an encoded user profile directly to a full set of adapter parameters (e.g., LoRA), enabling zero-shot generalization to new users and instant adaptation.
Result: The method uses substantially fewer computational resources at deployment than OPPU, outperforms both prompt-based personalization and OPPU, and is robust and generalizes well across out-of-distribution users, varying activity levels, and different embedding backbones.
Conclusion: Profile-to-PEFT enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.
Abstract: Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the "One-PEFT-Per-User" (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.
[12] Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models
Pratham Singla,Shivank Garg,Ayush Singh,Ishan Garg,Ketan Suhaas Saichandran
Main category: cs.CL
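TL;DR: This paper studies the self-awareness of post-trained LLMs on complex logical tasks, defines three core competencies, and compares models trained with SFT, DPO, and GRPO. RL-trained models show stronger policy awareness and generalization, but the alignment between their reasoning traces and final outputs is weak, most noticeably in GRPO models.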
Details
Motivation: To examine whether LLMs really "understand" the policies and thought processes they have learned, and to assess their self-awareness after generating planning tokens.
Method: Three core competencies are defined: awareness of learned latent policies, generalization across domains, and consistency between internal reasoning traces and final outputs. They are evaluated empirically on several tasks that each require a different policy, comparing SFT-, DPO-, and GRPO-trained models.
Result: RL-trained models (such as DPO and GRPO) show stronger policy awareness and cross-task generalization than SFT models, but weaker alignment between reasoning traces and final answers, most markedly for GRPO.
Conclusion: Although RL post-training improves policy learning and generalization, the mismatch between internal reasoning and outputs suggests that current models may not truly "understand" their own thinking, which poses challenges for interpretability and reliability.
Abstract: Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models not only demonstrate greater awareness of their learned behaviors and stronger generalizability to novel, structurally similar tasks than SFT models but also often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.
[13] Utilising Large Language Models for Generating Effective Counter Arguments to Anti-Vaccine Tweets
Utsav Dhanuka,Soham Poddar,Saptarshi Ghosh
Main category: cs.CL
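TL;DR: This study explores using LLMs to generate effective real-time counter-arguments to vaccine misinformation, pairs them with classifiers that assign multiple labels to anti-vaccine tweets, and validates the approach with several forms of evaluation.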
Details
Motivation: To counter vaccine misinformation that spreads widely on social media, raise public trust in vaccines and vaccination rates, and fill the gap in existing work on real-time, tailored debunking.
Method: Different prompting strategies and fine-tuning approaches are used to improve LLM counter-argument generation, and classifiers are trained to assign anti-vaccine tweets multiple topic labels (e.g., side effects, efficacy, political influence) so that responses can be context-aware.
Result: Human evaluation, LLM-based evaluation, and automatic metrics agree closely, and the results show that combining label descriptions with structured fine-tuning markedly improves rebuttals.
Conclusion: LLM methods that integrate multi-label classification with structured fine-tuning have strong potential for countering vaccine misinformation at scale.
Abstract: In an era where public health is increasingly influenced by information shared on social media, combatting vaccine skepticism and misinformation has become a critical societal goal. Misleading narratives around vaccination have spread widely, creating barriers to achieving high immunisation rates and undermining trust in health recommendations. While efforts to detect misinformation have made significant progress, the generation of real-time counter-arguments tailored to debunk such claims remains an insufficiently explored area. In this work, we explore the capabilities of LLMs to generate sound counter-argument rebuttals to vaccine misinformation. Building on prior research in misinformation debunking, we experiment with various prompting strategies and fine-tuning approaches to optimise counter-argument generation. Additionally, we train classifiers to categorise anti-vaccine tweets into multi-labeled categories such as concerns about vaccine efficacy, side effects, and political influences, allowing for more context-aware rebuttals. Our evaluation, conducted through human judgment, LLM-based assessments, and automatic metrics, reveals strong alignment across these methods. Our findings demonstrate that integrating label descriptions and structured fine-tuning enhances counter-argument effectiveness, offering a promising approach for mitigating vaccine misinformation at scale.
[14] End-to-End Argument Mining through Autoregressive Argumentative Structure Prediction
Nilmadhab Das,Vishal Vaibhav,Yash Sunil Choudhary,V. Vijaya Saradhi,Ashish Anand
Main category: cs.CL
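TL;DR: This paper proposes Autoregressive Argumentative Structure Prediction (AASP), an end-to-end framework that jointly models argument components and relations in argument mining by building the argumentative structure step by step from a predefined set of actions; it achieves state-of-the-art or strong results on several benchmarks.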
Details
Motivation: Reasoning in argument mining is complex, so modeling the dependencies between argument components and relations is challenging; most existing approaches use a generative paradigm that flattens the structure and struggles to capture the flow of argumentation.
Method: AASP, built on a conditional pre-trained language model, models argumentative structures as a constrained, predefined set of actions and generates argument components and their relations step by step, giving end-to-end joint modeling.
Result: On three standard argument mining benchmarks, AASP achieves state-of-the-art results on two and strong results on the third.
Conclusion: AASP captures the flow of argumentative reasoning and, through autoregressive structure prediction, markedly improves argument mining performance; it is an efficient, unified end-to-end approach to complex argumentative structures.
Abstract: Argument Mining (AM) helps in automating the extraction of complex argumentative structures such as Argument Components (ACs) like Premise, Claim, etc. and Argumentative Relations (ARs) like Support, Attack, etc. in an argumentative text. Due to the inherent complexity of reasoning involved with this task, modelling dependencies between ACs and ARs is challenging. Most of the recent approaches formulate this task through a generative paradigm by flattening the argumentative structures. In contrast, this study jointly formulates the key tasks of AM in an end-to-end fashion using the Autoregressive Argumentative Structure Prediction (AASP) framework. The proposed AASP framework is based on the autoregressive structure prediction framework that has given good performance for several NLP tasks. The AASP framework models the argumentative structures as constrained pre-defined sets of actions with the help of a conditional pre-trained language model. These actions build the argumentative structures step-by-step in an autoregressive manner to capture the flow of argumentative reasoning in an efficient way. Extensive experiments conducted on three standard AM benchmarks demonstrate that AASP achieves state-of-the-art (SoTA) results across all AM tasks in two benchmarks and delivers strong results in one benchmark.
[15] Navigating through the hidden embedding space: steering LLMs to improve mental health assessment
Federico Ravenda,Seyed Ali Bahrainian,Andrea Raballo,Antonietta Mira
Main category: cs.CL
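TL;DR: Proposes a lightweight, efficient method that uses a linear transformation and steering vectors to improve LLM performance on mental health assessment without heavy computation, with gains on both relevance prediction and depression screening questionnaire completion.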
Details
Motivation: Despite rapid progress in LLMs, smaller models still underperform in specific domains such as mental health, so low-cost, effective methods for improving them are needed.
Method: A lightweight linear transformation is applied to the activations of a specific layer, using steering vectors to guide model outputs and achieve domain adaptation.
Result: The method improves performance on two tasks: 1) deciding whether a Reddit post is useful for detecting depressive symptoms; and 2) completing a standardized depression screening questionnaire from a user's post history.
Conclusion: Steering is a computationally efficient tool with considerable potential for adapting LLMs to the mental health domain.
Abstract: The rapid evolution of Large Language Models (LLMs) is transforming AI, opening new opportunities in sensitive and high-impact areas such as Mental Health (MH). Yet, despite these advancements, recent evidence reveals that smaller-scale models still struggle to deliver optimal performance in domain-specific applications. In this study, we present a cost-efficient yet powerful approach to improve MH assessment capabilities of an LLM, without relying on any computationally intensive techniques. Our lightweight method consists of a linear transformation applied to a specific layer's activations, leveraging steering vectors to guide the model's output. Remarkably, this intervention enables the model to achieve improved results across two distinct tasks: (1) identifying whether a Reddit post is useful for detecting the presence or absence of depressive symptoms (relevance prediction task), and (2) completing a standardized psychological screening questionnaire for depression based on users' Reddit post history (questionnaire completion task). Results highlight the untapped potential of steering mechanisms as computationally efficient tools for LLMs' MH domain adaptation.
[16] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Yu Ying Chiu,Michael S. Lee,Rachel Calcott,Brandon Handoko,Paul de Font-Reaulx,Paula Rodriguez,Chen Bo Calvin Zhang,Ziwen Han,Udari Madhushani Sehwag,Yash Maurya,Christina Q Knight,Harry R. Lloyd,Florence Bacus,Mantas Mazeika,Bing Liu,Yejin Choi,Mitchell L Gordon,Sydney Levine
Main category: cs.CL
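TL;DR: This paper introduces two benchmarks, MoReBench and MoReBench-Theory, for evaluating language models' procedural reasoning in moral dilemmas, and finds that existing models show partiality in moral reasoning and that conventional scaling laws do not predict their performance.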
Details
Motivation: As AI takes part in more decisions, ensuring those decisions align with human values is critical. Understanding how AI reasons about morality becomes a key question, yet existing benchmarks mostly assess outcomes rather than reasoning, so a testbed focused on process is needed.
Method: MoReBench comprises 1,000 moral scenarios, each with expert-written rubric criteria (over 23 thousand in total) covering identifying moral considerations, weighing trade-offs, and giving recommendations; MoReBench-Theory (150 examples) tests reasoning under five frameworks in normative ethics. Mainstream language models are evaluated and their reasoning analyzed.
Result: Scaling laws for performance on math, code, and similar tasks do not predict performance on moral reasoning; models show partiality towards particular moral frameworks (e.g., Benthamite act utilitarianism and Kantian deontology), which may be a side effect of current training paradigms.
Conclusion: The MoReBench benchmarks advance process-focused evaluation of AI reasoning, support safer and more transparent AI systems, and underline the need for evaluation methods designed specifically for moral reasoning.
Abstract: As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms.
Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
[17] ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents
David Peer,Sebastian Stabinger
Main category: cs.CL
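TL;DR: This paper proposes Autonomous Trustworthy Agents (ATA), a neuro-symbolic approach that improves the trustworthiness of large language models in high-stakes domains by decoupling tasks into two phases: offline knowledge ingestion and online task processing.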
Details
Motivation: 大语言模型存在幻觉、不稳定性和缺乏透明度等问题,限制了其在关键场景中的应用,因此需要一种更可靠、可验证的方法来增强其信任度。 Method: 将任务分为两个阶段:离线阶段使用大语言模型将非正式问题描述转化为形式化的符号知识库;在线阶段将输入编码为相同的形式语言,并由符号决策引擎结合知识库进行推理得出结果。 Result: 实验表明,ATA在完全自动化设置下与最先进的端到端推理模型性能相当;在人类验证和修正知识库后,性能显著超越更大模型,且具有完美确定性、更强的抗输入扰动能力以及对提示注入攻击的天然免疫性。 Conclusion: ATA通过基于符号推理的决策机制,提供了一种实用、可控、透明、可审计且可靠的下一代自主代理架构。 Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent limitations in trustworthiness, including hallucinations, instability, and a lack of transparency. To address these challenges, we introduce a generic neuro-symbolic approach, which we call Autonomous Trustworthy Agents (ATA). The core of our approach lies in decoupling tasks into two distinct phases: Offline knowledge ingestion and online task processing. During knowledge ingestion, an LLM translates an informal problem specification into a formal, symbolic knowledge base. This formal representation is crucial as it can be verified and refined by human experts, ensuring its correctness and alignment with domain requirements. In the subsequent task processing phase, each incoming input is encoded into the same formal language. A symbolic decision engine then utilizes this encoded input in conjunction with the formal knowledge base to derive a reliable result. Through an extensive evaluation on a complex reasoning task, we demonstrate that a concrete implementation of ATA is competitive with state-of-the-art end-to-end reasoning models in a fully automated setup while maintaining trustworthiness. Crucially, with a human-verified and corrected knowledge base, our approach significantly outperforms even larger models, while exhibiting perfect determinism, enhanced stability against input perturbations, and inherent immunity to prompt injection attacks. 
By generating decisions grounded in symbolic reasoning, ATA offers a practical and controllable architecture for building the next generation of transparent, auditable, and reliable autonomous agents.
[18] Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
Fu-An Chao,Bi-Cheng Yan,Berlin Chen
Main category: cs.CL
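TL;DR: This study explores Whisper's potential for L2 spoken language assessment: by extracting acoustic and linguistic features from its hidden representations, it outperforms existing methods without task-specific fine-tuning.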
Details
Motivation: 现有研究多依赖Whisper生成的转录文本进行外部分析,未能充分挖掘其内在表征能力。本文旨在探究Whisper在L2口语评估中未被开发的潜层特征潜力。 Method: 从Whisper的中间和最终输出中提取声学与语言特征,仅训练轻量级分类器;并引入图像和文本提示作为辅助相关性线索以提升性能。 Result: 在GEPT看图说话数据集上表现优异,超越包括多模态方法在内的现有最先进基线;分析表明Whisper的嵌入能自然编码语音的等级熟练度模式和语义信息。 Conclusion: Whisper无需任务特定微调即具备强大的口语评估潜力,可作为口语理解任务的有效基础模型。 Abstract: In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.[19] FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan,Md Farhan Ishmam,Abdullah Al Imran,Mohammad Ali Moni
Main category: cs.CL
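TL;DR: This paper proposes FrugalPrompt, a prompt compression framework that reduces contextual redundancy in large language models by retaining only the most semantically significant input tokens. It achieves substantial prompt compression on several NLP tasks while largely preserving performance, and it shows that tasks differ in how much contextual sparsity they tolerate.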
Details
Motivation: 大语言模型因使用长输入上下文而带来高昂成本和延迟,但其中许多token语义贡献低,存在大量冗余,因此需要一种有效方法去除低效token以提升效率。 Method: 利用GlobEnc和DecompX两种先进的token归因方法为输入token分配显著性分数,保留得分最高的前k% token并保持其原始顺序,生成稀疏化的高效提示。 Result: 在情感分析、常识问答和摘要任务中,20%的prompt压缩仅导致性能轻微下降;但在数学推理任务中性能显著下降;使用最低或随机token的效果远差于高分token,且发现可能存在任务污染效应。 Conclusion: FrugalPrompt能有效压缩prompt并维持多数任务性能,表明当前LLM可从高显著性线索重建上下文,但数学推理等任务仍依赖完整序列;研究揭示了不同任务在效率与性能权衡中的本质差异。 Abstract: Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. 
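The selection step described above (rank tokens by salience, keep the top-k%, restore the original order) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `frugalize`, the rounding rule, and the toy salience scores are assumptions, and in the paper the scores would come from an attribution method such as GlobEnc or DecompX rather than being written by hand.

```python
import math


def frugalize(tokens, scores, keep_percent):
    """Keep the top keep_percent% of tokens by salience, in original order.

    `scores` is assumed to hold one precomputed salience value per token
    (hypothetical values here; the paper obtains them from token attribution).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []
    # Assumption: round the kept count up and always keep at least one token.
    k = max(1, math.ceil(len(tokens) * keep_percent / 100))
    # Rank positions by descending salience; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Re-sort the surviving positions so the prompt keeps its original order.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]


tokens = ["the", "movie", "was", "absolutely", "wonderful"]
scores = [0.05, 0.60, 0.10, 0.40, 0.90]  # toy salience values
print(frugalize(tokens, scores, 60))  # ['movie', 'absolutely', 'wonderful']
```

Re-sorting the kept positions is what distinguishes this from simply returning the highest-scoring tokens: the compressed prompt remains a subsequence of the input.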
We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineates the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: https://github.com/Starscream-11813/Frugal-ICL
[20] TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
Bin Yu,Xinming Wang,Shijie Lian,Haotian Li,Changti Wu,Ruina Hu,Bailing Wang,Yuliang Wei,Kai Chen
Main category: cs.CL
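TL;DR: Proposes TrajSelector, an efficient Best-of-N framework that scores reasoning trajectories using the hidden states of the sampler LLM; a lightweight verifier delivers significant performance gains at lower inference cost.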
Details
Motivation: Existing external test-time scaling methods incur high computational overhead and underuse the LLM's internal representations. Method: A lightweight verifier (0.6B parameters) evaluates step-wise reasoning trajectories and aggregates the scores to select the best path; training is end-to-end and data-driven, with no need for large amounts of step-level annotation. Result: Validated on five benchmarks; in the Best-of-32 setting, accuracy is 4.61% higher than majority voting and 4.31% to 12.21% higher than existing process reward models, at lower inference cost. Conclusion: TrajSelector balances performance and efficiency, offering a more scalable solution for complex reasoning tasks. Abstract: Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of each trajectory step and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.
[21] RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning
Deyi Ji,Yuekui Yang,Haiyang Wu,Shaoping Ma,Tianrun Chen,Lanyun Zhu
Main category: cs.CL
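TL;DR: Proposes RAVEN, a framework that combines curriculum reinforcement learning with multimodal large language models and uses progressive training and Group Relative Policy Optimization to detect and temporally ground violations in advertisement videos accurately.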
Details
Motivation: Existing methods for ad video violation detection suffer from imprecise temporal grounding, noisy annotations, and poor generalization, so models need stronger reasoning and better real-world performance. Method: Proposes the RAVEN framework, which combines curriculum reinforcement learning with multimodal large language models, trains progressively on precisely and coarsely annotated data, and introduces Group Relative Policy Optimization (GRPO) with a multi-level reward mechanism so that reasoning ability develops without explicit reasoning annotations. Result: Experiments on industrial datasets and public benchmarks show that RAVEN performs better in violation category accuracy and temporal interval localization; online A/B testing validates its practical effectiveness, with significant gains in precision and recall, and it generalizes well, mitigating catastrophic forgetting. Conclusion: By fusing reinforcement learning with multimodal large models, RAVEN improves the accuracy, temporal grounding, and deployment effectiveness of ad video violation detection, with good generalization and application prospects. Abstract: Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. A sophisticated hierarchical reward mechanism with multiple levels ensures precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performance in violation category accuracy and temporal interval localization. We also design a pipeline to deploy RAVEN on online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.
[22] Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Benjamin Roth,Barbara Plank
Main category: cs.CL
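TL;DR: This paper applies the LiTEx taxonomy to analyze differences in annotators' labels and reasoning in natural language inference (NLI) datasets. It finds that annotators who disagree on a label may still give highly similar explanations, suggesting that surface-level disagreement can mask agreement in semantic interpretation. It stresses that reasoning-based explanations reflect semantic similarity more accurately and cautions against treating labels as absolute truth.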
Details
Motivation: 理解NLI数据集中人类标注的变异,特别是标注者在标签一致情况下推理过程的差异,并扩展至标签不一致的情形。 Method: 应用LiTEx分类法对两个英文NLI数据集中的自由文本解释进行分析,从标签一致性、解释相似性、分类法一致性和标注者选择偏差等多个维度对标注变异进行对齐分析。 Result: 发现部分标注者虽在标签上不一致,但其解释高度相似,表明表面标签分歧可能掩盖深层语义理解的一致;同时揭示了标注者在解释策略和标签选择上的个体偏好;推理类型的一致性比标签一致性更能反映解释的语义相似性。 Conclusion: 基于推理的解释揭示了NLI标注中丰富的个体差异和深层一致性,强调仅依赖标签作为‘真实标签’存在风险,应更重视解释所体现的语义理解过程。 Abstract: Natural Language Inference datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning types. However, previous work applying such taxonomies has focused on within-label variation: cases where annotators agree on the final NLI label but provide different explanations. In contrast, this paper broadens the scope by examining how annotators may diverge not only in the reasoning type but also in the labeling step. We use explanations as a lens to decompose the reasoning process underlying NLI annotation and to analyze individual differences. We apply LiTEx to two NLI English datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators' selection bias. We observe instances where annotators disagree on the label but provide highly similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning types better reflects the semantic similarity of free-text explanations than label agreement alone. 
Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.
[23] Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
Vamshi Krishna Bonagiri,Ponnurangam Kumaragurum,Khanh Nguyen,Benjamin Plaut
Main category: cs.CL
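TL;DR: This paper proposes "quitting" as a safety behavior for LLM agents in uncertain situations: explicit instructions prompt the agent to withdraw when it lacks confidence, improving safety with almost no loss of helpfulness.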
Details
Motivation: 随着大语言模型代理在复杂现实环境中的应用增多,其安全性至关重要。传统不确定性量化方法难以应对多轮交互和工具使用带来的累积风险,因此需要新的安全机制。 Method: 基于ToolEmu框架,系统评估了12种最先进大语言模型在显式退出指令下的表现,衡量其在安全性和有用性上的权衡。 Result: 引入显式退出指令后,所有模型的安全性平均提升0.39分(满分3分),其中专有模型提升达0.64分,而有用性仅轻微下降0.03分。 Conclusion: 显式退出指令是一种简单、有效且可立即部署的安全机制,能够作为高风险应用中自主代理的第一道防线。 Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.[24] Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection
Michelle Yuan,Khushbu Pahwa,Shuaichen Chang,Mustafa Kaba,Jiarong Jiang,Xiaofei Ma,Yi Zhang,Monica Sunkara
Main category: cs.CL
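TL;DR: This paper proposes an automated framework, inspired by the knapsack problem, for efficiently composing agentic systems in dynamic environments; by assessing component performance, cost, and compatibility in real time, it significantly raises task success rates while lowering resource overhead.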
Details
Motivation: 现有方法依赖静态语义检索进行工具或智能体发现,难以准确评估组件的实际能力、成本和实时效用,导致组件选择效果不佳。 Method: 提出一种结构化的自动组合框架,将智能体系统构建建模为在线背包问题,由一个composer智能体动态测试候选组件并实时建模其效用,在性能、预算和兼容性约束下选择最优组件集合。 Result: 在五个基准数据集上使用Claude 3.5 Sonnet进行实验,结果表明该方法在帕累托前沿上优于基线:单智能体场景下成功率最高提升31.6%;多智能体场景下,从100多个智能体中选型时成功率从37%提升至87%。 Conclusion: 所提出的在线背包式composer能有效实现智能体系统的高效组装与资源复用,在不同领域和预算限制下均表现出强大的适应性和性能优势。 Abstract: Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers because the decisions are not based on capability, cost, and real-time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real-time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single-agent setup, the online knapsack composer shows a success rate improvement of up to 31.6% in comparison to the retrieval baselines. In multi-agent systems, the online knapsack composer increases success rate from 37% to 87% when agents are selected from an agent inventory of 100+ agents. 
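To make the knapsack framing concrete, here is a minimal Python sketch of a threshold-based online knapsack selector. It is an assumption-laden illustration, not the paper's composer: the `Component` fields, the `compose` function, the value-density threshold rule, and all numbers are hypothetical, and the `utility` values stand in for whatever score the composer agent would obtain by testing a candidate component.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    cost: float     # hypothetical price of adding this component
    utility: float  # hypothetical score measured by trial-running it


def compose(candidates, budget, min_density):
    """Decide on each candidate as it arrives (online setting).

    A candidate is accepted when its utility per unit cost reaches
    `min_density` and it still fits within the remaining budget.
    """
    selected, spent = [], 0.0
    for c in candidates:
        if c.cost <= 0:
            continue  # skip malformed entries rather than divide by zero
        density = c.utility / c.cost
        if density >= min_density and spent + c.cost <= budget:
            selected.append(c)
            spent += c.cost
    return selected, spent


inventory = [
    Component("web_search", cost=2.0, utility=8.0),
    Component("calculator", cost=1.0, utility=3.0),
    Component("generalist_agent", cost=5.0, utility=6.0),
    Component("code_runner", cost=3.0, utility=9.0),
]
chosen, spent = compose(inventory, budget=6.0, min_density=2.0)
print([c.name for c in chosen], spent)
```

With these toy numbers the selector keeps `web_search`, `calculator`, and `code_runner` and rejects `generalist_agent`, whose utility per unit cost falls below the threshold. Unlike an offline knapsack solver, each decision is made without seeing later candidates.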
The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.
[25] ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation
Haoxuan Zhang,Ruochi Li,Sarthak Shrestha,Shree Harshini Mamidala,Revanth Putta,Arka Krishan Aggarwal,Ting Xiao,Junhua Ding,Haihua Chen
Main category: cs.CL
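TL;DR: This paper presents ReviewGuard, an automated LLM-based system for detecting and categorizing deficient peer reviews, in response to the challenges that AI-generated content poses to scholarly review.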
Details
Motivation: 随着投稿量激增及大语言模型在学术评审中的广泛应用,人类和AI产生的低质量评审威胁到同行评审体系的完整性和学术诚信,亟需有效工具识别这些问题评审。 Method: ReviewGuard采用四阶段LLM驱动框架:1)从OpenReview收集ICLR和NeurIPS论文及其评审;2)使用GPT-4.1标注评审类型并辅以人工验证;3)通过LLM生成合成数据缓解类别不平衡和数据稀缺问题;4)微调编码器模型和开源LLM进行检测。 Result: 构建了包含6,634篇论文、24,657条真实评审和46,438条合成评审的数据集;发现缺陷评审具有评分较低、自信度高、结构简单、负面情绪多等特点;AI生成内容检测显示ChatGPT出现后AI评审显著增加;混合训练显著提升二分类任务的召回率和F1分数。 Conclusion: ReviewGuard是首个用于检测缺陷同行评审的LLM驱动系统,为规范AI在评审中的应用提供了实证支持,并为维护学术诚信下的人机协作提供了重要见解。 Abstract: Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. Recent work has focused on using LLMs to improve review efficiency or generate insightful review content. However, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine the peer review ecosystem and compromise academic integrity. To address this critical issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews. ReviewGuard employs a comprehensive four-stage LLM-driven framework that: (1) collects ICLR and NeurIPS papers with their corresponding reviews from OpenReview; (2) annotates review types using GPT-4.1 with human validation; (3) addresses class imbalance and data scarcity through LLM-driven synthetic data augmentation, producing a final corpus of 6,634 papers, 24,657 real reviews, and 46,438 synthetic reviews; and (4) fine-tunes both encoder-based models and open source LLMs. We perform comprehensive feature analysis of the structure and quality of the review text. Compared to sufficient reviews, deficient reviews demonstrate lower rating scores, higher self-reported confidence, reduced structural complexity, and a higher proportion of negative sentiment. AI-generated text detection reveals that, since ChatGPT's emergence, AI-generated reviews have increased dramatically. 
In the evaluation of deficient review detection models, mixed training with synthetic and real review data provides substantial enhancements to recall and F1 scores on the binary task. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review while offering valuable insights into human-AI collaboration to maintain academic integrity.
[26] Language over Content: Tracing Cultural Understanding in Multilingual Large Language Models
Seungho Cho,Changgeon Ko,Eui Jun Hwang,Junmyeong Lee,Huije Lee,Jong C. Park
Main category: cs.CL
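TL;DR: This study examines how large language models handle cultural understanding by analyzing the overlap of internal activation paths when they answer questions that are semantically equivalent but differ in cultural and linguistic conditions. The results show that language influences internal paths more strongly than culture, and that linguistic similarity does not guarantee aligned internal representations, as the South Korea-North Korea case shows.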
Details
Motivation: 大语言模型在全球不同文化背景下应用广泛,但现有评估多关注输出表现,缺乏对内部文化理解机制的深入探究。同时,电路分析研究覆盖语言有限,且很少聚焦文化因素。因此,需要系统分析模型内部如何处理语言与文化的交互。 Method: 通过测量大语言模型在两种条件下回答语义等价问题时的内部激活路径重叠:一是固定问题语言而改变目标国家,二是固定国家而改变问题语言。使用同语言国家对以分离语言与文化因素的影响。 Result: 发现同语言跨国家问题的内部路径重叠度高于跨语言同国家问题,表明模型具有强烈的语言特异性模式。特别地,韩国与朝鲜这对语言相似的国家表现出低路径重叠和高变异性,说明语言相似性不保证内部表征一致性。 Conclusion: 大语言模型的内部文化理解受语言影响显著,语言因素比文化因素更主导其推理路径;语言相似性不足以确保文化相关问题的内部表征对齐,提示需专门建模文化敏感性。 Abstract: Large language models (LLMs) are increasingly used across diverse cultural contexts, making accurate cultural understanding essential. Prior evaluations have mostly focused on output-level performance, obscuring the factors that drive differences in responses, while studies using circuit analysis have covered few languages and rarely focused on culture. In this work, we trace LLMs' internal cultural understanding mechanisms by measuring activation path overlaps when answering semantically equivalent questions under two conditions: varying the target country while fixing the question language, and varying the question language while fixing the country. We also use same-language country pairs to disentangle language from cultural aspects. Results show that internal paths overlap more for same-language, cross-country questions than for cross-language, same-country questions, indicating strong language-specific patterns. Notably, the South Korea-North Korea pair exhibits low overlap and high variability, showing that linguistic similarity does not guarantee aligned internal representation.[27] Hallucination Benchmark for Speech Foundation Models
Alkis Koudounas,Moreno La Quatra,Manuel Giollo,Sabato Marco Siniscalchi,Elena Baralis
Main category: cs.CL
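TL;DR: This paper presents SHALLOW, the first benchmark framework to systematically categorize and quantify hallucinations in ASR along four axes (lexical, phonetic, morphological, and semantic), with targeted metrics that yield interpretable profiles of model behavior.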
Details
Motivation: 现有的ASR评估指标主要基于错误率,难以区分语音错误与具有语义合理性的幻觉错误,尤其在医疗、法律等关键领域可能带来严重风险,因此需要更精细的评估框架。 Method: 提出SHALLOW框架,从词汇、语音、形态和语义四个互补维度对ASR中的幻觉进行分类与量化,设计各维度的专用指标,并在多种模型架构和语音场景下进行评估分析。 Result: 实验表明SHALLOW指标在识别质量高时与WER强相关,但在WER升高时相关性显著减弱,说明其能捕捉WER在恶劣条件下无法反映的细粒度错误模式。 Conclusion: SHALLOW能够有效识别和评估ASR模型产生幻觉的倾向,支持对模型弱点的具体诊断,为超越传统聚合错误率的模型改进提供反馈。 Abstract: Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. 
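The limitation of aggregate error rates noted above can be seen with a small worked example. The Python sketch below computes word error rate as word-level edit distance divided by reference length; it is a generic textbook WER, not SHALLOW's own metrics, and the function name `wer` and the example sentences are assumptions chosen for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


reference = "take two tablets daily"
phonetic_slip = "take too tablets daily"      # sounds the same, harmless
meaning_change = "take two tablets weekly"    # fluent, but changes the dose
print(wer(reference, phonetic_slip), wer(reference, meaning_change))
```

Both hypotheses score a WER of 0.25 because each contains one substituted word out of four, even though only the second alters the meaning. This is the kind of distinction a single aggregate error rate cannot express.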
SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
[28] AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu
Muhammad Ammar,Hadiya Murad Hadi,Usman Majeed Butt
Main category: cs.CL
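TL;DR: Proposes an AI-generated text detection framework for Urdu: a balanced dataset is built and multilingual Transformer models are fine-tuned on it, with mDeBERTa-v3-base reaching an F1 score of 91.29% and an accuracy of 91.26%.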
Details
Motivation: As text generated by large language models grows ever closer to human writing, telling human and machine text apart becomes difficult, especially in low-resource languages such as Urdu that lack effective detection tools. Method: Builds a dataset of 1,800 human-written and 1,800 AI-generated Urdu texts, analyzes character and word frequencies, vocabulary richness (TTR), and N-gram patterns, and fine-tunes three multilingual Transformer models: mdeberta-v3-base, distilbert-base-multilingual-cased, and xlm-roberta-base. Result: mDeBERTa-v3-base performs best on the test set, with an F1 score of 91.29% and an accuracy of 91.26%; statistical tests show significant differences in linguistic features between human and AI text. Conclusion: The study advances efforts to counter misinformation and academic misconduct in Urdu-language settings and supports the development of NLP tools for low-resource languages. Abstract: Large Language Models (LLMs) are now capable of generating text that closely resembles human writing, making them powerful tools for content creation, but this growing ability has also made it harder to tell whether a piece of text was written by a human or by a machine. This challenge becomes even more serious for languages like Urdu, where there are very few tools available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored for the Urdu language. A balanced dataset comprising 1,800 human-authored and 1,800 AI-generated texts, the latter sourced from models such as Gemini, GPT-4o-mini, and Kimi AI, was developed. Detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (Type Token Ratio), and N-gram patterns, with significance evaluated through t-tests and Mann-Whitney U tests. Three state-of-the-art multilingual transformer models (mdeberta-v3-base, distilbert-base-multilingual-cased, and xlm-roberta-base) were fine-tuned on this dataset. The mDeBERTa-v3-base achieved the highest performance, with an F1-score of 91.29% and an accuracy of 91.26% on the test set. This research advances efforts in combating misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low-resource languages.
[29] Fine-tuning of Large Language Models for Constituency Parsing Using a Sequence to Sequence Approach
Francisco Jose Cortes Delgado,Eduardo Martinez Gracia,Rafael Valencia Garcia
Main category: cs.CL
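TL;DR: This paper explores a new approach that fine-tunes large language models (LLMs) to translate an input sentence into its corresponding syntactic structure, in order to enhance MiSintaxis, a tool for teaching Spanish syntax.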
Details
Motivation: To improve syntactic analysis tools with the latest natural language processing techniques, particularly for Spanish teaching. Method: Several models from the Hugging Face repository are fine-tuned on training data generated from the AnCora-ES corpus and evaluated with the F1 score. Result: Experiments show that the approach achieves high accuracy in phrase-structure analysis. Conclusion: The proposed approach shows promise for improving the quality of syntactic analysis and the capabilities of the teaching tool. Abstract: Recent advances in natural language processing with large neural models have opened new possibilities for syntactic analysis based on machine learning. This work explores a novel approach to phrase-structure analysis by fine-tuning large language models (LLMs) to translate an input sentence into its corresponding syntactic structure. The main objective is to extend the capabilities of MiSintaxis, a tool designed for teaching Spanish syntax. Several models from the Hugging Face repository were fine-tuned using training data generated from the AnCora-ES corpus, and their performance was evaluated using the F1 score. The results demonstrate high accuracy in phrase-structure analysis and highlight the potential of this methodology.
[30] Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration
Zhixuan He,Yue Feng
Main category: cs.CL
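TL;DR: This paper proposes the DiMo framework, in which four LLM agents with distinct reasoning modes engage in structured debate to improve the performance and interpretability of large language models; it performs especially well on math tasks and supports semantic annotation and Web-native knowledge integration.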
Details
Motivation: 大语言模型虽然性能强大,但缺乏可解释的推理过程。为了增强模型决策的透明性和可靠性,需要一种能够模拟多样化思维并提供清晰推理链的方法。 Method: 提出Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo),使用四个专业化LLM代理,每个代理代表一种独特的推理范式,通过迭代辩论机制共同探索多种认知路径,生成更稳健的结论和可审计的推理链。 Result: 在六个基准测试中,DiMo在统一开源设置下优于单模型和辩论基线方法,尤其在数学任务上提升显著;同时生成语义类型化、URL注释的证据链,支持下游系统复用。 Conclusion: DiMo是一种语义感知、Web原生的多代理框架,有效结合了检索增强推理与结构化解释,为人类-机器智能协作提供了可解释、可扩展的解决方案。 Abstract: Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math. We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.[31] All You Need is One: Capsule Prompt Tuning with a Single Vector
Yiyang Liu,James C. Liang,Heng Fan,Wenhao Yang,Yiming Cui,Xiaotian Han,Lifu Huang,Dongfang Liu,Qifan Wang,Cheng Han
Main category: cs.CL
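TL;DR: This paper proposes Capsule Prompt-Tuning (CaPT), a new method that incorporates instance-aware information into the prompt to improve large language model performance on downstream tasks while remaining extremely parameter-efficient.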
Details
Motivation: Existing prompt-based learning methods rely on laborious grid search and lack instance-aware information, leaving the attention mechanism with too little interaction with the input sequence. Method: Proposes Capsule Prompt-Tuning (CaPT), which incorporates instance semantics into the prompt in a nearly parameter-free manner and exploits the "attention anchor" phenomenon by placing instance-aware tokens at the very start of the sequence to strengthen attention interaction. Result: Superior performance across a range of language tasks (e.g., 84.03% average accuracy on T5-Large) with high parameter efficiency (e.g., only 0.003% of parameters updated on Llama3.2-1B). Conclusion: By combining task-aware and instance-aware information, CaPT strengthens the attention mechanism and model performance of prompt tuning, making it an efficient and effective parameter-efficient fine-tuning method. Abstract: Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods heavily rely on laborious grid searching for optimal prompt length and typically require a considerable number of prompts, introducing additional computational burden. Worse yet, our pioneering findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely "attention anchor", that incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). 
Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003% of model parameters on Llama3.2-1B).
[32] Temporal Understanding under Deictic Frame of Reference
Damin Zhang,Julia Rayz
Main category: cs.CL
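TL;DR: This paper proposes the TUuD framework for evaluating how well large language models understand temporal relations between events when the temporal reference point shifts dynamically. The study shows that although large language models exhibit some human-like temporal cognition, they adapt poorly to distant contexts, indicating that their temporal reasoning remains constrained by reference-frame shifts and temporal distance.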
Details
Motivation: Understanding time is fundamental to human cognition, yet large language models remain limited in temporal understanding and reasoning. The study therefore investigates how large language models perceive temporal relations as the reference point of "now" changes. Method: Proposes the TUuD framework, which assesses temporal understanding under a dynamic temporal frame of reference by having large language models rate the similarity between the current moment and a target event from 0.00 to 1.00. Result: The four evaluated large language models adapt to the temporal frame of reference in near-term contexts, with similarity ratings highest at the current moment and decreasing toward the past and the future; this adaptation weakens in distant contexts. Conclusion: Large language models show partially human-like temporal cognition, but their temporal reasoning remains limited when the reference frame shifts and when time spans are long. Abstract: Understanding time is fundamental to human cognition, where temporal experience is often conceptualized through spatial metaphors grounded in sensory-motor experience. For example, "summer is approaching" parallels "We are approaching the summer". In such expressions, humans rely on a frame of reference (FoR) to interpret meaning relative to a particular viewpoint. Extending this concept to time, a temporal frame of reference (t-FoR) defines how temporal relations are perceived relative to an experiencer's moment of "now". While Large Language Models (LLMs) have shown remarkable advances in natural language understanding, their ability to interpret and reason about time remains limited. In this work, we introduce TUuD (Temporal Understanding under Deictic t-FoR), a framework that evaluates how LLMs interpret time-event and event-event relations when the reference point of "now" dynamically shifts along a timeline. Following recent work on temporal cognition \cite{li2025other}, LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 (completely dissimilar) to 1.00 (highly similar), where similarity quantifies perceived temporal alignment between the two points. Our results show that four evaluated LLMs exhibit measurable adaptation to a deictic t-FoR, with similarity ratings peaking around the present and decreasing toward past and future events. 
The adaptation, however, weakens beyond near-term contexts, suggesting that while LLMs display partial human-like temporal cognition, their temporal reasoning remains sensitive to reference-frame shifts and temporal distance.[33] Investigating the Impact of Rationales for LLMs on Natural Language Understanding
Wenhang Shi,Shuqing Bian,Yiren Chen,Xinyi Zhang,Zhe Zhao,Pengfei Hu,Wei Lu,Xiaoyong Du
Main category: cs.CL
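TL;DR: This paper examines the role of chain-of-thought (CoT) reasoning in natural language understanding (NLU) tasks, constructs NLURC, a high-quality NLU dataset with rationales, and proposes several rationale-augmented methods. Experiments show that as model size grows, CoT inference shifts from hurting performance to outperforming direct prediction; most rationale-augmented training methods perform poorly, but one specially designed method consistently improves performance; and models trained with rationales perform well on unseen NLU tasks, rivaling models ten times their size, while offering good interpretability.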
Details
Motivation: Existing work focuses mainly on the role of rationales in mathematical and commonsense reasoning and overlooks their potential in natural language understanding (NLU) tasks. This paper systematically investigates whether rationales can also improve NLU performance.
Method: Constructs NLURC, a comprehensive, high-quality NLU dataset with rationales, and develops several rationale-augmented training and inference methods, including generating rationales at inference time and different strategies for combining rationales with labels during training.
Result: (1) CoT inference can hurt NLU performance on small models but outperforms direct prediction as model size grows; (2) most rationale-augmented training methods underperform label-only training, and only one specially designed method improves performance consistently; (3) models trained with rationales rival models ten times their size on unseen NLU tasks, with interpretability on par with commercial LLMs.
Conclusion: The benefit of rationales in NLU depends on model size and training-method design; used appropriately, rationales improve not only performance and generalization but also interpretability, pointing toward efficient, transparent NLU model training.
Abstract: Chain-of-thought (CoT) rationales, which provide step-by-step reasoning to derive final answers, benefit LLMs in both inference and training. Incorporating rationales, either by generating them before answering during inference or by placing them before or after the original answers during training, significantly improves model performance on mathematical, symbolic and commonsense reasoning tasks. However, most work focuses on the role of rationales in these reasoning tasks, overlooking their potential impact on other important tasks like natural language understanding (NLU) tasks. In this work, we raise the question: Can rationales similarly benefit NLU tasks? To conduct a systematic exploration, we construct NLURC, a comprehensive and high-quality NLU dataset collection with rationales, and develop various rationale-augmented methods. Through exploring the applicability of these methods on NLU tasks using the dataset, we uncover several potentially surprising findings: (1) CoT inference shifts from hindering NLU performance to surpassing direct label prediction as model size grows, indicating a positive correlation. (2) Most rationale-augmented training methods perform worse than label-only training, with one specially designed method consistently achieving improvements. (3) LLMs trained with rationales achieve significant performance gains on unseen NLU tasks, rivaling models ten times their size, while delivering interpretability on par with commercial LLMs.
[34] Natural Language Processing Applications in Cardiology: A Narrative Review
Kailai Yang,Yan Leng,Xin Zhang,Tianlin Zhang,Paul Thompson,Bernard Keavney,Maciej Tomaszewski,Sophia Ananiadou
Main category: cs.CL
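TL;DR: This paper reviews research on natural language processing (NLP) in cardiology from 2014 to 2025, analyzing 265 relevant articles along several dimensions, including NLP paradigm, task type, cardiovascular disease type, and data source, and revealing the field's diversity and trends.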
Details
Motivation: Cardiovascular disease is increasingly prevalent and involves many complex factors. The relevant information is scattered across various kinds of textual data, so effective methods are needed to integrate and analyze this unstructured data and improve care.
Method: Queries six literature databases, selects 265 articles applying NLP techniques to cardiovascular disease, and analyzes them by NLP paradigm, task type, disease type, and data source, together with a temporal trend analysis.
Result: NLP applications in cardiology are highly diverse along every dimension analyzed; the temporal analysis shows NLP methods evolving steadily over the past decade, with advanced techniques such as deep learning gradually becoming dominant.
Conclusion: This review is the most comprehensive overview to date of NLP research in cardiology and shows the potential of NLP to advance the diagnosis, treatment, and prevention of cardiovascular disease.
Abstract: Cardiovascular disease has become increasingly prevalent in modern society and has a significant effect on global health and well-being. Heart-related conditions are intricate, multifaceted disorders, which may be influenced by a combination of genetic predispositions, lifestyle choices, and various socioeconomic and clinical factors. Information regarding these potentially complex interrelationships is dispersed among diverse types of textual data, which include patient narratives, medical records, and scientific literature, among others. Natural language processing (NLP) techniques have increasingly been adopted as a powerful means to analyse and make sense of this vast amount of unstructured data. This, in turn, can allow healthcare professionals to gain deeper insights into the cardiology field, which has the potential to revolutionize current approaches to the diagnosis, treatment, and prevention of cardiac problems. This review provides a detailed overview of NLP research in cardiology between 2014 and 2025. We queried six literature databases to find articles describing the application of NLP techniques in the context of a range of different cardiovascular diseases. Following a rigorous screening process, we identified a total of 265 relevant articles. We analysed each article from multiple dimensions, i.e., NLP paradigm types, cardiology-related task types, cardiovascular disease types, and data source types. Our analysis reveals considerable diversity within each of these dimensions, thus demonstrating the considerable breadth of NLP research within the field. We also perform a temporal analysis, which illustrates the evolution and changing trends in NLP methods employed over the last decade that we cover. To our knowledge, the review constitutes the most comprehensive overview of NLP research in cardiology to date.
[35] The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models
Shivam Ratnakar,Sanjay Raghavendra
Main category: cs.CL
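TL;DR: This paper presents the first systematic study of "chameleon behavior" in large language models (LLMs) in multi-turn conversations: the tendency to shift stance when faced with contradictory questions, which is especially pronounced in search-enabled models. The authors build the Chameleon benchmark dataset of 17,770 question-answer pairs, propose two metrics (Chameleon Score and Source Re-use Rate), and find severe stance inconsistency across several mainstream models, tracing it to insufficient knowledge diversity that makes models over-reliant on query framing. The study stresses the need for consistency evaluation before deployment in critical domains such as healthcare and law.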
Details
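Motivation: LLMs integrated with retrieval systems are widely deployed, but in multi-turn conversations they may change stance as the framing of the question changes, which undermines reliability. In high-stakes domains that need consistent reasoning, such as healthcare and law, this "chameleon behavior" can have serious consequences. The phenomenon had not been studied systematically, so a quantitative analysis of its causes and effects is needed.
Method: Builds the Chameleon Benchmark Dataset, a large-scale multi-turn conversation dataset covering 12 controversial domains with 1,180 conversations and 17,770 question-answer pairs. Proposes two theoretically grounded metrics: the Chameleon Score (0-1), which measures stance instability, and the Source Re-use Rate (0-1), which measures the diversity of knowledge sources. Evaluates Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash, and uses statistical correlation analysis (e.g., Pearson correlation coefficients) to examine the relationship between stance change and source re-use.
Result: All tested models show pronounced chameleon behavior, with Chameleon Scores between 0.391 and 0.511; GPT-4o-mini performs worst. Variance across sampling temperatures is very small (below 0.004), indicating that the effect is not a sampling artifact. Source re-use rate correlates positively and significantly (p below 0.05) with confidence (r=0.627) and with stance changes (r=0.429), suggesting that models over-rely on question framing when their knowledge sources lack diversity.
Conclusion: Current mainstream LLMs commonly show stance instability in multi-turn conversations, attributed mainly to insufficient diversity in retrieved knowledge, which makes models overly sensitive to question wording. The findings call for consistency evaluation in high-stakes applications and for improving the breadth of models' knowledge coverage and the robustness of their reasoning.
Abstract: Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability. We present the first systematic investigation of "chameleon behavior" in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations (especially in search-enabled LLMs). Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems. We introduce two theoretically grounded metrics: the Chameleon Score (0-1) that quantifies stance instability, and Source Re-use Rate (0-1) that measures knowledge diversity. Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance. Crucially, small across-temperature variance (less than 0.004) suggests the effect is not a sampling artifact. Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence (r=0.627) and stance changes (r=0.429) are statistically significant (p less than 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing. These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems where maintaining coherent positions across interactions is critical for reliable decision support.
[36] so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs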
Sriharsh Bhyravajjula,Melanie Walsh,Anna Preus,Maria Antoniak
Main category: cs.CL
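TL;DR: This study examines the use of whitespace in poetry, analyzing whitespace distribution in 19,000 English-language poems and comparing human-written poems with poems generated by large language models, and highlights the importance of whitespace to poetic form and meaning.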
Details
Motivation: Although whitespace is an important component of poetic form, it has received too little attention in NLP, particularly for LLM poetry generation.
Method: Uses a corpus of 19,000 published English-language poems from Poetry Foundation to analyze whitespace usage by 4,000 poets, compares it with 51,000 LLM-generated poems and 12,000 unpublished poems from an online community, and examines whitespace patterns across time periods, poetic forms, and data sources.
Result: Different text processing methods significantly affect how whitespace in poetry is represented, and human poets and LLMs differ clearly in whitespace usage; a dataset of 2,800 public-domain poems is released to support further research.
Conclusion: Whitespace is not only a key feature of poetic art but should also be an important consideration when NLP systems process poetry data; in particular, whitespace needs careful handling when assembling LLM pretraining datasets.
Abstract: Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.
[37] Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
Sanskar Pandey,Ruhaan Chopra,Angkul Puniya,Sohom Pal
Main category: cs.CL
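TL;DR: This paper proposes Beacon, a benchmark for measuring sycophancy bias in large language models; it finds that the bias grows with model scale and can be modulated through prompt-level and activation-level interventions.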
Details
Motivation: In pursuing helpfulness, LLMs may defer excessively to users at the expense of factual accuracy, producing a sycophancy bias that needs effective measurement and intervention.
Method: Designs Beacon, a single-turn forced-choice benchmark that removes the influence of conversational context and quantifies the trade-off between factual correctness and agreeing with the user; proposes prompt-level and activation-level interventions.
Result: Across 12 state-of-the-art models, sycophancy decomposes into stable linguistic and affective sub-biases that scale with model capacity; the interventions can effectively modulate both kinds of bias.
Conclusion: By treating sycophancy as a measurable form of normative misgeneralization, Beacon provides a reproducible foundation for studying and mitigating alignment drift in large models.
Abstract: Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.
[38] Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games
Yikai Zhang,Ye Rong,Siyu Yuan,Jiangjie Chen,Jian Xie,Yanghua Xiao
Main category: cs.CL
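TL;DR: This paper proposes SCO-PAL, a step-level policy optimization method that uses self-play to substantially improve language agents' strategic reasoning in dynamic adversarial games, raising the average win rate by about 30% over baselines and reaching a 54.76% win rate against GPT-4.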
Details
Motivation: Existing language agents perform poorly in dynamic adversarial games because of weak strategic reasoning, and they rely on costly expert-labeled data, lacking an effective mechanism for automatic learning.
Method: Proposes SCO-PAL (Step-level poliCy Optimization through Play-And-Learn), which learns from game interactions against opponents of different levels, including self-play, and systematically analyzes how opponent selection affects learning.
Result: Using SCO-PAL with self-play raises the average win rate against four opponents by about 30% over baselines and achieves a 54.76% win rate against GPT-4 across six adversarial games.
Conclusion: Self-play is the most effective way to improve language agents' strategic reasoning in dynamic adversarial environments, and SCO-PAL offers a feasible path to autonomous learning without expert data.
Abstract: Existing language agents often encounter difficulties in dynamic adversarial games due to poor strategic reasoning. To mitigate this limitation, a promising approach is to allow agents to learn from game interactions automatically, without relying on costly expert-labeled data. Unlike static environments where agents receive fixed feedback or rewards, selecting appropriate opponents in dynamic adversarial games can significantly impact learning performance. However, the discussion of opponents in adversarial environments remains an area under exploration. In this paper, we propose a Step-level poliCy Optimization method through Play-And-Learn, SCO-PAL. Leveraging SCO-PAL, we conduct a detailed analysis of opponent selection by setting opponents at different levels and find that self-play is the most effective way to improve strategic reasoning in such adversarial environments. Utilizing SCO-PAL with self-play, we increase the average win rate against four opponents by approximately 30% compared to baselines and achieve a 54.76% win rate against GPT-4 in six adversarial games.
[39] LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
Sheikh Jubair,Arwa Omayrah,Amal Alshammari,Alhanoof Althnian,Abdulhamed Alothaimen,Norah A. Alzahrani,Shahad D. Alzaidi,Nora Al-Twairesh,Abdulmohsen Al-Thubaity
Main category: cs.CL
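TL;DR: This paper proposes LC-Eval, a bilingual, multi-task benchmark for long-context understanding in English and Arabic that targets context lengths from 4k to over 128k tokens and comprises four new tasks assessing LLM abilities such as deep reasoning and document comprehension.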
Details
Motivation: Existing evaluation methods for long-context understanding are inadequate; a more rigorous and challenging multilingual, multi-task benchmark is needed to assess LLMs' long-text understanding comprehensively.
Method: Builds LC-Eval, a bilingual evaluation set with four new tasks (multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts), and evaluates both open-weight and closed LLMs on it.
Result: Even high-performing models such as GPT-4o perform poorly on some tasks, showing that LC-Eval is difficult and effectively exposes the limitations of current models.
Conclusion: LC-Eval is an effective benchmark for long-context understanding that comprehensively assesses LLMs' long-text understanding and reasoning in bilingual, multi-task settings, providing an important tool for future research.
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.
[40] MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning
Vera Pavlova,Mohammed Makhlouf
Main category: cs.CL
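TL;DR: This paper proposes MOSAIC, a multi-stage domain adaptation framework for sentence embedding models that incorporates domain-specific masked supervision and significantly outperforms general-domain baselines in both high- and low-resource settings.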
Details
Motivation: Adapting large-scale general-domain sentence embedding models to specialized domains faces semantic drift and data scarcity, and existing methods struggle to balance domain relevance with semantic discrimination.
Method: Proposes the MOSAIC framework, which jointly optimizes masked language modeling (MLM) and contrastive learning objectives in a unified training pipeline and uses a staged adaptation strategy for domain transfer.
Result: In experiments on high- and low-resource domains, NDCG@10 improves by up to 13.4%, and ablation studies confirm the effectiveness of joint supervision and the staged design.
Conclusion: Through balanced joint supervision and staged contrastive learning, MOSAIC effectively improves sentence embedding models in specialized domains while preserving their original semantic discrimination.
Abstract: We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.
[41] Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities
Hans Hergen Lehmann,Jae Hee Lee,Steven Schockaert,Stefan Wermter
Main category: cs.CL
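TL;DR: The study finds that, when comparing entities on numerical attributes, large language models often rely on surface heuristics such as popularity, mention order, and semantic co-occurrence rather than on genuine knowledge. Smaller models are more susceptible to these biases, larger models use reliable numerical knowledge selectively, and chain-of-thought prompting pushes all models to make more use of numerical information.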
Details
Motivation: To understand whether LLMs rely on genuine knowledge or on surface heuristics in knowledge-based reasoning, and in particular how they behave on numerical comparison tasks with clear ground truth.
Method: Designs entity comparison tasks over numerical attributes (e.g., river length), analyzes whether models of different sizes are still swayed by popularity, mention order, and semantic co-occurrence even when they hold accurate knowledge, and uses logistic regression to predict model choices.
Result: LLMs often make incorrect predictions that contradict their own knowledge; smaller models' choices can be predicted accurately from surface cues, indicating reliance on heuristics; larger models (32B) distinguish when their knowledge is reliable and use numerical information more sensibly; chain-of-thought prompting steers models of all sizes toward numerical features.
Conclusion: Larger models are better at relying selectively on reliable knowledge, which explains why they perform better; smaller models lack this discrimination; chain-of-thought prompting effectively reduces heuristic bias and promotes principled reasoning.
Abstract: Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., "Which river is longer, the Danube or the Nile?"), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions that contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model's own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7-8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers models of all sizes towards using the numerical features.
[42] Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank
Shantanu Agarwal,Joel Barry,Steven Fincke,Scott Miller
Main category: cs.CL
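TL;DR: Proposes an LLM-based two-stage retrieve-and-rerank framework for cross-genre authorship attribution that substantially surpasses the previous state of the art on HIATUS's HRS1 and HRS2 benchmarks.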
Details
Motivation: Training strategies from conventional information retrieval do not suit cross-genre authorship attribution, because the task requires ignoring topical cues and focusing on author-specific linguistic patterns that are independent of the text's content.
Method: Uses a two-stage retrieve-and-rerank framework and introduces a targeted data curation strategy so that the reranker can effectively learn author-discriminative linguistic signals.
Result: On HIATUS's HRS1 and HRS2 cross-genre authorship attribution benchmarks, Success@8 improves by 22.3 and 34.4 absolute points, respectively.
Conclusion: The proposed framework and data strategy substantially improve cross-genre authorship attribution, confirming the importance of training methods tailored to the task.
Abstract: Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
[43] Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation
Navreet Kaur,Hoda Ayad,Hayoung Jung,Shravika Mittal,Munmun De Choudhury,Tanushree Mitra
Main category: cs.CL
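TL;DR: This paper proposes CoRUS, a framework that draws on role theory and data from an online opioid recovery community to build a taxonomy of asker roles and generate role-based questions, which are used to evaluate how language model responses differ by user role; vulnerable roles receive more supportive but less knowledge-rich responses.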
Details
Motivation: Language model evaluations often ignore the asker's role. In stigmatized domains such as opioid use disorder, the user's background is critical to whether a response is appropriate, so the needs implied by the role must be considered.
Method: Drawing on role theory and posts from the r/OpiatesRecovery community, builds a taxonomy of three asker roles (patients, caregivers, practitioners) and uses it to simulate 15,321 questions embedding role characteristics, which are used to evaluate five LLMs' response patterns across roles.
Result: The generated questions are highly believable and comparable to real data; for the same question posed under different roles, models respond to vulnerable roles such as patients and caregivers more supportively (+17%) but with less knowledge content (-19%).
Conclusion: The asker's implicit role significantly shapes language model responses, and CoRUS provides a user-centric, role-informed method for evaluating conversational AI.
Abstract: Language model users often embed personal and social context in their questions. The asker's role -- implicit in how the question is framed -- creates specific needs for an appropriate response. However, most evaluations, while capturing the model's capability to respond, often ignore who is asking. This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users' contexts is essential to provide accessible, stigma-free responses. We propose CoRUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions. Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles -- patients, caregivers, practitioners. Next, we use it to simulate 15,321 questions that embed each role's goals, behaviors, and experiences. Our evaluations show that these questions are both highly believable and comparable to real-world data. When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (-19%) in comparison to practitioners. Our work demonstrates how implicitly signaling a user's role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.
[44] FinSight: Towards Real-World Financial Deep Research
Jiajie Jin,Yuyao Zhang,Yimeng Xu,Hongjin Qian,Yutao Zhu,Zhicheng Dou
Main category: cs.CL
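TL;DR: FinSight is a multi-agent financial report generation system that uses the CAVM architecture and an iterative vision-enhanced mechanism to automatically generate high-quality, multimodal professional financial reports.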
Details
Motivation: Current AI systems struggle to fully automate the generation of professional-grade financial reports; challenges in data integration, analytical depth, and visualization quality need to be addressed.
Method: Proposes the FinSight framework, comprising the Code Agent with Variable Memory (CAVM) architecture, an Iterative Vision-Enhanced Mechanism, and a two-stage Writing Framework, supporting data analysis and report generation driven by executable code.
Result: On various company- and industry-level tasks, FinSight significantly outperforms existing baselines, including leading deep research systems, in factual accuracy, analytical depth, and presentation quality.
Conclusion: FinSight offers a feasible path toward automatically generating financial reports that approach human-expert quality.
Abstract: Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi-agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems, in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.
[45] Neuronal Group Communication for Efficient Neural representation
Zhengqi Pei,Qingming Huang,Shuhui Wang
Main category: cs.CL
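TL;DR: This paper proposes Neuronal Group Communication (NGC), a framework that treats a neural network as a dynamical system of interacting neuronal groups and uses a low-rank, modular representation to achieve efficient, interpretable model compression and improved reasoning.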
Details
Motivation: The growing scale of modern neural networks brings performance gains but also inefficiency and poor interpretability, so large neural networks that learn efficient, modular, and interpretable representations are needed.
Method: NGC treats weights as transient interactions between neuronal states, with neural computation carried out through iterative communication among groups of neurons. It introduces a neuronal stability metric grounded in dynamical systems theory (analogous to Lyapunov stability) to analyze how activations contract toward stable patterns, and relates emergent reasoning capabilities to an external driving force or "potential".
Result: Instantiated in large language models, NGC improves performance on complex reasoning benchmarks under moderate compression and outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates, while substantially reducing the number of parameters.
Conclusion: NGC offers a new paradigm for modeling neural networks; its structured neuronal group dynamics not only support model compression and reasoning but may also shed light on how generalization arises in high-dimensional learning systems.
Abstract: The ever-increasing scale of modern neural networks has brought unprecedented performance alongside daunting challenges in efficiency and interpretability. This paper addresses the core question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups rather than a monolithic collection of neural weights. Instead of treating each weight as an independent trainable parameter, NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons. This low-rank, modular representation yields compact models: groups of neurons exchange low-dimensional signals, enabling intra-group specialization and inter-group information sharing while dramatically reducing redundant parameters. By drawing on dynamical systems theory, we introduce a neuronal stability metric (analogous to Lyapunov stability) that quantifies the contraction of neuron activations toward stable patterns during sequence processing. Using this metric, we reveal that emergent reasoning capabilities correspond to an external driving force or "potential", which nudges the neural dynamics away from trivial trajectories while preserving stability. Empirically, we instantiate NGC in large language models (LLMs) and demonstrate improved performance on complex reasoning benchmarks under moderate compression. NGC consistently outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates. We conclude by discussing the broader implications of NGC, including how structured neuronal group dynamics might relate to generalization in high-dimensional learning systems.
[46] Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?
Zhihui Yang,Yupei Wang,Kaijie Mo,Zhe Zhao,Renfen Hu
Main category: cs.CL
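TL;DR: This paper proposes an embodied knowledge understanding benchmark grounded in perceptual theory and evaluates multimodal language models across several sensory modalities. Vision-language models do not outperform text-only models, and performance is worse in the visual dimension, indicating that current models still integrate embodied knowledge inadequately.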
Details
Motivation: To investigate whether visual grounding genuinely improves language models' understanding of embodied knowledge relative to text-only models.
Method: Builds an embodied knowledge understanding benchmark based on perceptual theory from psychology, covering vision, hearing, touch, taste, smell, and interoception, and evaluates 30 state-of-the-art language models through vector comparison and question-answering tasks (over 1,700 questions).
Result: Vision-language models do not outperform text-only models on either task, and performance in the visual dimension is significantly worse than in other sensory dimensions. Further analysis shows that the models' vector representations are easily influenced by word form and frequency and that the models struggle with questions involving spatial perception and reasoning.
Conclusion: Current multimodal models have not yet integrated embodied knowledge effectively; more effective mechanisms are needed to improve their understanding of the physical world.
Abstract: Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models' perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.
[47] ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models
Emily Chang,Niyati Bafna
Main category: cs.CL
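TL;DR: ChiKhaPo is a new benchmark for evaluating the lexical comprehension and generation abilities of generative language models in more than 2,700 languages, far exceeding the language coverage of existing benchmarks and revealing the limitations of current mainstream LLMs on low-resource languages.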
Details
Motivation: Existing LLM benchmarks concentrate on high- and mid-resource languages and mostly on higher-order tasks such as reasoning and generation, overlooking the fact that models lack basic linguistic competence in most of the world's written languages.
Method: Builds the ChiKhaPo benchmark of 8 subtasks of varying difficulty, drawing on existing lexicons, monolingual data, and bitext to evaluate lexical comprehension and generation; tests 6 state-of-the-art models and analyzes performance by language family, resource level, task type, and comprehension versus generation direction.
Result: The 6 state-of-the-art LLMs struggle on ChiKhaPo, with marked weaknesses on low-resource languages; two subtasks cover more than 2,700 languages, far beyond existing benchmarks.
Conclusion: ChiKhaPo provides a new tool for massively multilingual evaluation, reveals the limits of current LLMs' basic linguistic competence, and calls for more attention to model development and evaluation for low-resource languages.
Abstract: Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.
[48] Prompt-MII: Meta-Learning Instruction Induction for LLMs
Emily Xiao,Yixiao Zeng,Ada Chen,Chin-Jou Li,Amanda Bertsch,Graham Neubig
Main category: cs.CL
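TL;DR: Proposes PROMPT-MII, a reinforcement-learning-based instruction induction framework that achieves performance comparable to in-context learning on new tasks while using far fewer tokens.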
Details
Motivation: In-context learning (ICL) is effective, but its inference cost rises as context length grows, so a more efficient adaptation method is needed.
Method: Uses a reinforcement learning framework to meta-learn an instruction induction model, training on more than 3,000 classification datasets to generate compact, descriptive prompts and evaluating on 90 unseen tasks.
Result: PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative) and matches ICL performance while using 3-13x fewer tokens.
Conclusion: PROMPT-MII efficiently generates high-quality instructions, matching ICL performance while substantially reducing inference cost.
Abstract: A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
[49] Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection
Akif Islam,Mohd Ruhul Ameen
Main category: cs.CL
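TL;DR: This study is the first to apply parameter-efficient fine-tuning (PEFT) to Bengali hate speech detection, using LoRA and QLoRA to fine-tune three large language models on the BD-SHS dataset. Training fewer than 1% of the parameters makes experiments feasible on a single consumer-grade GPU; Llama-3.2-3B achieves an F1 score of 92.23%, demonstrating that the approach is practical and reproducible for low-resource languages.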
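The "fewer than 1% of the parameters" figure in this entry follows from how LoRA is defined: a frozen d x k weight matrix receives a trainable rank-r update B @ A with B of shape d x r and A of shape r x k, so only r(d + k) values are trained. A minimal sketch of that arithmetic follows; the dimensions and rank are hypothetical values chosen for illustration, not the settings used in the paper.

```python
def lora_trainable_fraction(d: int, k: int, r: int) -> float:
    """Fraction of parameters trained when a d x k weight is frozen and
    a rank-r update B @ A (B: d x r, A: r x k) is learned instead."""
    return r * (d + k) / (d * k)


# Hypothetical layer size and rank (assumptions, not from the paper).
frac = lora_trainable_fraction(d=4096, k=4096, r=8)
print(f"{frac:.4%}")  # 0.3906%
```

The fraction for a whole model also depends on which weight matrices receive adapters, so this per-matrix figure is only a rough guide.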
Details
Motivation: Hate speech on Bengali social media has surged, disproportionately affecting women and adolescents, but existing detection methods rely on costly full-model fine-tuning or proprietary APIs and lack efficient, replicable solutions.
Method: Applies PEFT with LoRA and QLoRA to adapt three instruction-tuned LLMs (Gemma-3-4B, Llama-3.2-3B, and Mistral-7B) on the BD-SHS dataset (50,281 annotated comments), training fewer than 1% of each model's parameters.
Result: Llama-3.2-3B achieves the highest F1 score, 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%; all models can be trained on a single consumer-grade GPU.
Conclusion: PEFT is an efficient, replicable approach to hate speech detection in Bengali and other low-resource languages, substantially reducing compute cost while maintaining high performance.
Abstract: Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.
[50] Back to Bytes: Revisiting Tokenization Through UTF-8
Amit Moryossef,Clara Meister,Pavel Stepachev,Desmond Elliott
Main category: cs.CL
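TL;DR: Proposes UTF8Tokenizer, a minimalist byte-level tokenizer that maps text directly to the IDs of its UTF-8 bytes and uses C0 control characters for special functions, improving efficiency and compatibility.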
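The byte-to-ID mapping summarized in the TL;DR above can be sketched with nothing but Python's built-in UTF-8 codec. This is a minimal illustration, not the authors' released implementation: the function names are hypothetical, and the choice of NUL (byte 0) as the padding control byte is an assumption made for the example.

```python
PAD_ID = 0  # assumption: NUL, a C0 control byte, stands in for padding


def encode(text: str) -> list[int]:
    """Token IDs are exactly the UTF-8 byte values, so every ID is in 0..255."""
    return list(text.encode("utf-8"))


def pad(ids: list[int], length: int) -> list[int]:
    """Right-pad with the assumed padding byte up to `length`."""
    return ids + [PAD_ID] * (length - len(ids))


def decode(ids: list[int]) -> str:
    """Drop the assumed padding byte, then decode the remaining bytes as UTF-8."""
    return bytes(i for i in ids if i != PAD_ID).decode("utf-8")


print(encode("\t"))                   # the tab byte 0x09 becomes ID 9
print(decode(pad(encode("hi"), 5)))   # padding is stripped on the way back
```

Because every ID fits in one byte, a batch can be stored as 8-bit integers rather than int64, which is consistent with the 8x transfer saving quoted in the entry.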
Details
Motivation: Existing byte-level tokenization methods can introduce out-of-range IDs or auxiliary tokens and lack consistency and efficiency, so a simpler, standardized way to handle text and control information together is needed.
Method: Each byte of the text is mapped directly to the corresponding token ID (e.g., x09 becomes 9), C0 control characters encode special behavior (such as padding and separators), and bit-biased embeddings expose the internal bit structure of each byte to improve training.
Result: Tokenization is 14x faster and host-device transfer is 8x lower than with int64; the embedding table is smaller and can be shared across models; bit-biased embeddings speed up language modeling convergence without adding inference cost.
Conclusion: By following the design principles of UTF-8 and ASCII, UTF8Tokenizer offers an efficient, extensible tokenization scheme that is compatible across models and incurs no extra inference cost.
Abstract: We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021; Pagnoni et al., 2025), our implementation never introduces out-of-range IDs (i.e. there is no token ID 256) or auxiliary tokens: all special behavior (e.g., padding, boundaries, conversation structure, attention segments, tool calling, "thinking" spans, etc.) is encoded using C0 control bytes - just as ASCII was originally designed to embed control information alongside printable text. These design principles yield practical benefits: (1) faster tokenization (14x) and significantly lower host-device transfer (8x less than int64); (2) simple, shareable 256*d embedding tables that can be aligned across models; and (3) a training-time enhancement via bit-biased embeddings, which exposes per-byte bit structure and can be added to the embedding table post-training, removing inference costs. Our HuggingFace-compatible implementation improves language modeling convergence.
[51] Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
Yuval Reif,Guy Kaplan,Roy Schwartz
Main category: cs.CL
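TL;DR: Proposes a compositional reshaping of the vocabulary based on transformation vectors, reducing vocabulary redundancy and widening coverage without hurting model performance.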
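The composition idea in the TL;DR above ("walked" = "walk" + past tense) is plain vector addition. The sketch below uses invented 3-dimensional toy embeddings and estimates the offset from a single word pair; both are assumptions made for illustration, and the entry does not say how the paper actually obtains its transformation vectors.

```python
def add(u: list[float], v: list[float]) -> list[float]:
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u: list[float], v: list[float]) -> list[float]:
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


# Toy embeddings, invented for this example.
emb = {
    "walk": [1.0, 0.0, 2.0],
    "walked": [1.5, 1.0, 2.0],
    "jump": [0.0, 3.0, 1.0],
}

# Assumed estimate of a shared "past tense" offset from one known pair.
past_tense = sub(emb["walked"], emb["walk"])

# Compose a representation for "jumped" without a dedicated vocabulary entry.
jumped = add(emb["jump"], past_tense)
print(jumped)
```

Storing one offset per transformation instead of one row per surface form is what frees vocabulary slots in this scheme.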
Details
Motivation: Standard tokenization treats word-form variants as independent tokens, which makes the vocabulary redundant and limits diversity and multilingual coverage.
Method: Additive transformation vectors in embedding space represent word-form changes (e.g., "walk" + past tense yields "walked"), and the vocabulary is built by composing base forms with transformation vectors.
Result: Validated on multiple LLMs and five languages, the approach removes up to 10% of vocabulary entries, frees space for a richer vocabulary, and extends coverage to out-of-vocabulary words with almost no impact on downstream performance.
Conclusion: The work moves vocabulary design away from enumerating strings and toward a compositional approach that exploits the inherent structure of language.
Abstract: Large language models (LLMs) were shown to encode word form variations, such as "walk"->"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens -- filling the size-capped vocabulary with surface form variants (e.g., "walk", "walking", "Walk"), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors -- additive offsets that yield the appropriate word's representation when applied to the base form word embedding -- in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., "walked" = "walk" + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries -- thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.
[52] Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
Masahiro Kaneko,Zeerak Talat,Timothy Baldwin
Main category: cs.CL
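TL;DR: Proposes a dynamic defense framework based on online learning and reinforcement learning that withstands iterative jailbreak attacks while also improving response quality on harmless tasks.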
Details
Motivation: Existing defenses cannot proactively break the trial-and-error loop of iterative jailbreak attacks, so a defense that adapts dynamically as the attack changes is needed.
Method: Prompts are optimized with reinforcement learning, Past-Direction Gradient Damping (PDGD) is introduced to prevent overfitting, and the defense strategy is updated dynamically through online learning.
Result: Tested on three large language models, the approach significantly outperforms five existing defenses against five iterative jailbreak attacks while also improving response quality on harmless tasks.
Conclusion: The framework defends effectively against iterative jailbreak attacks and improves the model's responses to normal requests while strengthening safety.
Abstract: Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs -- using the model's previous responses to guide each new iteration -- have been found to be a highly effective attack strategy. Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.
[53] DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking
Lanni Bu,Lauren Levin,Amir Zeldes
Main category: cs.CL
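TL;DR: This paper presents DiscoTrack, an LLM benchmark covering 12 languages and four levels of discourse understanding that focuses on implicit information and pragmatic inference across larger texts; evaluation shows these tasks remain challenging for state-of-the-art models.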
Details
Motivation: Existing LLM benchmarks mainly target natural language understanding tasks that extract explicit information from single sentences; more challenging benchmarks for implicit information and pragmatic inference in multilingual, cross-sentence, cross-paragraph, and multi-speaker contexts are missing.
Method: A new benchmark, DiscoTrack, is designed with tasks in 12 languages at four levels of discourse understanding: salience recognition, entity tracking, discourse relations, and bridging inference.
Result: Even current state-of-the-art models perform poorly on these tasks, indicating that they are highly challenging.
Conclusion: DiscoTrack provides a new tool for evaluating LLMs on multilingual and complex discourse understanding and exposes the weaknesses of existing models in handling implicit information and cross-text inference.
Abstract: Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
[54] SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
Qiusi Zhan,Angeline Budiman-Chan,Abdelrahman Zayed,Xingzhi Guo,Daniel Kang,Joo-Kyung Kim
Main category: cs.CL
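TL;DR: This paper studies the safety of LLM-based search agents in open-domain question answering and finds that they are more likely than base LLMs to produce harmful outputs. It proposes SafeSearch, which jointly optimizes safety and utility with multi-objective reinforcement learning, sharply reducing harmfulness while preserving QA performance.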
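One way to picture the joint objective named in the TL;DR above is a final-output reward plus a per-query shaping term that rewards safe search queries and penalizes unsafe ones. The sketch below is a guess at the general shape only: the function name, the +1/-1 scoring, the averaging over queries, and the 0.5 weight are all assumptions made for illustration, not values or formulas taken from the paper.

```python
def shaped_reward(final_reward: float, query_is_safe: list[bool],
                  weight: float = 0.5) -> float:
    """Final-output reward plus an assumed query-level shaping term.

    query_is_safe holds one flag per search query issued in the episode,
    as judged by some external safety classifier (not modeled here).
    """
    if not query_is_safe:
        return final_reward  # no queries issued, so nothing to shape
    # Assumed shaping: +1 per safe query, -1 per unsafe query, averaged.
    shaping = sum(1.0 if safe else -1.0 for safe in query_is_safe) / len(query_is_safe)
    return final_reward + weight * shaping


print(shaped_reward(1.0, [True, True]))
print(shaped_reward(1.0, [True, False]))
```

Scoring queries individually gives the policy a learning signal at the step where an unsafe search is issued, rather than only at the final answer.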
Details
Motivation: Search agents may become less safe as they become more useful, and existing work has overlooked this risk; safety and utility need to be aligned together.
Method: SafeSearch uses multi-objective reinforcement learning that combines a safety/utility reward on the final output with a query-level shaping term that penalizes unsafe queries and rewards safe ones.
Result: SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while matching the QA performance of an agent optimized for utility only, confirming the effectiveness of the query-level reward.
Conclusion: Jointly optimizing safety and utility at both the query level and the output level improves the overall safety of search agents without sacrificing performance.
Abstract: Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
[55] Extended LSTM: Adaptive Feature Gating for Toxic Comment Classification
Noor Islam S. Mohammad
Main category: cs.CL
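TL;DR: Proposes the xLSTM framework, which combines cosine-similarity gating, adaptive feature prioritization, and class rebalancing to outperform BERT on toxic comment detection with fewer parameters and lower latency.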
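Cosine-similarity gating, as named in the TL;DR above, scales an embedding by how closely it points in the direction of a reference vector. The entry does not give the exact gate formula, so the mapping from cosine similarity to a gate in [0, 1] used below, (1 + cos) / 2, is an assumption, as are the function names; in the described model the reference vector would be learned rather than fixed.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_gate(h: list[float], ref: list[float]) -> list[float]:
    """Scale embedding h by an assumed gate g = (1 + cos(h, ref)) / 2 in [0, 1]."""
    g = 0.5 * (1.0 + cosine(h, ref))
    return [g * x for x in h]


ref = [1.0, 0.0]                      # stand-in for the learnable reference vector
print(cosine_gate([1.0, 0.0], ref))   # aligned with ref: passed through
print(cosine_gate([0.0, 2.0], ref))   # orthogonal to ref: halved
```

Embeddings aligned with the reference direction keep their magnitude while opposed ones are suppressed, which matches the amplify/attenuate behavior the entry describes.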
Details
Motivation: Existing models for toxic comment detection are computationally expensive, perform poorly on minority toxicity classes, or lack semantic adaptability.
Method: A learnable reference vector modulates contextual embeddings through cosine similarity; the model fuses multi-source embeddings, a character-level BiLSTM, embedding-space SMOTE, and adaptive focal loss.
Result: On the Jigsaw benchmark the model reaches 96.0% accuracy and 0.88 macro-F1, improving by 33% on the threat class and 28% on the identity-hate class, with 15 times fewer parameters and 50 ms inference latency.
Conclusion: Lightweight, theoretically informed architectures can surpass large pretrained models on imbalanced, domain-specific NLP tasks.
Abstract: Toxic comment detection remains a challenging task, where transformer-based models (e.g., BERT) incur high computational costs and degrade on minority toxicity classes, while classical ensembles lack semantic adaptability. We propose xLSTM, a parameter-efficient and theoretically grounded framework that unifies cosine-similarity gating, adaptive feature prioritization, and principled class rebalancing. A learnable reference vector v ∈ R^d modulates contextual embeddings via cosine similarity, amplifying toxic cues and attenuating benign signals to yield stronger gradients under severe class imbalance. xLSTM integrates multi-source embeddings (GloVe, FastText, BERT CLS) through a projection layer, a character-level BiLSTM for morphological cues, embedding-space SMOTE for minority augmentation, and adaptive focal loss with dynamic class weighting. On the Jigsaw Toxic Comment benchmark, xLSTM attains 96.0% accuracy and 0.88 macro-F1, outperforming BERT by 33% on threat and 28% on identity_hate categories, with 15 times fewer parameters and 50ms inference latency. Cosine gating contributes a +4.8% F1 gain in ablations. The results establish a new efficiency-adaptability frontier, demonstrating that lightweight, theoretically informed architectures can surpass large pretrained models on imbalanced, domain-specific NLP tasks.
[56] Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models
Kyle Cox,Jiawei Xu,Yikun Han,Rong Xu,Tianhao Li,Chi-Yang Hsu,Tianlong Chen,Walter Gerych,Ying Ding
Main category: cs.CL
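TL;DR: This paper studies the prompt sensitivity of large language models (LLMs). It proposes sampling across the semantic concept space (for example with paraphrasing perturbations) to improve uncertainty calibration, and introduces a new uncertainty decomposition metric for black-box LLMs that quantifies how much of a model's uncertainty is due to prompt sensitivity.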
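To make "uncertainty attributable to prompt sensitivity" concrete, the sketch below shows the standard entropy-based decomposition over paraphrases, the kind of baseline the entry says the new metric improves on. It is not the paper's metric. The function names and the toy answer distributions are assumptions made for illustration.

```python
import math
from collections import defaultdict


def entropy(dist: dict[str, float]) -> float:
    """Shannon entropy in bits of an {answer: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def decompose(per_prompt: list[dict[str, float]]) -> tuple[float, float, float]:
    """Return (total, within, between) uncertainty in bits.

    per_prompt holds one answer distribution per paraphrase of the same question.
    total   = entropy of the answer distribution pooled over paraphrases
    within  = average entropy of each paraphrase's own distribution
    between = total - within, the share explained by which paraphrase was used
    """
    n = len(per_prompt)
    pooled: dict[str, float] = defaultdict(float)
    for dist in per_prompt:
        for answer, p in dist.items():
            pooled[answer] += p / n
    total = entropy(pooled)
    within = sum(entropy(d) for d in per_prompt) / n
    return total, within, total - within


# Toy case: each paraphrase gets a confident but different answer.
print(decompose([{"A": 1.0}, {"B": 1.0}]))
```

In the toy case each paraphrase looks fully certain on its own, yet the pooled distribution carries one bit of uncertainty, all of it in the between-paraphrase term. A single-prompt confidence estimate would miss that gap.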
Details
Motivation: LLMs can produce very different outputs for prompts that are semantically equivalent but worded differently, which means their uncertainty does not accurately reflect their understanding of the prompt's meaning; the uncertainty caused by prompt sensitivity needs to be modeled and calibrated better.
Method: Prompt sensitivity is modeled as a type of generalization error, and paraphrasing perturbations sample across the semantic concept space to improve uncertainty calibration; a new uncertainty decomposition metric that models semantic continuities is proposed and improves on entropy-based decomposition.
Result: Paraphrasing perturbations improve uncertainty calibration without reducing accuracy, and the new decomposition metric quantifies the contribution of prompt sensitivity to overall uncertainty.
Conclusion: Prompt sensitivity is an important factor in LLM uncertainty calibration; semantic-space sampling and the new decomposition metric improve calibration, and the results indicate that some current LLMs do not reason consistently about the meaning of their inputs.
Abstract: An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic "concept space" with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
[57] Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
Guoqing Luo,Iffat Maab,Lili Mou,Junichi Yamagishi
Main category: cs.CL
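TL;DR: This paper studies how the internal thinking process of reasoning-based large language models aggregates social stereotypes in social-bias scenarios and identifies two failure patterns that cause bias: stereotype repetition and irrelevant information injection. Building on this, the authors propose a lightweight prompt-based mitigation in which the model reviews its own initial reasoning; it reduces bias on several benchmarks while maintaining or improving accuracy.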
Details
Motivation: Reasoning-based LLMs excel at complex tasks, but their internal thinking process can aggregate social stereotypes and lead to biased outputs. The mechanism behind this is unclear, so their behavior needs systematic study and an effective mitigation strategy.
Method: The reasoning process is analyzed systematically on question answering (BBQ, StereoSet) and open-ended text generation (BOLD), identifying two failure patterns that lead to social bias: stereotype repetition and irrelevant information injection. A prompt-based self-review mechanism then guides the model to check and correct these problems in its initial reasoning.
Result: The prompt-based self-review substantially reduces social bias in model outputs on multiple benchmarks while maintaining or even improving task accuracy.
Conclusion: Social bias arising during reasoning stems mainly from repeated use of stereotypes and the introduction of irrelevant information; targeted prompting for self-review mitigates it without sacrificing performance.
Abstract: While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
[58] Verification-Aware Planning for Multi-Agent Systems
Tianyang Xu,Dan Zhang,Kushan Mitra,Estevam Hruschka
Main category: cs.CL
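TL;DR: VeriMAP is a verification-aware planning framework for multi-agent collaboration that defines subtask verification functions to improve system robustness, interpretability, and coordination reliability.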
Details
Motivation: Multi-agent collaboration often fails because of subtle misalignments in task interpretation, output format, or handoffs, and existing methods lack effective planning and verification mechanisms.
Method: The VeriMAP framework decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria in Python and natural language as subtask verification functions (VFs).
Result: Across multiple datasets, VeriMAP outperforms both single-agent and multi-agent baselines and improves system robustness and interpretability.
Conclusion: Verification-aware planning supports reliable coordination and iterative refinement in multi-agent systems without relying on external labels or annotations.
Abstract: Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
[59] DVAGen: Dynamic Vocabulary Augmented Generation
Wei Du,Nuowei Liu,Jie Wang,Jiahao Kuang,Tao Ji,Xiaoling Wang,Yuanbin Wu
Main category: cs.CL
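TL;DR: This paper presents DVAGen, an open-source unified framework for dynamic vocabulary-augmented language models that supports modern LLMs and improves inference efficiency.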
Details
Motivation: Existing dynamic vocabulary methods for handling novel or out-of-vocabulary words suffer from fragmented codebases, no support for modern LLMs, and poor inference scalability.
Method: DVAGen is a modular framework that integrates CLI and WebUI tools, connects seamlessly with open-source LLMs, and implements batch inference.
Result: The work validates the effectiveness of dynamic vocabulary methods on modern LLMs, significantly improves inference throughput, and supports real-time result inspection.
Conclusion: DVAGen offers a flexible, scalable, and easy-to-use solution for dynamic vocabulary-augmented language models.
Abstract: Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
[60] Rethinking On-policy Optimization for Query Augmentation
Zhichao Xu,Shengyao Zhuang,Xueguang Ma,Bingsen Chen,Yijun Tian,Fengran Mo,Jie Cao,Vivek Srikumar
Main category: cs.CL
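TL;DR: This paper systematically compares prompting-based and reinforcement-learning-based query augmentation, finds that simple training-free augmentation performs very well with strong LLMs, and proposes OPQE, a new method that combines the strengths of both by generating pseudo-documents to improve retrieval.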
Details
Motivation: Existing prompting-based and RL-based query augmentation methods have not been compared under a unified setup and each has limitations; a more efficient, synergistic approach is worth exploring.
Method: Prompting and RL approaches are compared systematically on several information retrieval benchmarks, and a new method, OPQE, is proposed in which the LLM policy generates pseudo-documents that maximize retrieval performance, merging prompt-style generation with RL optimization.
Result: Simple prompting often matches or even beats more complex RL methods, and OPQE outperforms both standalone prompting and RL-based rewriting across tasks.
Conclusion: OPQE, which generates pseudo-documents with a learned policy and so combines the generative flexibility of prompting with the targeted optimization of RL, is the stronger query augmentation paradigm.
Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
[61] When AI companions become witty: Can human brain recognize AI-generated irony?
Xiaohui Rao,Hanlin Wu,Zhenguang G. Cai
Main category: cs.CL
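TL;DR: This study asks whether people adopt the "intentional stance" when comprehending AI-generated irony, that is, whether they treat AI utterances as intentional communication. Behavioral and neural analyses (the P200 and P600 ERP components) show that people attribute AI-generated irony to deliberate communication less often and to computational error more often, with weaker neural responses, indicating that the intentional stance is not fully adopted toward AI. Individuals who perceive AI as more sincere are more inclined to treat its utterances as intentional, so the attribution of intent depends on one's mental model of the AI. Genuine social agency therefore requires not only linguistic ability but also a change in how humans perceive AI intent.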
Details
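Motivation: As large language models are deployed as social agents that produce humor and irony, it is important to ask whether people treat their output as intentional communication or as mere mechanical computation. The question bears on the nature of AI social intelligence and the effectiveness of its interpersonal interaction.
Method: Behavioral and neural responses to irony from human and AI sources are compared using ERP, focusing on the P200 and P600 components, to test whether participants adopt the intentional stance toward AI. The design builds on the semantic reanalysis that irony comprehension requires in order to distinguish intentional contradictions from unintended errors.
Result: Behaviorally, participants were less likely to treat AI-produced incongruity as deliberate communication and more likely to explain it as a computational error. Neurally, AI-generated irony elicited weaker P200 and P600 effects, indicating reduced early incongruity detection and less effortful reanalysis. Individuals who perceived AI as more sincere showed stronger P200 and P600 responses.
Conclusion: Despite the linguistic sophistication of current LLMs, people do not fully adopt the intentional stance toward AI. Whether AI is seen as an intentional social agent depends on the human's mental model of it (for example, its perceived sincerity), so genuine social agency requires not only better AI language performance but also a change in how humans attribute intent to AI.
Abstract: As Large Language Models (LLMs) are increasingly deployed as social agents and trained to produce humor and irony, a question emerges: when encountering witty AI remarks, do people interpret these as intentional communication or mere computational output? This study investigates whether people adopt the intentional stance, attributing mental states to explain behavior, toward AI during irony comprehension. Irony provides an ideal paradigm because it requires distinguishing intentional contradictions from unintended errors through effortful semantic reanalysis. We compared behavioral and neural responses to ironic statements from AI versus human sources using established ERP components: P200 reflecting early incongruity detection and P600 indexing cognitive efforts in reinterpreting incongruity as deliberate irony. Results demonstrate that people do not fully adopt the intentional stance toward AI-generated irony. Behaviorally, participants attributed incongruity to deliberate communication for both sources, though significantly less for AI than human, showing greater tendency to interpret AI incongruities as computational errors. Neural data revealed attenuated P200 and P600 effects for AI-generated irony, suggesting reduced effortful detection and reanalysis consistent with diminished attribution of communicative intent.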
Notably, people who perceived AI as more sincere showed larger P200 and P600 effects for AI-generated irony, suggesting that intentional stance adoption is calibrated by specific mental models of artificial agents. These findings reveal that source attribution shapes neural processing of social-communicative phenomena. Despite current LLMs' linguistic sophistication, achieving genuine social agency requires more than linguistic competence; it necessitates a shift in how humans perceive and attribute intentionality to artificial agents.
[62] Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Jiaqi Leng,Xiang Hu,Junxiong Wang,Jianguo Li,Wei Wu,Yucheng Lu
Main category: cs.CL
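TL;DR: This paper proposes and validates three key design principles that give chunk-based sparse attention models strong training-free length extrapolation, generalizing models trained on a 4K context to 32 million tokens.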
Details
Motivation: Standard Transformers are limited on long contexts by quadratic complexity and poor length extrapolation, while other architectures cannot make full use of the context because of fixed-size memory; the core mechanisms of chunk-based sparse attention models therefore need to be understood and improved.
Method: Using a unified framework and comprehensive ablations, the paper systematically analyzes chunk-based sparse attention models and identifies three key designs: a non-linear chunk encoder with a dedicated CLS token, a bypassing residual path, and enforced selection sparsity during pre-training. It also gives a theoretical motivation for intra-chunk information processing and landmark generation.
Result: The resulting models achieve training-free length extrapolation from a 4K context to 32 million tokens on the RULER and BABILong benchmarks, setting a new state of the art.
Conclusion: Three core principles are key to building high-performing long-context language models: an expressive non-linear chunk encoder, a bypassing residual path that stably integrates global information, and selection sparsity that closes the train-test distribution gap.
Abstract: Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong.
Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
[63] Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
Chenchen Tan,Youyang Qu,Xinghao Li,Hui Zhang,Shujie Cui,Cunjian Chen,Longxiang Gao
Main category: cs.CL
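TL;DR: Proposes Attention-Shifting (AS), a new framework for selective unlearning in large language models that adjusts attention to suppress memorized sensitive information and reduce hallucinated responses while preserving model utility.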
Details
Motivation: Existing unlearning methods trade off model utility against thorough forgetting of sensitive data: aggressive methods damage performance and conservative ones tend to hallucinate, which limits the reliability of LLMs in knowledge-intensive applications.
Method: Two attention-level interventions are designed: importance-aware suppression, applied to the content to be forgotten, and attention-guided retention enhancement, which reinforces attention to semantically essential tokens in the retained data. They are jointly optimized with a dual-loss objective that balances localized unlearning against preservation of overall knowledge.
Result: AS achieves up to 15% higher accuracy on the ToFU benchmark and 10% higher on the TDEC benchmark than state-of-the-art methods while maintaining competitive hallucination-free unlearning, giving a better balance of unlearning effectiveness, generalization, and response reliability.
Conclusion: By shifting attention, the AS framework addresses the utility-safety dilemma in selective unlearning and offers an efficient, reliable approach to safe and controllable unlearning in LLMs.
Abstract: The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition.
Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
[64] StreamingThinker: Large Language Models Can Think While Reading
Junlong Tong,Yingqi Fan,Anhao Zhao,Yunpu Ma,Xiaoyu Shen
Main category: cs.CL
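TL;DR: Proposes StreamingThinker, a streaming thinking paradigm that lets large language models reason while the input is still arriving, markedly reducing latency while keeping performance comparable to batch thinking.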
Details
Motivation: Current LLM reasoning starts only after the entire input is available, which causes high latency and weak attention to earlier information and fits dynamic scenarios poorly. Inspired by how humans think while reading, the goal is to let models reason in real time over the input stream.
Method: A streaming thinking paradigm combines streaming CoT generation, streaming-constraint training, and streaming parallel inference. Streaming reasoning units, order-preserving attention masks, position encoding, and parallel KV caches enable reasoning while reading; the approach is validated on Qwen3 models.
Result: On math, logical, and context-based QA reasoning tasks, StreamingThinker performs comparably to batch thinking while cutting token waiting before the onset of reasoning by 80% and end-to-end latency by more than 60%.
Conclusion: The streaming thinking paradigm improves LLM reasoning efficiency in dynamic scenarios, combining strong performance with low latency and offering a practical path for real applications.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning.
Code will be released at https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
[65] From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models
Zefan Cai,Haoyi Qiu,Haozhe Zhao,Ke Wan,Jiachen Li,Jiuxiang Gu,Wen Xiao,Nanyun Peng,Junjie Hu
Main category: cs.CL
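TL;DR: This paper presents VideoBiasEval, a framework for systematically evaluating social bias in video generation, and finds that alignment tuning strengthens gender and ethnicity biases and makes them more stable.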
Details
Motivation: Text-to-video generation models may amplify social biases as they improve visual quality, and there is no systematic tool for tracing how bias evolves through the alignment pipeline.
Method: Grounded in social bias taxonomies, an event-based prompting strategy separates action semantics from actor attributes, and multi-granular metrics evaluate overall ethnicity bias, gender bias conditioned on ethnicity, distributional shifts, and the temporal persistence of bias.
Result: The work gives the first end-to-end bias analysis from human preference data through reward models to alignment-tuned video diffusion models, finding that alignment strengthens and stabilizes representational bias and yields smoother but more stereotyped videos.
Conclusion: Bias-aware evaluation and mitigation are needed throughout the alignment process to ensure fair and socially responsible video generation.
Abstract: Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
[66] How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design
Mohd Ruhul Ameen,Akif Islam,Abu Saleh Musa Miah,Ayesha Siddiqua,Jungpil Shin
Main category: cs.CL
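TL;DR: This study uses large-scale emotion analysis to examine affective bias in Bengali news, finds that negative emotions (especially anger, fear, and disappointment) dominate, and proposes design ideas for a human-centered news aggregator that visualizes emotional cues.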
Details
Motivation: News media shape public mood not only through what they report but through how they frame it. The same event can carry very different emotional coloring in different outlets, and this implicit affective bias deserves closer study.
Method: Zero-shot inference with the Gemma-3 4B model is run on 300,000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each item.
Result: Negative emotions clearly dominate, with anger, fear, and disappointment the most prominent, and similar events are portrayed with markedly different emotions across outlets.
Conclusion: News reporting shows significant affective bias; a human-centered news aggregator that visualizes emotional cues could help readers recognize hidden affective framing.
Abstract: News media often shape the public mood not only by what they report but by how they frame it. The same event can appear calm in one outlet and alarming in another, reflecting subtle emotional bias in reporting. Negative or emotionally charged headlines tend to attract more attention and spread faster, which in turn encourages outlets to frame stories in ways that provoke stronger reactions. This research explores that tendency through large-scale emotion analysis of Bengali news. Using zero-shot inference with Gemma-3 4B, we analyzed 300,000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each. The findings reveal a clear dominance of negative emotions, particularly anger, fear, and disappointment, and significant variation in how similar stories are emotionally portrayed across outlets. Based on these insights, we propose design ideas for a human-centered news aggregator that visualizes emotional cues and helps readers recognize hidden affective framing in daily news.
[67] Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations
Shahin Atakishiyev,Housam K. B. Babiker,Jiayi Dai,Nawshad Farruque,Teruaki Hayashi,Nafisa Sadaf Hriti,Md Abed Rahman,Iain Smith,Mi-Young Kim,Osmar R. Zaïane,Randy Goebel
Main category: cs.CL
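TL;DR: This paper surveys local explainability and mechanistic interpretability methods for Transformer-based large language models, discusses their application in healthcare and autonomous driving, analyzes how explanations affect user trust, and outlines open problems and future research directions.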
Details
Motivation: Large language models behave as hard-to-understand black boxes when generating content and make hallucination errors, so better explainability is urgently needed to strengthen human trust in them.
Method: A literature review combined with experimental studies on explainability and reasoning in two critical domains, healthcare and autonomous driving, used to systematically analyze existing methods and assess the effect of explanations.
Result: The paper summarizes the main methods and insights of current explainability research, shows how explanations affect the trust of those who receive them, and identifies several unresolved issues.
Conclusion: Improving the explainability of large language models is essential for building human trust. Future work should focus on producing model outputs that are human-aligned, reliable, and explainable.
Abstract: Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.
[68] TaxoAlign: Scholarly Taxonomy Generation Using Language Models
Avishek Lahiri,Yufang Hou,Debarshi Kumar Sanyal
Main category: cs.CL
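TL;DR: Proposes TaxoAlign, a three-phase topic-based instruction-guided method for automatically generating scholarly taxonomies, and shows on the newly built CS-TaxoBench benchmark that it outperforms existing baselines in structural alignment and semantic coherence.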
Details
Motivation: Existing approaches to automatic survey generation do not compare their outputs with taxonomies built by human experts, so structural consistency goes unevaluated.
Method: Proposes TaxoAlign, a three-phase topic-based instruction-guided generation method; builds the CS-TaxoBench benchmark (460 taxonomies extracted from human-written survey papers, plus an additional test set of 80 taxonomies curated from conference survey papers); and designs a stringent automated evaluation framework that measures structural alignment and semantic coherence between generated and human-built taxonomies.
Result: Experiments on CS-TaxoBench show that TaxoAlign outperforms the baselines on nearly all automated metrics and in human evaluation, with stronger structural alignment and semantic coherence.
Conclusion: TaxoAlign narrows the gap between automatically generated and human-built taxonomies, and the proposed evaluation framework and benchmark dataset support future research.
Abstract: Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at https://github.com/AvishekLahiri/TaxoAlign.
[69] Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning
Hajar Bakarou,Mohamed Sinane El Messoussi,Anaïs Ollagnier
Main category: cs.CL
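TL;DR: Using the French multi-party conversation dataset CyberAgressionAdo-Large, this study evaluates text-based and graph-based representation-learning methods for detecting antisocial behavior (such as bullying and abuse) on social media, and finds that the multimodal fusion model mBERT + WD-SGCN performs best.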
Details
Motivation: Prior research has focused largely on platforms such as X and Reddit, while antisocial behavior in multi-party conversations has received little attention and usable data is scarce. This paper aims to fill that gap.
Method: Using the CyberAgressionAdo-Large dataset, the study benchmarks six text-based and eight graph-based models, analyzing lexical cues, interactional dynamics, and their multimodal fusion on three tasks: abuse detection, bullying behavior analysis, and bullying peer-group identification.
Result: Multimodal models outperform unimodal baselines. The late fusion model mBERT + WD-SGCN performs best on abuse detection (F1=0.718), is competitive on peer-group identification (0.286) and bullying analysis (0.606), and handles complex phenomena such as implicit aggression and role transitions.
Conclusion: Multimodal methods that fuse text with graph structure are better suited to recognizing antisocial behavior in complex conversational settings and show stronger contextual understanding.
Abstract: Antisocial behavior (ASB) on social media -- including hate speech, harassment, and cyberbullying -- poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while multi-party conversational settings remain underexplored due to limited data. To address this gap, we use CyberAgressionAdo-Large, a French open-access dataset simulating ASB in multi-party conversations, and evaluate three tasks: abuse detection, bullying behavior analysis, and bullying peer-group identification. We benchmark six text-based and eight graph-based representation-learning methods, analyzing lexical cues, interactional dynamics, and their multimodal fusion. Results show that multimodal models outperform unimodal baselines. The late fusion model mBERT + WD-SGCN achieves the best overall results, with top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying analysis (0.606). Error analysis highlights its effectiveness in handling nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.
[70] Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
Chenghao Zhang,Guanting Dong,Xinyu Yang,Zhicheng Dou
Main category: cs.CL
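TL;DR: This paper proposes Nyx, a unified mixed-modal retriever for Universal Retrieval-Augmented Generation (URAG) that handles queries and documents mixing text and images, and substantially improves vision-language generation through the high-quality mixed-modal dataset NyxQA and a two-stage training framework.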
Details
Motivation: Existing RAG systems mainly target text-only settings and struggle with real-world scenarios that mix text and images; effective mechanisms for coordinating mixed-modal retrieval and generation are lacking.
Method: Proposes Nyx, a unified mixed-modal to mixed-modal retriever; builds a four-stage automated pipeline to generate the NyxQA dataset; and adopts two-stage training, with pre-training followed by supervised fine-tuning on feedback from downstream vision-language models.
Result: Experiments show that Nyx performs competitively on standard text-only RAG benchmarks and significantly improves generation quality in mixed-modal URAG settings, outperforming existing methods.
Conclusion: Nyx addresses key challenges in mixed-modal RAG, performs strongly in realistic mixed-modal scenarios, and moves retrieval-augmented generation toward more general, real-world use.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
[71] The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
Henry Lim,Kwan Hui Lim
Main category: cs.CL
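TL;DR: The study evaluates 20 instruction-tuned large language models under different option-label formats and finds substantial instruction-format bias, with worse performance when explicit guidance is missing or labels are non-numeric, exposing shortcomings of the current instruction-tuning paradigm in atomic instruction-following.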
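The label-format manipulation this entry describes (the same options presented under alphabetic, numeric, or Roman labels) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names `make_labels` and `format_question`, the `"label. text"` layout, and the sample question are all hypothetical, since the listing does not give the paper's prompt template.

```python
import string

# Roman numerals for up to ten options (assumption: MMLU-Pro-style items
# have at most ten choices).
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]


def make_labels(scheme: str, n: int) -> list[str]:
    """Return n option labels in the requested scheme."""
    if not 1 <= n <= len(ROMAN):
        raise ValueError("this sketch supports between 1 and 10 options")
    if scheme == "alphabetic":
        return list(string.ascii_uppercase[:n])
    if scheme == "numeric":
        return [str(i) for i in range(1, n + 1)]
    if scheme == "roman":
        return ROMAN[:n]
    raise ValueError(f"unknown label scheme: {scheme}")


def format_question(question: str, options: list[str], scheme: str) -> str:
    """Render a multiple-choice item; only the labels change across schemes."""
    labels = make_labels(scheme, len(options))
    lines = [question] + [f"{label}. {text}" for label, text in zip(labels, options)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical item, used only to show the three renderings.
    options = ["Mercury", "Venus", "Earth", "Mars"]
    for scheme in ("alphabetic", "numeric", "roman"):
        print(format_question("Which planet is closest to the Sun?", options, scheme))
        print()
```

Because the option text is identical in every rendering, any accuracy difference between schemes can be attributed to the labels alone, which is the comparison the entry reports.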
Details
Motivation: Instruction-tuned large language models show strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions has been overlooked even though it underlies complex instruction-following. A systematic evaluation of basic instruction-following is therefore needed.
Method: The MMLU and MMLU-Pro benchmarks are modified by systematically varying the option-label format (alphabetic, numeric, Roman). Twenty IT-LLMs are evaluated under four paradigms (with explicit instructions, without instructions, with option contents removed, and with three-shot exemplars), and label errors in the generated outputs are analyzed.
Result: (1) Changing the label format causes large performance shifts (for example, Roman labels score 30.45% lower than numeric ones). (2) Without instructions, performance drops further and label sensitivity intensifies. (3) With option contents removed, models generally fall below the random-choice baseline except with numeric labels. (4) Three-shot exemplars do not improve robustness, and label errors persist in generation. Larger models are more accurate but still inconsistent.
Conclusion: The current instruction-tuning paradigm falls short on atomic instruction-following; evaluation methods and training strategies that target basic instruction-following are needed.
Abstract: Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
[72] EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
Numaan Naeem,Abdellah El Mekki,Muhammad Abdul-Mageed
Main category: cs.CL
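TL;DR: This paper introduces EduAdapt, a benchmark of nearly 48,000 grade-labeled science question-answer pairs for evaluating how well large language models adapt their answers to student grade level in K-12 education. Existing models adapt poorly for early-grade students (Grades 1-5), which underscores the need for educational AI systems better matched to stages of cognitive development.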
Details
Motivation: Large language models perform well on academic tasks but struggle to adjust their output to a student's grade level, a key issue in K-12 education. No standardized benchmark evaluates model adaptability across cognitive and developmental stages, so a dedicated dataset is needed to measure and improve grade-level adaptability.
Method: The authors build EduAdapt, a benchmark of nearly 48,000 labeled question-answer pairs covering nine science subjects and Grades 1-12, grouped into four grade levels. They then evaluate a range of open-source LLMs and analyze how well responses adapt at each grade level.
Result: Although larger models perform better overall, they still have marked difficulty producing suitable answers for early-grade students (Grades 1-5), exposing a gap in age-appropriateness.
Conclusion: EduAdapt is the first evaluation framework for grade-level adaptability in LLMs and can help develop educational AI systems better aligned with learners' developmental stages through improved training and prompting strategies.
Abstract: Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students' grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at https://github.com/NaumanNaeem/EduAdapt.
[73] Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine
Jiacheng Xie,Shuai Zeng,Yang Yu,Xiaoting Tang,Guanghui An,Dong Xu
Main category: cs.CL
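TL;DR: This study introduces Ladder-base, the first Traditional Chinese Medicine (TCM) focused large language model trained with Group Relative Policy Optimization (GRPO). The method improves reasoning and factual consistency, and the model outperforms general-purpose and domain-specific models across multiple evaluations.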
Details
Motivation: Existing TCM large language models are limited in alignment, data quality, and evaluation consistency, and struggle to fit TCM's distinctive knowledge system.
Method: Built on Qwen2.5-7B-Instruct and trained on the textual subset of the TCM-Ladder benchmark, using GRPO, a reinforcement learning method that optimizes response selection through intra-group comparisons.
Result: Ladder-base outperforms general-purpose models such as GPT-4, Gemini 2.5, Claude 3, and Qwen3, as well as TCM-specific models such as BenTsao, HuatuoGPT2, and Zhongjing, across multiple reasoning metrics.
Conclusion: GRPO is an effective and efficient strategy for aligning large models with expert-level reasoning in traditional medicine, supporting trustworthy and clinically grounded TCM AI systems.
Abstract: Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
[74] AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages
Mardiyyah Oduwole,Prince Mireku,Fatimo Adebanjo,Oluwatosin Olajide,Mahi Aminu Aliyu,Jekaterina Novikova
Main category: cs.CL
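TL;DR: Proposes AfriCaption, a framework for multilingual image captioning in 20 African languages that comprises a high-quality dataset, a dynamic pipeline, and a dedicated model, advancing multimodal AI for low-resource languages.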
Details
Motivation: Existing research concentrates on high-resource languages, which marginalizes African and other low-resource languages in multimodal AI and limits broad access to the technology.
Method: Builds a semantically aligned dataset on top of Flickr8k using context-aware selection and translation; designs a dynamic, quality-preserving pipeline based on model ensembling and adaptive substitution; and develops AfriCaption, a 0.5B-parameter vision-to-text model that integrates SigLIP and NLLB200.
Result: Establishes the first scalable image-captioning framework for 20 African languages, maintains ongoing data quality, and achieves effective caption generation across low-resource languages.
Conclusion: AfriCaption offers the first scalable multilingual image-captioning solution for low-resource African languages and advances genuinely inclusive multimodal AI.
Abstract: Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.
[75] BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine
Jiacheng Xie,Yang Yu,Yibo Chen,Hanyao Zhang,Lening Zhao,Jiaxuan He,Lei Jiang,Xiaoting Tang,Guanghui An,Dong Xu
Main category: cs.CL
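TL;DR: This study develops BenCao, a ChatGPT-based multimodal TCM assistant that uses natural language instruction tuning to integrate structured knowledge bases, diagnostic data, and expert feedback. It improves the accuracy and interpretability of TCM-domain models in diagnostics, herb recognition, and constitution classification, and has been deployed for users worldwide.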
Details
Motivation: TCM relies on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain models lack multimodal integration, interpretability, and clinical applicability, so an AI system that is closer to expert reasoning and usable in practice is needed.
Method: BenCao is built on ChatGPT using natural language instruction tuning rather than parameter retraining. It integrates a knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework, a chain-of-thought simulation mechanism, and a feedback refinement process involving licensed TCM practitioners, and connects to external APIs for tongue-image classification and multimodal database retrieval.
Result: BenCao outperforms general-domain and TCM-domain models on single-choice question benchmarks and multimodal classification tasks, with higher accuracy in diagnostics, herb recognition, and constitution classification in particular. It has been released as an interactive application on the OpenAI GPTs Store, with nearly 1,000 users globally as of October 2025.
Conclusion: Natural language instruction tuning and multimodal integration can produce a domain model that fits TCM reasoning, offering a workable framework for combining generative AI with traditional medicine and a scalable path to deployment.
Abstract: Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.
[76] Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
Tiancheng Hu,Benjamin Minixhofer,Nigel Collier
Main category: cs.CL
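TL;DR: This paper proposes a simple post-hoc method, interpolating between a model's weights before and after alignment, to mitigate the accuracy drop and calibration loss caused by the alignment tax, and finds Pareto-optimal interpolated models that improve accuracy while recovering calibration.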
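The weight interpolation described in this entry can be written as theta(alpha) = (1 - alpha) * theta_base + alpha * theta_aligned, applied parameter by parameter. The sketch below illustrates that formula on plain Python lists so it runs without any framework; the function name `interpolate_weights`, the dictionary layout, and the toy values are assumptions for illustration, and the listing does not say how the authors choose or sweep alpha.

```python
def interpolate_weights(base, aligned, alpha):
    """Linearly interpolate two checkpoints with identical parameter names.

    alpha = 0.0 returns the pre-alignment weights and alpha = 1.0 the
    post-alignment weights; values in between blend the two.
    """
    if base.keys() != aligned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(aligned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base[name], aligned[name])
        ]
    return merged


if __name__ == "__main__":
    # Toy checkpoints (hypothetical values, flattened to 1-D lists).
    base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
    aligned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}

    # Sweep a few mixing coefficients; in practice each merged model would
    # then be scored for accuracy and calibration.
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, interpolate_weights(base, aligned, alpha))
```

With a real framework the same loop would run over the tensors of two state dictionaries; the point of the sketch is only that merging needs no additional training, which is why the entry describes it as computationally efficient.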
Details
Motivation: Alignment usually lowers task accuracy (the alignment tax). This paper points out that alignment also severely harms calibration and output diversity, which affects reliability and confidence.
Method: Model weight interpolation: linearly combining the weights of the model before and after alignment, and evaluating performance and calibration at different interpolation ratios.
Result: The method finds Pareto-optimal interpolation points at which the model exceeds both parent models in accuracy, substantially recovers the calibration lost during alignment, and increases output diversity.
Conclusion: Simple model merging efficiently mitigates the multiple negative effects of the alignment tax and yields more capable and more reliable models, showing that alignment performance and calibration are not in a strict trade-off.
Abstract: The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
[77] Agentic Reinforcement Learning for Search is Unsafe
Yushi Yang,Shreyansh Padarha,Andrew Lee,Adam Mahdi
Main category: cs.CL
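TL;DR: This study finds that RL-trained search models inherit refusal behavior from instruction tuning, but that this safety is fragile: simple attacks lead them to produce harmful searches and answers, exposing serious safety flaws in current agentic reinforcement learning.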
Details
Motivation: To understand the safety properties of agentic RL models that autonomously call tools such as search, especially their behavior and potential vulnerabilities when facing harmful requests.
Method: Two simple attacks (the Search attack and the Multi-search attack) are used to test how refusal rate, answer safety, and search-query safety change for two model families, Qwen and Llama, with both local and web search.
Result: The attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. They succeed because models generate harmful, request-mirroring queries before they can produce refusal tokens.
Conclusion: Current agentic RL training ignores the harmfulness of queries and rewards only the generation of effective queries, which creates safety vulnerabilities. Safety-aware agentic RL training pipelines are urgently needed.
Abstract: Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.
[78] Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings
Manuela Daniela Danu,George Marica,Constantin Suciu,Lucian Mihai Itu,Oladimeji Farri
Main category: cs.CL
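TL;DR: This study develops several deep contextual embedding models to improve clinical named entity recognition (NER) in the cardiology domain for English, Spanish, and Italian texts, motivated by the scarcity of clinical NER research on low-resource languages.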
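As a quick worked check of the claim in this entry's abstract that the reported F1 scores beat the leaderboard mean and median on every subtask, the margins can be computed directly. The numbers below are copied from the abstract; the dictionary layout and the helper name `margins` are illustrative assumptions.

```python
# F1 scores in percent, copied from the abstract:
# (submitted system, leaderboard mean, leaderboard median)
reported = {
    "SDR": (77.88, 69.61, 75.66),  # Spanish Diseases Recognition
    "SMR": (92.09, 81.22, 90.18),  # Spanish Medications Recognition
    "EMR": (91.74, 89.20, 88.96),  # English Medications Recognition
    "IMR": (88.90, 82.80, 87.76),  # Italian Medications Recognition
}


def margins(scores):
    """Return (margin over mean, margin over median) in F1 points per task."""
    return {
        task: (round(f1 - mean, 2), round(f1 - median, 2))
        for task, (f1, mean, median) in scores.items()
    }


if __name__ == "__main__":
    for task, (over_mean, over_median) in margins(reported).items():
        print(f"{task}: +{over_mean:.2f} over mean, +{over_median:.2f} over median")
```

Every margin is positive, ranging from 1.14 points (IMR over the median) to 10.87 points (SMR over the mean), so the margin over the median is much smaller than the margin over the mean on three of the four subtasks.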
Details
Motivation: Unstructured clinical text in electronic health records holds a great deal of biomedical knowledge, but clinical NER for low-resource languages has received little research attention, a gap this work aims to fill.
Method: Monolingual and multilingual BERT-based models, trained on general-domain text, are used to extract disease and medication mentions from clinical case reports in three languages.
Result: The models reach F1 scores of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR), all above the mean and median of the test leaderboard.
Conclusion: The proposed models perform well on multilingual clinical NER and show effectiveness and practical potential, particularly in low-resource language settings.
Abstract: The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems, including patient diagnosis, disease progression monitoring, treatment effects assessment, prediction of future clinical events, etc. While contextualized language models have demonstrated impressive performance improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.
[79] Evaluating Large Language Models on Urdu Idiom Translation
Muhammad Farmal Khan,Mousumi Akter
Main category: cs.CL
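TL;DR: This paper introduces the first evaluation datasets for Urdu-to-English idiom translation, covering both Native Urdu and Roman Urdu scripts, and evaluates how well several large language models and neural machine translation systems preserve idiomatic and cultural meaning. Prompt engineering improves translation, and Native Urdu input yields better translations than Roman Urdu.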
Details
Motivation: Idiom translation remains difficult for low-resource languages such as Urdu, has received little research attention, and lacks standard evaluation datasets.
Method: The authors build bilingual evaluation datasets covering both Urdu scripts and use automatic metrics (BLEU, BERTScore, COMET, and XCOMET) to evaluate several open-source large language models and NMT systems under different prompting strategies.
Result: Prompt engineering improves idiom translation over direct translation, although differences among prompt types are small. Native Urdu input produces more accurate translations than Roman Urdu.
Conclusion: Text representation substantially affects translation quality. Future work should account for the effect of script on idiom translation in low-resource languages and refine prompting strategies to better preserve cultural meaning.
Abstract: Idiomatic translation remains a significant challenge in machine translation, especially for low-resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu to English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross-script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.
[80] Disparities in Multilingual LLM-Based Healthcare Q&A
Ipek Baris Schlicht,Burcu Sayin,Zhixue Zhao,Frederik M. Labonté,Cesare Barbera,Marco Viviani,Paolo Rosso,Lucie Flek
Main category: cs.CL
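TL;DR: This study systematically analyzes cross-lingual disparities in healthcare Q&A with multilingual large language models (LLMs) and finds substantial gaps across languages in Wikipedia coverage and LLM factual alignment, with English content dominating. Using the newly built Multilingual Wiki Health Care (MultiWikiHealthCare) dataset and Retrieval-Augmented Generation (RAG), it shows that supplying non-English context improves factual alignment for non-English languages, offering a practical route to more equitable multilingual healthcare AI.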
Details
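Motivation: Equitable use of AI in healthcare depends on reliable, balanced multilingual health information. Information quality currently varies across languages, which may undermine the reliability and consistency of LLMs in non-English settings, so cross-lingual disparities need to be assessed systematically and remedies explored.
Method: (1) Build MultiWikiHealthCare, a multilingual healthcare dataset from Wikipedia; (2) analyze differences in healthcare coverage across five languages (English, German, Turkish, Chinese, and Italian); (3) assess how well the answers of several LLMs, prompted in different languages, align factually with Wikipedia content in the corresponding language; (4) run a case study using contextual information and Retrieval-Augmented Generation (RAG) to see how factual consistency for non-English languages can be improved.
Result: There are substantial cross-lingual disparities in both Wikipedia healthcare coverage and LLM factual alignment. LLM answers tend to align more with English Wikipedia even when the prompt is non-English. Providing contextual excerpts from non-English Wikipedia at inference time shifts LLM output toward language-specific, culturally relevant knowledge and improves factual accuracy.
Conclusion: Multilingual healthcare AI systems carry an English-centric bias that affects fairness for non-English users. Combining local-language knowledge sources (such as non-English Wikipedia) with RAG can improve factuality and cultural fit in practice and support more equitable healthcare AI worldwide.
Abstract: Equitable access to reliable health information is vital when integrating AI into healthcare. Yet, information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training source and factuality alignment in LLM answers for multilingual healthcare Q&A across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.
[81] ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts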
Zheyue Tan,Zhiyuan Li,Tao Yuan,Dong Zhou,Weilin Liu,Yueqing Zhuang,Yadong Li,Guowei Niu,Cheng Qin,Zhuyu Yao,Congyi Liu,Haiyang Xu,Boxun Li,Guohao Dai,Bo Zhao,Yu Wang
Main category: cs.CL
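TL;DR: ReXMoE is a new Mixture-of-Experts architecture that uses cross-layer expert reuse and a progressive scaling routing strategy to move beyond conventional layer-local routing, improving model expressiveness and downstream task performance without increasing the parameter count.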
Details
Motivation: Existing MoE architectures are constrained by layer-local routing, which makes it hard to balance expert dimensionality against routing diversity and limits model expressiveness.
Method: Proposes the ReXMoE architecture, which lets routers reuse experts across adjacent layers, together with a progressive scaling routing (PSR) strategy that gradually enlarges the candidate expert pool during training, decoupling expert dimensionality from per-layer budgets.
Result: Experiments on several architectures from 0.5B to 7B parameters show that ReXMoE outperforms conventional MoE on language modeling and downstream tasks while keeping parameter efficiency at the same architectural size.
Conclusion: ReXMoE offers a new, efficient, and scalable MoE design paradigm that enriches expert combinations and improves overall model performance.
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scale Large Language Models (LLMs). MoE boosts the efficiency by activating a subset of experts per token. Recent works show that fine-grained experts substantially enrich the combinatorial flexibility of active experts and enhance model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This requires a careful trade-off between expert dimensionality and routing diversity given fixed parameter budgets. We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating overall parameters. To this end, we propose a new progressive scaling routing (PSR) strategy to gradually increase the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as a new design paradigm for parameter-efficient and scalable MoE-based LLMs.
[82] DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning
Yongxin He,Shan Zhang,Yixuan Cao,Lei Ma,Ping Luo
Main category: cs.CL
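TL;DR: This paper proposes DETree, a new method for detecting hybrid texts produced through human-AI collaboration. By building a Hierarchical Affinity Tree and designing a dedicated loss function, it substantially improves detection performance and generalization across diverse collaboration scenarios.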
Details
Motivation: Most existing AI-text detectors are limited to binary or simple multi-class classification and cope poorly with the complex, varied ways humans and AI collaborate on text, so methods that model the relationships among generation processes more finely are needed.
Method: Proposes DETree, which models the relationships among different text-generation processes as a Hierarchical Affinity Tree and designs a dedicated loss function aligned with that tree. The authors also build RealBench, a comprehensive benchmark dataset covering many collaboration patterns.
Result: DETree outperforms existing methods on hybrid text detection and shows stronger robustness and generalization, particularly in out-of-distribution scenarios and few-shot conditions.
Conclusion: Hierarchical structure modeling and a dedicated loss function improve detection of complex human-AI collaborative texts and demonstrate the potential of training-based approaches in OOD settings.
Abstract: Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at https://github.com/heyongxin233/DETree.
[83] Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents
Yihong Tang,Kehai Chen,Liang Yue,Jinxin Fan,Caishen Zhou,Xiaoguang Li,Yuyang Zhang,Mingming Zhao,Shixiong Kai,Kaiyang Guo,Xingshan Zeng,Wenjing Cun,Lifeng Shang,Min Zhang
Main category: cs.CL
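TL;DR: This paper systematically reviews the technologies, applications, and evaluation methods of LLM-based industry agents, traces an evolution from "process execution systems" to "adaptive social systems," and discusses the three technological pillars of memory, planning, and tool use along with their application challenges and future directions in industry.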
Details
Motivation: Translating research on general-purpose agents into real productivity that drives industry transformation remains a major challenge, which calls for a systematic account of the development path and technical foundations of industry agents. Method: Builds an industry agent capability maturity framework, analyzes the evolution of the three technological pillars (memory, planning, and tool use), and surveys applications across several real-world domains and the limitations of existing evaluation methods. Result: Clarifies the progress of industry agents in digital engineering, scientific discovery, embodied intelligence, and other domains; exposes shortcomings of current evaluation systems in authenticity, safety, and industry specificity; and identifies the challenges to practical deployment. Conclusion: By combining technological evolution with industry practice, the paper offers a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents. Abstract: With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems." First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions.
By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.
[84] Deep Self-Evolving Reasoning
Zihan Liu,Shun Zheng,Xumeng Wen,Yang Wang,Jiang Bian,Mao Yang
Main category: cs.CL
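TL;DR: This paper proposes Deep Self-Evolving Reasoning (DSER), a probabilistic framework for long-chain reasoning that models iterative reasoning as a Markov process. It markedly improves the ability of small open-weight models with weak verification to solve complex tasks: on the AIME 2024-2025 benchmark it solves 5 of 9 previously unsolvable problems and lets an 8B model surpass the single-turn accuracy of its 600B teacher.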
Details
Motivation: Existing verification-refinement frameworks depend on strong verification capabilities, which remain fragile in small open-weight models; this limits their use on very hard reasoning tasks such as Olympiad-level problems. A reasoning paradigm that does not rely on strong verification is needed to push past the current reasoning limits of open-weight models. Method: Models iterative reasoning as a Markov chain in which each step is a stochastic transition in the solution space; as long as the probability of improvement slightly exceeds the probability of degradation, the process eventually converges to a correct solution. Multiple long-horizon, self-evolving reasoning processes are run in parallel and their results aggregated by majority voting, which amplifies the small positive tendency. Result: On the AIME 2024-2025 benchmark, DSER enables DeepSeek-R1-0528-Qwen3-8B to solve 5 of 9 previously unsolvable problems, improves overall performance, and, with majority voting, surpasses the single-turn accuracy of its 600B-parameter teacher. Conclusion: DSER offers a new paradigm for extending model reasoning without strong verification, exposes fundamental limitations of current open-weight models in self-verification, refinement, and stability, and points to a research direction for next-generation reasoning models with intrinsic self-evolving capabilities. Abstract: Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners.
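The Markov-chain argument in the abstract above can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes a simplified two-state chain (the current solution is either correct or incorrect), hypothetical transition probabilities `p_improve` and `p_degrade`, independent parallel chains, and a pessimistic vote in which all wrong chains agree with each other. The function names are illustrative only.

```python
from math import comb


def stationary_correct(p_improve: float, p_degrade: float) -> float:
    """Long-run probability of the 'correct' state in a two-state chain.

    p_improve: assumed probability that an incorrect solution becomes correct.
    p_degrade: assumed probability that a correct solution becomes incorrect.
    """
    return p_improve / (p_improve + p_degrade)


def correct_after(steps: int, p_improve: float, p_degrade: float,
                  start: float = 0.0) -> float:
    """Probability of holding a correct solution after `steps` transitions,
    starting from probability `start` of being correct."""
    p = start
    for _ in range(steps):
        p = p * (1.0 - p_degrade) + (1.0 - p) * p_improve
    return p


def majority_vote_correct(p_single: float, chains: int) -> float:
    """Probability that a strict majority of independent chains is correct.

    Assumes an odd number of chains and that every wrong chain votes for the
    same wrong answer, which is the worst case for majority voting.
    """
    need = chains // 2 + 1
    return sum(
        comb(chains, k) * p_single ** k * (1.0 - p_single) ** (chains - k)
        for k in range(need, chains + 1)
    )


if __name__ == "__main__":
    p_inf = stationary_correct(0.12, 0.08)  # improvement only slightly ahead
    print("single chain, long run:", round(p_inf, 3))
    for n in (1, 5, 65):
        print(n, "chains:", round(majority_vote_correct(p_inf, n), 3))
```

Under these assumptions a single long chain is correct only 60% of the time, yet voting over more parallel chains raises the probability that the aggregated answer is correct, which is the amplification effect the abstract describes.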
By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
[85] Lingua Custodi's participation at the WMT 2025 Terminology shared task
Jingshu Liu,Raheel Qader,Gaëtan Caillaut,Mariam Nakhlé
Main category: cs.CL
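TL;DR: This paper presents LaBSE, a new multilingual sentence embedding model that combines the best methods for monolingual and cross-lingual representation learning; it substantially outperforms LASER on the 112-language Tatoeba dataset and greatly reduces the need for parallel data.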
Details
Motivation: BERT performs well for monolingual sentence embeddings, but BERT-based cross-lingual sentence embeddings have not been fully explored, so effective methods for multilingual sentence embeddings need systematic study. Method: Combines masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax, and uses a pre-trained multilingual language model to learn multilingual sentence embeddings. Result: Achieves 83.7% bi-text retrieval accuracy on Tatoeba (well above LASER's 65.5%) while cutting the required parallel training data by 80%; performs well on monolingual transfer learning tasks and can be used to train competitive NMT models. Conclusion: The proposed model substantially outperforms existing methods for multilingual sentence embeddings, reduces dependence on parallel corpora, and is broadly applicable; the model has been publicly released. Abstract: While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning, BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.
[86] Annotation-Efficient Universal Honesty Alignment
Shiyu Ni,Keping Bi,Jiafeng Guo,Minghao Tang,Jingtong Wu,Zengxin Han,Xueqi Cheng
Main category: cs.CL
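TL;DR: Proposes the EliCal framework, which first elicits internal confidence using self-consistency supervision and then calibrates it with a small number of correctness annotations, achieving near-optimal honesty alignment with only 1k annotations.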
Details
Motivation: Existing training-based calibration methods need large-scale annotated data to achieve universal honesty alignment, which is costly, so an annotation-efficient method is needed. Method: Uses a two-stage framework: the first stage elicits the model's internal confidence from inexpensive self-consistency signals; the second stage calibrates that confidence with a small set of correctness annotations. The HonestyBench benchmark is also released to support large-scale study. Result: With only 1k correctness annotations (0.18% of full supervision), EliCal achieves near-optimal alignment and outperforms the calibration-only baseline on unseen MMLU tasks. Conclusion: EliCal offers a scalable, annotation-efficient solution for universal honesty alignment in large language models. Abstract: Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
[87] SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Tiancheng Hu,Joachim Baumann,Lorenzo Lupo,Dirk Hovy,Nigel Collier,Paul Röttger
Main category: cs.CL
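TL;DR: SimBench is the first large-scale, standardized benchmark for evaluating how well large language models (LLMs) simulate human behavior. It finds that current LLMs have limited simulation ability, that performance scales log-linearly with model size, that more inference-time compute does not help, and that there is an alignment-simulation trade-off and particular difficulty simulating specific demographic groups.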
Details
Motivation: Existing evaluations of LLM human-behavior simulation are fragmented and not comparable, with no unified standard, which makes systematic study of simulation quality difficult; a large-scale, standardized benchmark is needed to support reproducible science in this area. Method: Builds the SimBench benchmark by unifying 20 diverse datasets covering tasks from moral decision-making to economic choice, drawing on data from a large global pool of real human participants; evaluates multiple LLMs on simulating human responses with a unified metric; and analyzes the effects of model size, inference-time compute, and instruction tuning. Result: The best current LLM scores 40.80/100, and performance scales log-linearly with model size; more inference-time compute does not improve results; instruction tuning improves performance on low-entropy questions but hurts it on high-entropy ones; models do worse when simulating specific demographic groups; simulation ability correlates strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). Conclusion: LLMs still have significant limitations in simulating human behavior. SimBench provides a reliable basis for measuring and improving simulation fidelity; future models need stronger deep reasoning and better modeling of diverse populations to simulate more faithfully. Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
[88] OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction
Raghu Vamshi Hemadri,Geetha Krishna Guruju,Kristi Topollai,Anna Ewa Choromanska
Main category: cs.CL
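TL;DR: Proposes a multi-task learning framework that aligns autoregressive large language models with clinical reasoning for cancer treatment outcome prediction, improving both interpretability and predictive performance on the MSK-CHORD dataset.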
Details
Motivation: Models for heterogeneous clinical data need to be both accurate and interpretable, yet existing biomedical LLMs lack structured reasoning capabilities and are hard to rely on for high-stakes decisions. Method: Uses a multi-task learning framework that jointly performs binary survival classification, continuous survival time regression, and natural language rationale generation; compares three alignment strategies: supervised fine-tuning, chain-of-thought prompting, and Group Relative Policy Optimization (GRPO). Result: CoT prompting improves F1 by 6.0 and reduces MAE by 12%; GRPO achieves the best performance on BLEU, ROUGE, and BERTScore; existing models are found to struggle to produce valid reasoning traces because of architectural constraints. Conclusion: Reasoning-aware alignment strategies are essential for multi-task clinical modeling; GRPO markedly improves interpretability and predictive performance, setting a new benchmark for trustworthy LLMs in precision oncology. Abstract: Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.
[89] When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity
Nisrine Rair,Alban Goupil,Valeriu Vrabie,Emmanuel Chochoy
Main category: cs.CL
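TL;DR: Offers a new perspective based on Mapper, a tool from topological data analysis, showing how fine-tuned RoBERTa-Large restructures its embedding space when handling ambiguity: prediction purity is high but alignment with ground-truth labels drops, exposing a tension between the model's structural confidence and label uncertainty.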
Details
Motivation: Traditional scalar metrics such as accuracy do not adequately capture how models internally represent ambiguity, especially when human annotators disagree; a deeper understanding of how models handle ambiguity is needed. Method: Uses Mapper, a topological data analysis tool, to analyze the structure of RoBERTa-Large's embedding space on the MD-Offense dataset and compares it with traditional dimensionality-reduction methods such as PCA and UMAP, revealing the decision regions and clustering properties of the fine-tuned model. Result: More than 98% of connected components have prediction purity of at least 90%, indicating a modular, non-convex embedding structure; on ambiguous data, however, alignment with ground-truth labels drops, revealing a conflict between structural confidence and label uncertainty. Mapper directly exposes phenomena such as decision-boundary collapse and overconfident clusters. Conclusion: Mapper is a powerful diagnostic tool that not only visualizes how models handle ambiguity but also yields new topological metrics that can support proactive modeling strategies for subjective NLP tasks. Abstract: Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and more generally instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly, uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
[90] Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation
Collin Zhang,Fei Huang,Chenhan Yuan,Junyang Lin
Main category: cs.CL
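TL;DR: This paper proposes the Language Confusion Gate (LCG), a lightweight plug-in method that filters tokens during decoding, effectively reducing language confusion in large language models without retraining the model.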
Details
Motivation: Large language models often exhibit language confusion when generating text, and existing methods either require retraining the model or cannot distinguish harmful language confusion from acceptable code-switching. Method: Proposes the Language Confusion Gate (LCG), which, motivated by differences in output token embedding norms and the resulting sampling bias toward high-resource languages, dynamically predicts language families during decoding and applies selective masking; the gate is trained with norm-adjusted self-distillation. Result: Evaluations on several models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, show that LCG significantly reduces language confusion, often by an order of magnitude, without affecting task performance. Conclusion: LCG is an efficient plug-and-play solution that mitigates language confusion without modifying the base model, balancing practicality and performance. Abstract: Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance. Code is available at https://github.com/collinzrj/language_confusion_gate.
[91] HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection
Guang Yang,Yujie Zhu
Main category: cs.CL
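TL;DR: Proposes a hypergraph-based adapter (HGAdapter) that improves pre-trained language models on code-related tasks by capturing high-order correlations in code, such as syntax-tree, lexical, and line relationships.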
Details
Motivation: Existing pre-trained language models do not adequately account for potential high-order data correlations in code, which limits their performance on code understanding tasks. Method: Defines three types of high-order correlations among code tokens (abstract syntax tree family correlation, lexical correlation, and line correlation), proposes a tokens and hyperedges generator, improves the hypergraph neural network architecture, and combines it with adapter tuning to build HGAdapter for enhancing PLMs. Result: In experiments on code summarization across six programming languages and on code clone detection, HGAdapter improves PLM performance on several public datasets. Conclusion: Introducing high-order data correlations helps improve the effectiveness of pre-trained language models on code-related tasks, and HGAdapter shows good generality and enhancement ability. Abstract: Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e., abstract syntax tree family correlation, lexical correlation, and line correlation. We design a tokens and hyperedges generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter can encode high-order data correlations and can be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, covering code summarization in six languages and code clone detection. Our methods improved the performance of PLMs on these datasets to varying degrees. Experimental results validate that introducing high-order data correlations contributes to improved effectiveness.
[92] LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis
Huiyuan Xie,Chenyang Li,Huining Zhu,Chubin Zhang,Yuxiao Ye,Zhenghao Liu,Zhiyuan Liu
Main category: cs.CL
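TL;DR: This paper proposes LawChain, a new framework for explicitly modeling the legal reasoning process in Chinese civil tort cases, builds the evaluation benchmark LawChain$_{eval}$ to systematically assess large language models on tort legal reasoning, and introduces baseline methods that effectively improve performance on related tasks.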
Details
Motivation: Existing computational approaches to legal reasoning mostly rely on generic frameworks that fail to capture the nuanced process of legal reasoning, and they focus mainly on criminal cases while under-modeling civil cases. A fine-grained legal reasoning framework dedicated to civil tort cases is therefore needed. Method: Operationalizes the reasoning process in tort legal analysis as the three-module LawChain framework, with each module containing multiple fine-grained sub-steps; builds the evaluation benchmark LawChain$_{eval}$ on this framework; and designs baseline methods that incorporate LawChain reasoning through prompting or post-training to evaluate and improve large models on tort legal reasoning and other legal tasks. Result: Experiments show that current large models fall short on key steps of tort legal reasoning; the proposed baselines significantly improve performance on this task and on tasks such as legal named-entity recognition and criminal damages calculation, confirming the effectiveness and generalizability of explicitly modeling legal reasoning chains. Conclusion: Explicitly modeling legal reasoning chains, as in LawChain, helps improve language models' reasoning in civil tort cases and other legal analysis tasks and has significant practical value. Abstract: Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training.
We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.
[93] Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models
Yuefeng Peng,Parnian Afshar,Megan Ganji,Thomas Butler,Amir Houmansadr,Mingxian Wang,Dezhi Hong
Main category: cs.CL
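TL;DR: This paper studies unlearning methods for large language models and finds that existing methods, after removing specific knowledge, impair the model's ability to use that knowledge when it is supplied again in context. The authors therefore augment the unlearning objective with a plug-in term that preserves the model's ability to use information re-introduced in the prompt; experiments show that this restores contextual utility while maintaining good forget-set and retain-set performance.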
Details
Motivation: Existing unlearning evaluations overlook a practical need: users may still want the model to use removed information when it is re-introduced in the input prompt. Method: Systematically evaluates six state-of-the-art unlearning methods and proposes an augmented unlearning objective with a plug-in term that preserves the model's ability to use forgotten knowledge when it appears in context. Result: Experiments show that the proposed method restores contextual utility to near original levels with almost no loss in forgetting effectiveness or retain-set performance. Conclusion: The augmented unlearning objective effectively balances forgetting with in-context usability, making unlearning methods more practical. Abstract: Large language models may encode sensitive information or outdated knowledge that needs to be removed, to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near original levels while still maintaining effective forgetting and retain-set utility.
[94] Qomhra: A Bilingual Irish-English Large Language Model
Joseph McInerney
Main category: cs.CL
Details
Abstract: This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM), developed under low-resource constraints, presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner and other LLMs. Google's Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, generating accepted and rejected responses that show near perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification and world knowledge with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.
[95] Towards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues
Liqun He,Manolis Mavrikis,Mutlu Cukurova
Main category: cs.CL
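TL;DR: This paper presents a dialogue analysis approach for identifying effective pedagogical strategies in interactions between learners and large language models (LLMs), arguing that LLM-based educational applications should be evaluated with attention to dialogue dynamics and pedagogical strategies.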
Details
Motivation: Existing evaluations of LLMs in education mostly focus on technical performance or learning outcomes and neglect the interaction between learners and LLMs, so an evaluation approach that pays more attention to the dialogue process is needed. Method: The study analyzes learner-LLM dialogues through dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Result: Preliminary findings reveal potentially effective pedagogical strategies and provide a starting point for follow-up research. Conclusion: Dialogue dynamics and pedagogical strategies deserve attention when evaluating LLM-based educational applications; the study offers a new perspective and methodological framework for understanding human-AI educational dialogue. Abstract: Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting learner-LLM interactions. To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Early insights are outlined as an initial step toward future research. The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies.
[96] QueST: Incentivizing LLMs to Generate Difficult Problems
Hanxu Hu,Xingxing Zhang,Jannis Vamvas,Rico Sennrich,Furu Wei
Main category: cs.CL
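TL;DR: This paper proposes QueST, a framework that generates large-scale, highly difficult synthetic programming problems through difficulty-aware graph sampling and rejection fine-tuning, significantly improving small models on competitive coding tasks and scaling well.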
Details
Motivation: Existing competitive programming datasets are limited in size and depend on human annotation, which restricts how far large models can scale on complex reasoning tasks. A method that automatically generates difficult training data at scale is needed. Method: Proposes the QueST framework, which combines difficulty-aware graph sampling with difficulty-aware rejection fine-tuning to train specialized generators that produce difficult programming problems, which are then used for knowledge distillation or reinforcement learning. Result: After fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, the model surpasses the original Qwen3-8B on LiveCodeBench; with an additional 112K examples, the 8B model matches the performance of the 671B DeepSeek-R1. Conclusion: QueST effectively generates high-quality, difficult programming problems and offers a scalable training path for the reasoning abilities of large language models. Abstract: Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B.
These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
[97] PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition
Nanda Kumar Rengarajan,Jun Yan,Chun Wang
Main category: cs.CL
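TL;DR: Proposes a lightweight few-shot named entity recognition (NER) framework that, through a new instruction tuning template and an entity-preserving data augmentation technique, achieves performance comparable to existing state-of-the-art models in low-resource settings.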
Details
Motivation: In low-resource settings where annotated data is scarce, existing zero-shot and instruction-tuned approaches generalize poorly to domain-specific entities and do not make full use of the limited data available. Method: Designs a simplified instruction tuning template that combines the strengths of prior instruction tuning approaches and takes advantage of the large context windows of LLMs; introduces a strategic data augmentation method that paraphrases the surrounding context while preserving entity information, expanding the training data while keeping semantic relationships intact. Result: Experiments on benchmark datasets show strong performance on few-shot and zero-shot tasks, with the few-shot approach reaching an average F1 of 80.1 on the CrossNER datasets; models trained with paraphrase augmentation improve F1 by up to 17 points over baselines. Conclusion: The proposed framework offers an efficient NER solution for settings with limited annotated data and compute, combining strong performance with practicality. Abstract: Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.
[98] AcademicEval: Live Long-Context LLM Benchmark
Haozhen Zhang,Tao Feng,Pengrui Han,Jiaxuan You
Main category: cs.CL
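TL;DR: Proposes AcademicEval, a live benchmark of long-context generation tasks built from arXiv papers; it requires no manual annotation, avoids label leakage, supports flexible context lengths, and reveals the difficulties LLMs have with hierarchical abstraction and long few-shot reasoning.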
Details
Motivation: Existing long-context LLM benchmarks are limited by fixed context lengths, costly manual annotation, and label leakage during training; a more reliable, live, annotation-free form of evaluation is urgently needed. Method: Builds academic writing tasks (title, abstract, introduction, related work) from arXiv papers and uses a collected co-author graph to supply high-quality, expert-curated few-shot demonstrations, enabling flexible context inputs and live evaluation without label leakage. Result: Experiments show that existing LLMs perform poorly on tasks with hierarchical abstraction levels and struggle with long few-shot inputs, confirming that the benchmark is challenging and effective. Conclusion: AcademicEval provides a more realistic, live, and leakage-safe evaluation platform for long-context LLMs, exposes the weaknesses of current models on complex long-text generation tasks, and points to directions for improving long-context modeling. Abstract: Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs over long-context generation tasks. AcademicEval adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. In particular, AcademicEval features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
[99] Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
Tong Chen,Akari Asai,Luke Zettlemoyer,Hannaneh Hajishirzi,Faeze Brahman
Main category: cs.CL
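TL;DR: Proposes an online reinforcement learning method based on a binary retrieval-augmented reward (binary RAR) that effectively reduces language model hallucinations while avoiding performance loss on open-ended generation and downstream tasks.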
Details
Motivation: Existing approaches to mitigating language model hallucinations often hurt open-ended generation and downstream task performance, which limits their practical use; a method that balances factuality and generation quality is needed. Method: Uses an online reinforcement learning framework with a new binary retrieval-augmented reward (RAR): the reward is 1 only when the model's output is entirely factually correct and 0 otherwise, which steers the model toward factual content. Result: Evaluated on Qwen3 reasoning models, the method reduces hallucination rates in open-ended generation by 39.3%; it yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively, with the model learning to abstain when its knowledge is insufficient; and it does not degrade instruction following, math, or code performance. Conclusion: Binary RAR improves factual accuracy more than supervised training and continuous-reward reinforcement learning without sacrificing other key capabilities, addressing the trade-off between factuality and generation quality. Abstract: Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
[100] Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications
Xiao Ye,Jacob Dineen,Zhaonan Li,Zhikun Xu,Weiyu Chen,Shijie Lu,Yuxi Huang,Ming Shen,Phu Tran,Ji-Eun Irene Yum,Muhammad Ali Khan,Muhammad Umar Afzal,Irbaz Bin Riaz,Ben Zhou
Main category: cs.CL
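TL;DR: Re-examines the evaluation of medical LLMs through a levels-of-autonomy framework (L0-L3), aligns existing benchmarks with the actions and risks at each level, and proposes a level-conditioned evaluation blueprint that moves from score-oriented claims to risk-aware evaluation for real clinical use.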
Details
Motivation: Medical LLMs score well on standard benchmarks, but safe and reliable use in clinical workflows remains difficult, and existing evaluations do not explicitly account for autonomy and risk.
Method: Introduces a levels-of-autonomy framework (L0-L3) covering informational tools, information transformation and aggregation, decision support, and supervised agents, and aligns existing benchmarks and metrics with the permissions and risks of each level.
Result: Proposes a level-conditioned evaluation blueprint that specifies which metrics to select, how to assemble evidence, and how to report claims at each level, and outlines directions for linking evaluation to oversight.
Conclusion: An autonomy-centered evaluation framework helps build credible, risk-aware evaluation of medical LLMs and supports their reliable use in real clinical settings.
Abstract: Medical large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.
[101] Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Austin Xu,Xuan-Phi Nguyen,Yilun Zhou,Chien-Sheng Wu,Caiming Xiong,Shafiq Joty
Main category: cs.CL
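TL;DR: Presents FARE, a family of generative evaluators fine-tuned on large-scale data with a simple supervised approach; it performs strongly across reasoning evaluation tasks and surpasses existing larger specialized evaluators.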
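The abstract below names the training recipe as iterative rejection-sampling supervised finetuning but does not spell out the procedure. One common form of a single rejection-sampling round can be sketched as follows, under stated assumptions: each training item has a known gold verdict, `sample_fn` is a hypothetical stand-in for sampling judgments from the current evaluator, and only samples whose verdict matches the gold label are kept as SFT targets. None of these names come from the paper.

```python
from typing import Callable, Iterable, List, Tuple

Judgment = Tuple[str, str]  # (free-text rationale, final verdict)


def rejection_sample(
    examples: Iterable[Tuple[str, str]],  # (prompt, gold verdict)
    sample_fn: Callable[[str, int], List[Judgment]],
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Keep one sampled judgment per prompt whose verdict matches the gold label."""
    kept = []
    for prompt, gold in examples:
        for rationale, verdict in sample_fn(prompt, k):
            if verdict == gold:
                kept.append((prompt, f"{rationale}\nVerdict: {verdict}"))
                break  # at most one accepted sample per prompt
    return kept


def toy_sampler(prompt: str, k: int) -> List[Judgment]:
    # Stand-in for an evaluator model: alternates between verdicts "A" and "B".
    return [(f"rationale {i} for {prompt}", "A" if i % 2 == 0 else "B") for i in range(k)]


data = [("q1", "A"), ("q2", "B"), ("q3", "C")]
sft_set = rejection_sample(data, toy_sampler, k=4)
print(len(sft_set))  # q3 is dropped because no sample reaches its gold verdict
```

In an iterative scheme, the model finetuned on `sft_set` would serve as the sampler for the next round; that loop is omitted here.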
Details
Motivation: Recent work has focused on training evaluators with new methods such as reinforcement learning and has neglected large-scale, data-driven development. This paper studies how data scaling affects generative evaluators.
Method: Curates a dataset of 2.5M samples spanning five evaluation tasks and multiple reasoning domains, and trains 8B and 20B parameter FARE models with iterative rejection-sampling supervised finetuning (SFT).
Result: FARE-8B competes with larger RL-trained evaluators, and FARE-20B sets a new standard for open-source evaluators, surpassing specialized 70B+ evaluators. As an inference-time reranker, FARE-20B reaches near-oracle performance on MATH; as a verifier in RL training, FARE improves downstream performance by up to 14.1% over string matching; a continually finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
Conclusion: Data scale and diverse task design are critical for generative evaluators; simple SFT on large, high-quality data can clearly outperform more complex RL methods, and FARE sets a new standard for open-source evaluators.
Abstract: Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
[102] Executable Knowledge Graphs for Replicating AI Research
Yujie Luo,Zhuoyun Yu,Xuehai Wang,Yuqi Zhu,Ningyu Zhang,Lanning Wei,Lun Du,Da Zheng,Huajun Chen
Main category: cs.CL
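TL;DR: Proposes Executable Knowledge Graphs (xKG) to improve the code generation of LLM agents that automatically replicate AI research; by integrating technical insights, code snippets, and domain knowledge, xKG substantially improves replication performance.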
Details
Motivation: Existing approaches struggle to generate executable code because of insufficient background knowledge and because retrieval-augmented generation (RAG) fails to capture technical details hidden in referenced papers; they also underuse implementation-level code signals and lack structured knowledge representations.
Method: Builds a modular, pluggable executable knowledge graph (xKG) that automatically extracts technical insights, code snippets, and domain knowledge from scientific literature, supports multi-granular retrieval and reuse, and plugs into several agent frameworks.
Result: On PaperBench, across three agent frameworks and two LLMs, xKG yields performance gains of up to 10.9% (with o3-mini).
Conclusion: xKG is a general and extensible solution that improves LLM agents at replicating AI research, particularly at generating executable code.
Abstract: Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code will be released at https://github.com/zjunlp/xKG.
[103] Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics
Akshara Prabhakar,Roshan Ram,Zixiang Chen,Silvio Savarese,Frank Wang,Caiming Xiong,Huan Wang,Weiran Yao
Main category: cs.CL
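TL;DR: Presents Enterprise Deep Research (EDR), a multi-agent system that automates the processing of unstructured enterprise data into actionable insights, with adaptive query decomposition, several specialized search agents, visualization, and a reflection mechanism; it outperforms state-of-the-art systems on multiple benchmarks.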
Details
Motivation: As information grows explosively, enterprises urgently need to turn unstructured data into coherent, actionable insights, yet existing autonomous agents fall short on domain-specific detail, intent alignment, and enterprise integration.
Method: EDR has five core components: a Master Planning Agent for adaptive query decomposition; four specialized search agents (General, Academic, GitHub, LinkedIn); an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows; a Visualization Agent; and a reflection mechanism that detects knowledge gaps and updates the research direction. It runs either without human intervention or with human-in-the-loop steering.
Result: Internal datasets validate automated report generation, real-time streaming, and seamless enterprise deployment; on open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without human steering.
Conclusion: EDR provides an efficient, extensible multi-agent framework that strengthens enterprise research and decision-making in complex information environments; the authors release the framework and benchmark trajectories to advance research on multi-agent reasoning applications.
Abstract: As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at https://github.com/SalesforceAIResearch/enterprise-deep-research and Dataset at https://huggingface.co/datasets/Salesforce/EDR-200
cs.CV [Back]
[104] ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
Jiani Huang,Amish Sethi,Matthew Kuo,Mayank Keoliya,Neelay Velingker,JungHo Jung,Ser-Nam Lim,Ziyang Li,Mayur Naik
Main category: cs.CV
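TL;DR: Proposes the ESCA framework and the SGClip model, which improve multi-modal large language models as embodied agents through structured spatial-temporal understanding; scene graphs are generated without human annotation, and perception errors drop significantly.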
Details
Motivation: Current MLLM training relies on high-level vision-sound-text pairs and lacks fine-grained alignment between pixel-level visual content and textual semantics, which limits the perception of embodied agents.
Method: Proposes the ESCA framework and the CLIP-based SGClip model, trained with self-supervision on 87K+ videos through a neurosymbolic learning pipeline that derives scene graphs from video-caption pairs, enabling open-domain, promptable scene graph generation and action localization.
Result: SGClip performs strongly on scene graph generation and action localization; ESCA improves both open-source and commercial MLLMs in two embodied environments, reduces perception errors, and lets open-source models surpass proprietary baselines.
Conclusion: ESCA strengthens the embodied perception of multi-modal language models through structured spatial-temporal understanding and offers an effective route to scene understanding without human annotation.
Abstract: Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.
[105] CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
Huiming Yang
Main category: cs.CV
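TL;DR: Proposes CrossRay3D, a sparse multi-modal 3D object detector that improves token representation quality with Ray-Aware Supervision and Class-Balanced Supervision and uses Ray PE to bridge the distribution gap between LiDAR and images; it achieves state-of-the-art results on nuScenes with higher computational efficiency and robustness to missing modalities.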
Details
Motivation: Existing sparse cross-modality detectors overlook the quality of token representation, which leads to poor foreground quality and limited performance, especially on small objects and in complex scenes.
Method: Proposes a Sparse Selector (SS) that comprises Ray-Aware Supervision (RAS), which preserves geometric structure, and Class-Balanced Supervision, which strengthens the representation of small-object classes; designs Ray Positional Encoding (Ray PE) to align the distributions of the LiDAR and image modalities; and combines them into the end-to-end sparse multi-modal detector CrossRay3D.
Result: Reaches 72.4 mAP and 74.7 NDS on the nuScenes benchmark, runs 1.84× faster than other leading methods, and remains robust when LiDAR or camera data are missing.
Conclusion: By improving token selection and representation, CrossRay3D raises both the accuracy and the efficiency of sparse cross-modality detectors and provides better feature representations for downstream tasks.
Abstract: The sparse cross-modality detector offers more advantages than its counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify that the preserved geometric structure and the class distribution are the key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core modules of SS are Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. As a result, SS outperforms other sparse multi-modal detectors in token representation. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned modules into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84× faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.
[106] InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects
Ibrahim Sheikh Mohamed,Abdullah Yahya Abdullah Omaisan
Main category: cs.CV
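TL;DR: Proposes an end-to-end pipeline that uses street CCTV streams for multi-defect detection and segmentation and a vision language model to generate structured repair action plans.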
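The abstract below says the VLM emits a JSON action plan containing incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts, without giving a schema. A downstream consumer would likely want to validate that output before dispatching a crew. The sketch below shows one way to do so; the key names, the example values, and the `parse_action_plan` helper are all hypothetical and chosen only to mirror the fields the abstract lists.

```python
import json

# Hypothetical key names mirroring the fields listed in the abstract.
REQUIRED_KEYS = {
    "incident_description",
    "recommended_tools",
    "dimensions",
    "repair_plan",
    "urgent_alert",
}


def parse_action_plan(raw: str) -> dict:
    """Parse a model-generated plan and reject it if any expected field is missing."""
    plan = json.loads(raw)
    missing = REQUIRED_KEYS - plan.keys()
    if missing:
        raise ValueError(f"action plan is missing keys: {sorted(missing)}")
    return plan


# Illustrative model output; the values are invented for the example.
example = """
{
  "incident_description": "Pothole in the right lane near a pedestrian crossing",
  "recommended_tools": ["asphalt patch", "compactor", "traffic cones"],
  "dimensions": {"width_cm": 60, "length_cm": 85, "depth_cm": 9},
  "repair_plan": ["cordon off lane", "clean cavity", "fill and compact"],
  "urgent_alert": true
}
"""

plan = parse_action_plan(example)
print(plan["recommended_tools"])
```

Rejecting incomplete plans up front keeps malformed model output from reaching maintenance crews, at the cost of occasionally requiring a retry.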
Details
Motivation: Cracks, potholes, and leaks in urban infrastructure threaten public safety. Manual inspection is costly and hazardous, and existing automatic systems usually handle a single defect type or produce unstructured output that cannot directly guide repairs.
Method: Uses YOLO-family object detectors to detect and segment multiple defect types in CCTV streams, then passes the results to a vision language model (VLM) for scene-aware summarization, producing a structured action plan in JSON format.
Result: Experiments on public datasets and captured CCTV clips show that the system accurately identifies diverse defects and produces coherent structured summaries.
Conclusion: The method integrates object detection with a vision language model to provide automated, structured decision support for smart-city infrastructure maintenance; it has potential to scale to city-wide deployment but still faces practical deployment challenges.
Abstract: Infrastructure in smart cities is increasingly monitored by networks of closed-circuit television (CCTV) cameras. Roads, bridges and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi-defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision language model (VLM) for scene-aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review literature on pothole, crack and leak detection, highlight recent advances in large vision language models such as QwenVL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city-wide deployments.
[107] IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection
Zewen Li,Zitong Yu,Qilang Ye,Weicheng Xie,Wei Zhuo,Linlin Shen
Main category: cs.CV
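TL;DR: Proposes IAD-GPT, a new paradigm for industrial anomaly detection based on multimodal large language models (MLLMs) that combines image-level and pixel-level information with rich text semantics; an Abnormal Prompt Generator, a Text-Guided Enhancer, and a Multi-Mask Fusion module improve anomaly detection and segmentation, with state-of-the-art results on MVTec-AD and VisA.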
Details
Motivation: Traditional industrial anomaly detection methods cannot support multi-turn human-machine dialogue or fine-grained descriptions, and methods built on large pre-trained models have not fully exploited those models' potential for anomaly detection.
Method: Proposes IAD-GPT, which uses an Abnormal Prompt Generator (APG) to produce detailed, object-specific anomaly prompts that drive detection and segmentation in a vision-language model such as CLIP; a Text-Guided Enhancer dynamically selects feature enhancement pathways, and a Multi-Mask Fusion module injects masks as expert knowledge to improve pixel-level anomaly perception.
Result: On MVTec-AD and VisA, IAD-GPT achieves state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation.
Conclusion: IAD-GPT combines the semantic understanding of MLLMs with the localization ability of vision models, improving the accuracy and interpretability of industrial anomaly detection while supporting fine-grained descriptions and multi-turn interaction.
Abstract: The robust causal capability of Multimodal Large Language Models (MLLMs) holds the potential for detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods lack the ability to provide multi-turn human-machine dialogues and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pre-trained models have not fully stimulated the ability of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a Text-Guided Enhancer, wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module to incorporate masks as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.
[108] Effect of Reporting Mode and Clinical Experience on Radiologists' Gaze and Image Analysis Behavior in Chest Radiography
Mahta Khoobi,Marc Sebastian von der Stueck,Felix Barajas Ordonez,Anca-Maria Iancu,Eric Corban,Julia Nowak,Aleksandar Kargaliev,Valeria Perelygina,Anna-Sophie Schott,Daniel Pinto dos Santos,Christiane Kuhl,Daniel Truhn,Sven Nebelung,Robert Siepmann
Main category: cs.CV
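TL;DR: Compares three reporting modes for reading bedside chest radiographs, free-text (FT), structured reporting (SR), and AI-assisted structured reporting (AI-SR), and finds that AI-SR performs best on diagnostic accuracy, reporting efficiency, and user experience.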
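The abstract below scores diagnostic accuracy against expert consensus with Cohen's $\kappa$, which corrects raw agreement for the agreement expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. As a reminder of how the statistic is computed, here is a small self-contained sketch. The labels are toy data invented for the example and are unrelated to the study's readings.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy example: 1 = finding present, 0 = finding absent.
reader = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
consensus = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(reader, consensus), 2))  # 0.8 observed vs 0.5 chance agreement gives 0.6
```

With 80% raw agreement and 50% chance agreement, the toy readers reach $\kappa = 0.6$, which illustrates why $\kappa$ values sit well below raw agreement rates.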
Details
Motivation: To examine how structured reporting and AI change radiologists' image analysis behavior and diagnostic performance.
Method: A prospective study with four novice and four non-novice readers, who analyzed 35 bedside chest radiographs per session using a customized viewer and an eye-tracking system; diagnostic accuracy, reporting time, eye-tracking metrics, and user experience were compared across the three reporting modes.
Result: Diagnostic accuracy with AI-SR ($\kappa = 0.71$) was significantly higher than with FT and SR ($P < .001$), and reporting time was shortest ($25 \pm 9$ s); eye-tracking metrics showed more focused visual attention, and AI-SR was the preferred mode. SR was also more efficient than FT.
Conclusion: Structured reporting improves efficiency by guiding visual attention, and AI-prefilled structured reporting further improves diagnostic accuracy and user satisfaction.
Abstract: Structured reporting (SR) and artificial intelligence (AI) may transform how radiologists interact with imaging studies. This prospective study (July to December 2024) evaluated the impact of three reporting modes: free-text (FT), structured reporting (SR), and AI-assisted structured reporting (AI-SR), on image analysis behavior, diagnostic accuracy, efficiency, and user experience. Four novice and four non-novice readers (radiologists and medical students) each analyzed 35 bedside chest radiographs per session using a customized viewer and an eye-tracking system. Outcomes included diagnostic accuracy (compared with expert consensus using Cohen's $\kappa$), reporting time per radiograph, eye-tracking metrics, and questionnaire-based user experience. Statistical analysis used generalized linear mixed models with Bonferroni post-hoc tests at a significance level of $P \le .01$. Diagnostic accuracy was similar in FT ($\kappa = 0.58$) and SR ($\kappa = 0.60$) but higher in AI-SR ($\kappa = 0.71$, $P < .001$). Reporting times decreased from $88 \pm 38$ s (FT) to $37 \pm 18$ s (SR) and $25 \pm 9$ s (AI-SR) ($P < .001$). Saccade counts for the radiograph field ($205 \pm 135$ (FT), $123 \pm 88$ (SR), $97 \pm 58$ (AI-SR)) and total fixation duration for the report field ($11 \pm 5$ s (FT), $5 \pm 3$ s (SR), $4 \pm 1$ s (AI-SR)) were lower with SR and AI-SR ($P < .001$ each). Novice readers shifted gaze towards the radiograph in SR, while non-novice readers maintained their focus on the radiograph. AI-SR was the preferred mode. In conclusion, SR improves efficiency by guiding visual attention toward the image, and AI-prefilled SR further enhances diagnostic accuracy and user satisfaction.
[109] Data-Driven Analysis of Intersectional Bias in Image Classification: A Framework with Bias-Weighted Augmentation
Farjana Yesmin
Main category: cs.CV
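TL;DR: Proposes a data-driven framework for analyzing and mitigating intersectional bias in image classification, consisting of the fairness evaluation framework IFEF and BWA, a data augmentation strategy that adjusts transformation intensity according to subgroup distribution statistics.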
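The digest does not give the rule BWA uses to turn subgroup statistics into transformation intensities. The sketch below shows one plausible scheme purely for illustration: intensity rises linearly as a class-environment subgroup becomes rarer relative to the largest subgroup. The linear formula, the `low` and `high` bounds, and the subgroup counts are assumptions of this example, not the paper's method or data.

```python
from typing import Dict, Hashable


def augmentation_intensity(
    counts: Dict[Hashable, int], low: float = 0.1, high: float = 0.9
) -> Dict[Hashable, float]:
    """Map each subgroup to an intensity in [low, high]; rarer subgroups get more."""
    largest = max(counts.values())
    return {
        group: low + (high - low) * (1 - n / largest)
        for group, n in counts.items()
    }


# Hypothetical (class, environment) subgroup sizes.
subgroup_counts = {
    ("car", "day"): 800,
    ("car", "night"): 200,
    ("dog", "snow"): 50,
}
intensities = augmentation_intensity(subgroup_counts)
for group, value in intensities.items():
    print(group, round(value, 2))
```

Here the best-represented subgroup receives the minimum intensity, and the rarest subgroup approaches the maximum, so augmentation effort concentrates on underrepresented intersections.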
Details
Motivation: Machine learning models trained on imbalanced datasets often show systematic errors that arise from the interaction of multiple attributes, such as object class and environmental conditions (intersectional bias), and effective methods are needed to identify and mitigate them.
Method: Proposes the Intersectional Fairness Evaluation Framework (IFEF), which combines quantitative fairness metrics with interpretability tools to identify bias patterns, and builds on it with Bias-Weighted Augmentation (BWA), a data augmentation strategy that adapts transformation intensities to subgroup distribution statistics.
Result: On Open Images V7, BWA improves accuracy on underrepresented class-environment intersections by up to 24 percentage points and reduces fairness metric disparities by 35%, with statistically significant results (p < 0.05).
Conclusion: The method offers a replicable and effective way to analyze and address intersectional bias in image classification.
Abstract: Machine learning models trained on imbalanced datasets often exhibit intersectional biases: systematic errors arising from the interaction of multiple attributes such as object class and environmental conditions. This paper presents a data-driven framework for analyzing and mitigating such biases in image classification. We introduce the Intersectional Fairness Evaluation Framework (IFEF), which combines quantitative fairness metrics with interpretability tools to systematically identify bias patterns in model predictions. Building on this analysis, we propose Bias-Weighted Augmentation (BWA), a novel data augmentation strategy that adapts transformation intensities based on subgroup distribution statistics. Experiments on the Open Images V7 dataset with five object classes demonstrate that BWA improves accuracy for underrepresented class-environment intersections by up to 24 percentage points while reducing fairness metric disparities by 35%. Statistical analysis across multiple independent runs confirms the significance of improvements (p < 0.05). Our methodology provides a replicable approach for analyzing and addressing intersectional biases in image classification systems.
[110] Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch
Zia Badar
Main category: cs.CV
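TL;DR: Proposes a differentiable neural network quantization method that addresses shortcomings of prior work in gradient propagation and activation quantization, proves convergence, and reaches near full-precision performance with ResNet18 on ImageNet in only 15 training epochs.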
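For readers unfamiliar with shift (logarithmic) quantization, the sketch below shows the basic idea the abstract refers to: each weight is replaced by a signed power of two, so multiplying an integer activation by it reduces to a bit shift. This is the plain round-to-nearest-exponent version, which is not differentiable; it is not the differentiable scheme the paper proposes. The function names and the `bits` and `max_exp` parameters are illustrative assumptions.

```python
import math


def quantize_pow2(w: float, bits: int = 3, max_exp: int = 0) -> float:
    """Round w to a signed power of two, using 2**bits exponent levels ending at max_exp."""
    if w == 0.0:
        return 0.0
    min_exp = max_exp - (2 ** bits - 1)
    exp = round(math.log2(abs(w)))        # nearest exponent in log space
    exp = max(min_exp, min(max_exp, exp))  # clip to the representable range
    return math.copysign(2.0 ** exp, w)


def shift_multiply(x: int, exp: int) -> int:
    """Multiply an integer by 2**exp using only bit shifts (right shifts round down)."""
    return x << exp if exp >= 0 else x >> -exp


print(quantize_pow2(0.3))      # 0.25, i.e. 2**-2
print(quantize_pow2(-0.7))     # -0.5, i.e. -(2**-1)
print(shift_multiply(40, -2))  # 10, same as 40 * 0.25
```

Because every quantized weight is a power of two, inference needs shifts and additions rather than higher-precision multiplications, which is the cost advantage the abstract mentions.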
Details
Motivation: Most existing quantization methods are non-differentiable and rely on manually defined gradients, which undermines learning; in shift/logarithmic quantization of weights and activations they also struggle to balance accuracy and efficiency, and effective activation quantization is especially lacking.
Method: Proposes a differentiable shift/logarithmic quantization method that supports $n$-bit quantization, paired with an optimization algorithm with provable convergence; quantization parameters are learned jointly during training, and the method applies to joint quantization of weights and activations.
Result: With ResNet18 on ImageNet, weight-only quantization loses less than 1% accuracy and converges in 15 epochs; with both weights and activations quantized, accuracy is comparable to SOTA, and inference cost is slightly higher than 1-bit quantization but needs no high-precision multiplication.
Conclusion: The method is an efficient, differentiable quantization scheme with provable convergence that improves training efficiency and quantization accuracy and makes low-bit quantization practical within the shift/logarithmic framework.
Abstract: Quantizing neural networks reduces the compute and memory required for inference. Previous work on quantization lacks two important aspects that this work provides. First, almost all previous work uses a non-differentiable approach, and the derivative is usually set manually in backpropagation, which makes the learning ability of the algorithm questionable; our approach is not only differentiable, we also provide a proof that it converges to the optimal neural network. Second, previous work on shift/logarithmic quantization has either avoided quantizing activations along with weights or achieved lower accuracy. Learning logarithmically quantized values of the form $2^n$ requires a quantization function that scales beyond 1-bit quantization, which is another benefit of our method: it also provides $n$-bit quantization. Tested on image classification with the ImageNet dataset and ResNet18, with weight-only quantization our approach loses less than 1 percent accuracy relative to full precision while taking only 15 epochs to train with shift-bit quantization. With both weight and activation quantization it achieves accuracy comparable to SOTA approaches, again in 15 training epochs, at a slightly higher inference cost (only more CPU instructions) than 1-bit quantization without logarithmic quantization, and without requiring any higher-precision multiplication.
[111] StripRFNet: A Strip Receptive Field and Shape-Aware Network for Road Damage Detection
Jianhan Lin,Yuchu Qin,Shuai Gao,Yikang Rui,Jie Liu,Yanjie Lv
Main category: cs.CV
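TL;DR: Proposes StripRFNet, a deep neural network for road surface damage detection whose three modules target slender cracks and small-scale damage; it achieves state-of-the-art accuracy and real-time efficiency on the RDD2022 benchmark.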
Details
Motivation: Road surface damage threatens traffic safety and hinders sustainable urban development, and existing methods struggle with diverse damage shapes, slender cracks, and small-scale damage.
Method: Designs StripRFNet with a Shape Perception Module (SPM), a Strip Receptive Field Module (SRFM), and a Small-Scale Enhancement Module (SSEM), combining multi-scale feature aggregation, strip convolutions and pooling, a high-resolution feature map, and dynamic upsampling.
Result: On the Chinese subset of RDD2022, StripRFNet improves F1-score, mAP50, and mAP50:95 by 4.4, 2.9, and 3.4 percentage points, respectively; on the full dataset it reaches the highest F1-score of 80.33% while keeping competitive inference speed.
Conclusion: StripRFNet achieves state-of-the-art accuracy and real-time performance in road damage detection, offering an effective tool for intelligent road maintenance and sustainable infrastructure management.
Abstract: Well-maintained road networks are crucial for achieving Sustainable Development Goal (SDG) 11. Road surface damage not only threatens traffic safety but also hinders sustainable urban development. Accurate detection, however, remains challenging due to the diverse shapes of damages, the difficulty of capturing slender cracks with high aspect ratios, and the high error rates in small-scale damage recognition. To address these issues, we propose StripRFNet, a novel deep neural network comprising three modules: (1) a Shape Perception Module (SPM) that enhances shape discrimination via large separable kernel attention (LSKA) in multi-scale feature aggregation; (2) a Strip Receptive Field Module (SRFM) that employs large strip convolutions and pooling to capture features of slender cracks; and (3) a Small-Scale Enhancement Module (SSEM) that leverages a high-resolution P2 feature map, a dedicated detection head, and dynamic upsampling to improve small-object detection. Experiments on the RDD2022 benchmark show that StripRFNet surpasses existing methods. On the Chinese subset, it improves F1-score, mAP50, and mAP50:95 by 4.4, 2.9, and 3.4 percentage points over the baseline, respectively. On the full dataset, it achieves the highest F1-score of 80.33% compared with CRDDC'2022 participants and ORDDC'2024 Phase 2 results, while maintaining competitive inference speed. These results demonstrate that StripRFNet achieves state-of-the-art accuracy and real-time efficiency, offering a promising tool for intelligent road maintenance and sustainable infrastructure management.
[112] ObjectTransforms for Uncertainty Quantification and Reduction in Vision-Based Perception for Autonomous Vehicles
Nishad Sahu,Shounak Sural,Aditya Satish Patil,Ragunathan Rajkumar
Main category: cs.CV
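TL;DR: Proposes ObjectTransforms, a method that quantifies and reduces uncertainty in vision-based object detection by applying object-specific transformations at both training and inference time.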
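The inference-time step described in the abstract below (perturb each detected object, then use the variance of the resulting detection scores as an uncertainty signal) can be sketched as follows. This is a toy illustration under stated assumptions: `score_fn` stands in for re-scoring a crop with the detector, a crop is reduced to a single brightness value, and the `max_variance` threshold is a hypothetical tuning parameter.

```python
import statistics
from typing import Callable, Sequence


def detection_uncertainty(
    crop: float,
    perturbations: Sequence[Callable[[float], float]],
    score_fn: Callable[[float], float],
) -> float:
    """Variance of detection scores across perturbed copies of one object crop."""
    scores = [score_fn(perturb(crop)) for perturb in perturbations]
    return statistics.pvariance(scores)


def keep_detection(crop, perturbations, score_fn, max_variance: float = 0.01) -> bool:
    """Treat detections whose score is unstable under perturbation as likely false positives."""
    return detection_uncertainty(crop, perturbations, score_fn) <= max_variance


# Toy setup: a "crop" is one brightness value; perturbations rescale it.
perturbations = [lambda c: c, lambda c: c * 0.8, lambda c: c * 1.2]


def stable_score(c: float) -> float:
    return 0.9  # score does not react to the perturbation


def unstable_score(c: float) -> float:
    return 0.9 if c > 0.5 else 0.2  # score flips when brightness drops


print(keep_detection(0.6, perturbations, stable_score))    # True
print(keep_detection(0.6, perturbations, unstable_score))  # False
```

The abstract also mentions recovering false negatives with the same signal; that step is omitted from this sketch.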
Details
Motivation: Because of data bias and distribution shift, vision-based detectors in autonomous driving still suffer from uncertainty that undermines perception reliability.
Method: At training time, applies color-space perturbations to individual objects and uses diffusion models to generate diverse pedestrian instances; at inference time, perturbs detected objects and uses the variance of the detection scores to quantify uncertainty in real time, then uses this signal to filter false positives and recover false negatives.
Result: Experiments with YOLOv8 on the NuImages 10K dataset show notable accuracy gains and reduced uncertainty across all classes from the training-time step, and effective separation of false positives from true positives at inference time, improving the precision-recall curve.
Conclusion: ObjectTransforms is a lightweight, effective mechanism for reducing uncertainty during training and quantifying it during inference in vision-based perception, improving the reliability of autonomous driving perception.
Abstract: Reliable perception is fundamental for safety-critical decision making in autonomous driving. Yet, vision-based object detector neural networks remain vulnerable to uncertainty arising from issues such as data bias and distributional shifts. In this paper, we introduce ObjectTransforms, a technique for quantifying and reducing uncertainty in vision-based object detection through object-specific transformations at both training and inference times. At training time, ObjectTransforms performs color space perturbations on individual objects, improving robustness to lighting and color variations. ObjectTransforms also uses diffusion models to generate realistic, diverse pedestrian instances. At inference time, object perturbations are applied to detected objects and the variance of detection scores is used to quantify predictive uncertainty in real time. This uncertainty signal is then used to filter out false positives and also recover false negatives, improving the overall precision-recall curve. Experiments with YOLOv8 on the NuImages 10K dataset demonstrate that our method yields notable accuracy improvements and uncertainty reduction across all object classes during training, while predicting desirably higher uncertainty values for false positives as compared to true positives during inference. Our results highlight the potential of ObjectTransforms as a lightweight yet effective mechanism for reducing and quantifying uncertainty in vision-based perception during training and inference respectively.
[113] Aria Gen 2 Pilot Dataset
Chen Kong,James Fort,Aria Kang,Jonathan Wittmer,Simon Green,Tianwei Shen,Yipu Zhao,Cheng Peng,Gustavo Solaira,Andrew Berkovich,Nikhil Raina,Vijay Baiyya,Evgeniy Oleinik,Eric Huang,Fan Zhang,Julian Straub,Mark Schwesinger,Luis Pesqueira,Xiaqing Pan,Jakob Julian Engel,Carl Ren,Mingfei Yan,Richard Newcombe
Main category: cs.CV
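TL;DR: The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured with Aria Gen 2 glasses; it covers five main scenarios of daily activity and provides raw sensor data and machine perception algorithm outputs to advance perception research.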
Details
Motivation: To advance research on and applications of egocentric perception by providing high-quality, multimodal, real-world data.
Method: Data were captured in real-life settings by the primary subject, Dia'ane, and her friends, each wearing Aria Gen 2 glasses, across five scenarios (cleaning, cooking, eating, playing, and outdoor walking); raw sensor data and outputs from several machine perception algorithms are provided.
Result: A2PD demonstrates the device's robust perception of the wearer, the surrounding environment, and their interactions across different users and conditions; the data are public and come with open-source tools and usage examples.
Conclusion: A2PD is a high-quality multimodal dataset with broad application potential for egocentric perception, human-computer interaction, and smart-glasses technology.
Abstract: The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured using the state-of-the-art Aria Gen 2 glasses. To facilitate timely access, A2PD is released incrementally with ongoing dataset enhancements. The initial release features Dia'ane, our primary subject, who records her daily activities alongside friends, each equipped with Aria Gen 2 glasses. It encompasses five primary scenarios: cleaning, cooking, eating, playing, and outdoor walking. In each of the scenarios, we provide comprehensive raw sensor data and output data from various machine perception algorithms. These data illustrate the device's ability to perceive the wearer, the surrounding environment, and interactions between the wearer and the environment, while maintaining robust performance across diverse users and conditions. The A2PD is publicly available at projectaria.com, with open-source tools and usage examples provided in Project Aria Tools.
[114] GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer
Sayan Deb Sarkar,Sinisa Stekovic,Vincent Lepetit,Iro Armeni
Main category: cs.CV
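TL;DR: Proposes a training-free method built on a pretrained rectified flow model that periodically adds guidance from part-aware appearance and self-similarity losses during sampling; it transfers appearance between 3D assets despite large geometric differences and uses a GPT-based system for more meaningful quality evaluation.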
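The sampling scheme summarized here (follow the pretrained flow, and every few steps nudge the sample along the negative gradient of a differentiable guidance loss) can be illustrated with a scalar toy problem. The sketch below is only a schematic: the real method operates on 3D asset latents with part-aware losses, whereas here the state is one number, the velocity field is a constant, and `guide_every` and `guide_lr` are hypothetical parameters.

```python
from typing import Callable


def sample_with_guidance(
    x: float,
    velocity_fn: Callable[[float, float], float],
    loss_grad_fn: Callable[[float], float],
    steps: int = 10,
    guide_every: int = 2,
    guide_lr: float = 0.1,
) -> float:
    """Euler-integrate a flow from t=0 to t=1, adding a guidance step every few iterations."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # Euler step along the (pretrained) flow
        if (i + 1) % guide_every == 0:
            x = x - guide_lr * loss_grad_fn(x)  # periodic gradient step on the guidance loss
    return x


def constant_velocity(x: float, t: float) -> float:
    return 2.0  # toy flow: carries a sample from x to x + 2


def target_grad(x: float) -> float:
    return x - 3.0  # gradient of the guidance loss 0.5 * (x - 3)**2


unguided = sample_with_guidance(0.0, constant_velocity, target_grad, guide_lr=0.0)
guided = sample_with_guidance(0.0, constant_velocity, target_grad)
print(unguided, guided)  # guidance pulls the endpoint from about 2 toward the target 3
```

Applying guidance only periodically keeps most of the trajectory on the pretrained flow while still steering the result toward the guidance objective.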
Details
Motivation: Existing methods perform poorly when the input and appearance objects differ greatly in geometry, and directly applying a 3D generative model gives unsatisfying results, so a more robust appearance transfer method is needed.
Method: Starting from a pretrained rectified flow model conditioned on image or text, the method periodically injects differentiable guidance losses during sampling, including part-aware appearance and self-similarity losses, achieving appearance transfer without training.
Result: The method outperforms baselines at transferring texture and geometric detail, both qualitatively and quantitatively. Because traditional metrics do not suit the task, outputs are ranked with a GPT-based system, whose validity is confirmed by a user study.
Conclusion: The method handles 3D appearance transfer under significant geometric differences, is general enough to extend to other diffusion models and guidance functions, and its evaluation protocol tracks human perception more closely.
Abstract: Transferring appearance to 3D assets using different representations of the appearance object, such as images or text, has garnered interest due to its wide range of applications in industries like gaming, augmented reality, and digital content creation. However, state-of-the-art methods still fail when the geometry between the input and appearance objects is significantly different. A straightforward approach is to directly apply a 3D generative model, but we show that this ultimately fails to produce appealing results. Instead, we propose a principled approach inspired by universal guidance. Given a pretrained rectified flow model conditioned on image or text, our training-free method interacts with the sampling process by periodically adding guidance. This guidance can be modeled as a differentiable loss function, and we experiment with two different types of guidance including part-aware losses for appearance and self-similarity. Our experiments show that our approach successfully transfers texture and geometric details to the input 3D asset, outperforming baselines both qualitatively and quantitatively. We also show that traditional metrics are not suitable for evaluating the task due to their inability to focus on local details and compare dissimilar inputs in the absence of ground truth data. We thus evaluate appearance transfer quality with a GPT-based system objectively ranking outputs, ensuring robust and human-like assessment, as further confirmed by our user study. Beyond showcased scenarios, our method is general and could be extended to different types of diffusion models and guidance functions.
[115] C-arm Guidance: A Self-supervised Approach To Automated Positioning During Stroke Thrombectomy
Ahmad Arrabi,Jay Hwasung Jung,J Le,A Nguyen,J Reed,E Stahl,Nathan Franssen,Scott Raymond,Safwan Wshah
Main category: cs.CV
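TL;DR: Proposes a self-supervised deep learning framework that classifies skeletal landmarks using a regression-based pretext task, which significantly improves downstream classification; future work will extend it toward fully autonomous C-arm control during stroke thrombectomy.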
Details
Motivation: To make thrombectomy for ischemic stroke more efficient and safer and less dependent on personnel and resources, critical steps of the procedure need to be automated.
Method: Designs a self-supervised learning framework around a regression-based pretext task to classify various skeletal anatomical landmarks, and uses it to improve downstream classification.
Result: Experiments show that the model outperforms existing methods on both regression and classification, and that the positional pretext task significantly improves classification performance.
Conclusion: The self-supervised framework improves recognition of key anatomical structures and lays the groundwork for fully autonomous C-arm control during thrombectomy.
Abstract: Thrombectomy is one of the most effective treatments for ischemic stroke, but it is resource and personnel-intensive. We propose employing deep learning to automate critical aspects of thrombectomy, thereby enhancing efficiency and safety. In this work, we introduce a self-supervised framework that classifies various skeletal landmarks using a regression-based pretext task. Our experiments demonstrate that our model outperforms existing methods in both regression and classification tasks. Notably, our results indicate that the positional pretext task significantly enhances downstream classification performance. Future work will focus on extending this framework toward fully autonomous C-arm control, aiming to optimize trajectories from the pelvis to the head during stroke thrombectomy procedures. All code used is available at https://github.com/AhmadArrabi/C_arm_guidance
[116] DuetMatch: Harmonizing Semi-Supervised Brain MRI Segmentation via Decoupled Branch Optimization
Thanh-Huy Nguyen,Hoang-Thien Nguyen,Vi Vu,Ba-Thinh Lam,Phat Huynh,Tianyang Wang,Xingjian Li,Ulas Bagci,Min Xu
Main category: cs.CV
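TL;DR: Proposes DuetMatch, a dual-branch semi-supervised framework with asynchronous optimization for medical image segmentation, which improves performance through Decoupled Dropout Perturbation, Pair-wise CutMix Cross-Guidance, and Consistency Matching.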
Details
Motivation: Annotated data is scarce in medical imaging, which makes semi-supervised learning attractive, but existing teacher-student frameworks suffer from convergence and stability problems under joint optimization.
Method: Proposes the DuetMatch framework, which uses dual-branch asynchronous optimization (each branch optimizes either the encoder or the decoder), introduces Decoupled Dropout Perturbation to strengthen regularization, designs Pair-wise CutMix Cross-Guidance to increase model diversity, and uses Consistency Matching to reduce confirmation bias from pseudo-labels.
Result: Experiments on brain MRI segmentation datasets such as ISLES2022 and BraTS show that DuetMatch consistently outperforms state-of-the-art methods and is more effective and robust across a range of semi-supervised segmentation scenarios.
Conclusion: Through asynchronous optimization and several complementary strategies, DuetMatch improves the performance and stability of semi-supervised medical image segmentation and suits settings where labeled data is scarce.
Abstract: The limited availability of annotated data in medical imaging makes semi-supervised learning increasingly appealing for its ability to learn from imperfect supervision. Recently, teacher-student frameworks have gained popularity for their training benefits and robust performance. However, jointly optimizing the entire network can hinder convergence and stability, especially in challenging scenarios. To address this for medical image segmentation, we propose DuetMatch, a novel dual-branch semi-supervised framework with asynchronous optimization, where each branch optimizes either the encoder or decoder while keeping the other frozen. To improve consistency under noisy conditions, we introduce Decoupled Dropout Perturbation, enforcing regularization across branches. We also design Pair-wise CutMix Cross-Guidance to enhance model diversity by exchanging pseudo-labels through augmented input pairs. To mitigate confirmation bias from noisy pseudo-labels, we propose Consistency Matching, refining labels using stable predictions from frozen teacher models. Extensive experiments on benchmark brain MRI segmentation datasets, including ISLES2022 and BraTS, show that DuetMatch consistently outperforms state-of-the-art methods, demonstrating its effectiveness and robustness across diverse semi-supervised segmentation scenarios.
[117] Automated C-Arm Positioning via Conformal Landmark Localization
Ahmad Arrabi,Jay Hwasung Jung,Jax Luo,Nathan Franssen,Scott Raymond,Safwan Wshah
Main category: cs.CV
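TL;DR: Presents a pipeline that autonomously navigates a C-arm to predefined anatomical landmarks from X-ray images, trained with a probabilistic loss and skeletal pose regularization; validation on synthetic data shows strong localization accuracy and well-calibrated prediction bounds across multiple architectures.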
Details
Motivation: Manual C-arm alignment in clinical practice increases radiation exposure and procedural delays, so an accurate and reliable automated C-arm positioning method is needed.
Method: Given an X-ray image as input, the model predicts a 3D displacement vector toward each target landmark on the body; training combines a probabilistic loss with skeletal pose regularization, and conformal prediction is used to calibrate the model's aleatoric and epistemic uncertainties, producing 3D confidence regions.
Result: Validation on a synthetic X-ray dataset generated with DeepDRR shows good localization accuracy and calibrated prediction bounds across multiple network architectures.
Conclusion: The pipeline is a promising component for safe and reliable autonomous C-arm systems, improving the efficiency of interventional procedures and reducing radiation exposure.
Abstract: Accurate and reliable C-arm positioning is essential for fluoroscopy-guided interventions. However, clinical workflows rely on manual alignment that increases radiation exposure and procedural delays. In this work, we present a pipeline that autonomously navigates the C-arm to predefined anatomical landmarks utilizing X-ray images. Given an input X-ray image from an arbitrary starting location on the operating table, the model predicts a 3D displacement vector toward each target landmark along the body. To ensure reliable deployment, we capture both aleatoric and epistemic uncertainties in the model's predictions and further calibrate them using conformal prediction. The derived prediction regions are interpreted as 3D confidence regions around the predicted landmark locations. The training framework combines a probabilistic loss with skeletal pose regularization to encourage anatomically plausible outputs. We validate our approach on a synthetic X-ray dataset generated from DeepDRR. Results show not only strong localization accuracy across multiple architectures but also well-calibrated prediction bounds. These findings highlight the pipeline's potential as a component in safe and reliable autonomous C-arm systems. Code is available at https://github.com/AhmadArrabi/C_arm_guidance_APAH
[118] Cost Savings from Automatic Quality Assessment of Generated Images
Xavier Giro-i-Nieto,Nefeli Andreou,Anqi Liang,Manel Baradad,Francesc Moreno-Noguer,Aleix Martinez
Main category: cs.CV
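TL;DR: Proposes an automatic image quality assessment (IQA) pre-filtering method that substantially reduces the cost of post-processing generated images, achieving a 51.61% cost saving in a background inpainting use case.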
Details
Motivation: Generated images do not yet meet the quality standards of traditional photography, so manual quality assessment is required, which is costly and inefficient.
Method: Derives a formula for estimating the cost savings of an automatic IQA engine and validates it on a background inpainting task using an AutoML solution.
Result: Introducing an automatic pre-filtering stage raises the overall quality of the images sent for review and yields a 51.61% cost saving in a practical use case.
Conclusion: Automatic IQA pre-filtering effectively reduces assessment cost in generated-image production pipelines and has practical value.
Abstract: Deep generative models have shown impressive progress in recent years, making it possible to produce high quality images with a simple text prompt or a reference image. However, state-of-the-art technology does not yet meet the quality standards offered by traditional photographic methods. For this reason, production pipelines that use generated images often include a manual stage of image quality assessment (IQA). This process is slow and expensive, especially because of the low yield of automatically generated images that pass the quality bar. The IQA workload can be reduced by introducing an automatic pre-filtering stage that increases the overall quality of the images sent to review and, therefore, reduces the average cost required to obtain a high quality image. We present a formula that estimates the cost savings depending on the precision and pass yield of a generic IQA engine. This formula is applied in a use case of background inpainting, showcasing a significant cost saving of 51.61% obtained with a simple AutoML solution.
[119] Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
Zheng Huang,Enpei Zhang,Yinghao Cai,Weikang Qiu,Carl Yang,Elynn Chen,Xiang Zhang,Rex Ying,Dawei Zhou,Yujun Yan
Main category: cs.CV
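TL;DR: Proposes PRISM, a model that reconstructs visual stimuli by projecting fMRI signals into a structured text space, using an object-centric diffusion module and an attribute-relationship search module to improve reconstruction quality; it outperforms existing methods on real-world datasets, reducing perceptual loss by up to 8%.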
Details
Motivation: To explore which latent space best supports reconstructing visual stimuli from fMRI signals, and to address the shortcomings of existing methods in representing the compositional structure of visual stimuli (objects, attributes, and relationships).
Method: Proposes PRISM, which maps fMRI signals into a structured text space; it introduces an object-centric diffusion module that generates images by composing objects, and an attribute-relationship search module that identifies the key attributes and relationships best matching the neural activity.
Result: Experiments on real-world datasets show the method significantly outperforms prior work, reducing perceptual loss by up to 8% and validating the structured text space for fMRI-to-image reconstruction.
Conclusion: A structured text space is an effective intermediate representation connecting fMRI signals and image reconstruction; adapting the text representation and the generative model to that structure helps decode visual information from the brain.
Abstract: Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text-image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object-centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute-relationship search module that automatically identifies key attributes and relationships that best align with the neural activity.
Extensive experiments on real-world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.
[120] Data-Centric AI for Tropical Agricultural Mapping: Challenges, Strategies and Scalable Solutions
Mateus Pinto da Silva,Sabrina P. L. P. Correa,Hugo N. Oliveira,Ian M. Nunes,Jefersson A. dos Santos
Main category: cs.CV
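TL;DR: Advocates a Data-Centric AI (DCAI) approach to improve data quality and model robustness in large-scale agricultural remote-sensing mapping of tropical regions.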
Details
Motivation: Remote-sensing mapping of tropical agriculture faces low-quality annotated data, high labeling costs, high data variability, and difficult regional generalization, all of which limit traditional model-centric approaches.
Method: Reviews and prioritizes techniques such as confident learning, core-set selection, data augmentation, and active learning, and proposes a practical DCAI pipeline built from the 9 most mature methods.
Result: Highlights the readiness and suitability of 25 strategies for large-scale agricultural mapping pipelines, and outlines a DCAI pipeline aimed at improving data quality and model performance.
Conclusion: A data-centric approach addresses the challenges of tropical agricultural remote sensing more effectively, improving the scalability and practical applicability of models.
Abstract: Mapping agriculture in tropical areas through remote sensing presents unique challenges, including the lack of high-quality annotated data, the elevated costs of labeling, data variability, and regional generalisation. This paper advocates a Data-Centric Artificial Intelligence (DCAI) perspective and pipeline, emphasizing data quality and curation as key drivers for model robustness and scalability. It reviews and prioritizes techniques such as confident learning, core-set selection, data augmentation, and active learning. The paper highlights the readiness and suitability of 25 distinct strategies in large-scale agricultural mapping pipelines. The tropical context is of high interest, since high cloudiness, diverse crop calendars, and limited datasets limit traditional model-centric approaches. This tutorial outlines practical solutions as a data-centric approach for curating and training AI models better suited to the dynamic realities of tropical agriculture. Finally, we propose a practical pipeline using the 9 most mature and straightforward methods that can be applied to a large-scale tropical agricultural mapping project.
[121] StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
Nyle Siddiqui,Rohit Gupta,Sirnam Swetha,Mubarak Shah
Main category: cs.CV
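TL;DR: Proposes a flexible training method that leverages and improves the spatio-temporal adaptability of state space models (SSMs) for video understanding; the resulting StretchySnake model significantly outperforms transformer and other SSM baselines on multiple action recognition benchmarks.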
Details
Motivation: Existing training methods for video understanding are designed mainly for transformers and fail to exploit the linear complexity and hidden-state recurrence that make SSMs attractive for long sequences; as a result, models degrade on spatial and temporal resolutions unseen during training, a problem termed spatio-temporal inflexibility.
Method: Proposes a flexible training method that samples videos at varying spatial and temporal resolutions during training and dynamically interpolates model weights to fit any spatio-temporal scale, strengthening the adaptability of SSMs. Five flexible-training variants are compared to identify the most effective strategy for video SSMs.
Result: StretchySnake outperforms transformer and other SSM baselines by up to 28% on short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, and adapts well to fine-grained action recognition (SSV2, Diving-48).
Conclusion: The method offers a simple drop-in training recipe that makes video SSMs more robust, resolution-agnostic, and efficient across diverse action recognition scenarios.
Abstract: State space models (SSMs) have emerged as a competitive alternative to transformers in various tasks. Their linear complexity and hidden-state recurrence make them particularly attractive for modeling long sequences, whereas attention becomes quadratically expensive. However, current training methods for video understanding are tailored towards transformers and fail to fully leverage the unique attributes of SSMs. For example, video models are often trained at a fixed resolution and video length to balance the quadratic scaling of attention cost against performance. Consequently, these models suffer from degraded performance when evaluated on videos with spatial and temporal resolutions unseen during training; a property we call spatio-temporal inflexibility. In the context of action recognition, this severely limits a model's ability to retain performance across both short- and long-form videos. Therefore, we propose a flexible training method that leverages and improves the inherent adaptability of SSMs. Our method samples videos at varying temporal and spatial resolutions during training and dynamically interpolates model weights to accommodate any spatio-temporal scale. This instills our SSM, which we call StretchySnake, with spatio-temporal flexibility and enables it to seamlessly handle videos ranging from short, fine-grained clips to long, complex activities. We introduce and compare five different variants of flexible training, and identify the most effective strategy for video SSMs.
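As a rough illustration of interpolating weights to a new temporal scale, the sketch below linearly resamples a learned temporal embedding table to a different number of frames. The abstract does not say which weights are interpolated or how, so the choice of a positional table, linear interpolation, and the names `interpolate_embeddings` and `temporal_emb` are all assumptions made for illustration.

```python
import numpy as np

def interpolate_embeddings(emb, new_len):
    """Linearly resample a (length, dim) embedding table to (new_len, dim)."""
    old_len, dim = emb.shape
    old_pos = np.linspace(0.0, 1.0, old_len)   # normalized original positions
    new_pos = np.linspace(0.0, 1.0, new_len)   # normalized target positions
    columns = [np.interp(new_pos, old_pos, emb[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
temporal_emb = rng.normal(size=(16, 8))   # hypothetical table learned for 16 frames

# During flexible training, each batch could use a different clip length.
for num_frames in (8, 16, 32):
    resized = interpolate_embeddings(temporal_emb, num_frames)
    print(num_frames, resized.shape)
```

The same resampling idea extends to spatial grids by interpolating along height and width instead of time.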
On short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, StretchySnake outperforms transformer and SSM baselines alike by up to 28%, with strong adaptability to fine-grained actions (SSV2, Diving-48). Therefore, our method provides a simple drop-in training recipe that makes video SSMs more robust, resolution-agnostic, and efficient across diverse action recognition scenarios.
[122] VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction
Djamel Eddine Boukhari
Main category: cs.CV
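TL;DR: Proposes VM-BeautyNet, a novel heterogeneous ensemble architecture that combines the strengths of a Vision Transformer and a Mamba model, achieving state-of-the-art performance on facial beauty prediction.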
Details
Motivation: Existing deep learning models struggle to capture the global, holistic facial features that drive human aesthetic judgment; Vision Transformers model long-range spatial relationships effectively but have high computational complexity.
Method: Designs VM-BeautyNet, a heterogeneous ensemble that fuses a Vision Transformer with a Mamba-based vision model, exploiting their complementary strengths in feature extraction.
Result: On the SCUT-FBP5500 dataset, VM-BeautyNet achieves a Pearson correlation of 0.9212, a mean absolute error of 0.2085, and a root mean square error of 0.2698, outperforming existing methods.
Conclusion: By combining global structure modeling with efficient sequence modeling, VM-BeautyNet achieves strong performance on facial beauty prediction, provides interpretability analysis via Grad-CAM, and offers a new architectural paradigm for computational aesthetics.
Abstract: Facial Beauty Prediction (FBP) is a complex and challenging computer vision task, aiming to model the subjective and intricate nature of human aesthetic perception. While deep learning models, particularly Convolutional Neural Networks (CNNs), have made significant strides, they often struggle to capture the global, holistic facial features that are critical to human judgment. Vision Transformers (ViT) address this by effectively modeling long-range spatial relationships, but their quadratic complexity can be a bottleneck. This paper introduces a novel, heterogeneous ensemble architecture, VM-BeautyNet, that synergistically fuses the complementary strengths of a Vision Transformer and a Mamba-based Vision model, a recent advancement in State-Space Models (SSMs). The ViT backbone excels at capturing global facial structure and symmetry, while the Mamba backbone efficiently models long-range dependencies with linear complexity, focusing on sequential features and textures. We evaluate our approach on the benchmark SCUT-FBP5500 dataset. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a Pearson Correlation (PC) of 0.9212, a Mean Absolute Error (MAE) of 0.2085, and a Root Mean Square Error (RMSE) of 0.2698.
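For readers unfamiliar with the three reported metrics, the following minimal sketch shows how PC, MAE, and RMSE are computed from predicted and ground-truth scores. The scores in the example are made-up values, not SCUT-FBP5500 data, and `regression_metrics` is a hypothetical helper name.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (Pearson correlation, mean absolute error, root mean square error)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    pc = np.corrcoef(y_true, y_pred)[0, 1]   # linear correlation; higher is better
    err = y_pred - y_true
    mae = np.mean(np.abs(err))               # average error magnitude
    rmse = np.sqrt(np.mean(err ** 2))        # penalizes large errors more heavily
    return pc, mae, rmse

# Illustrative beauty scores on a 1-5 scale (invented for this example).
truth = [2.0, 3.0, 3.5, 4.0]
preds = [2.5, 3.5, 4.0, 4.5]
pc, mae, rmse = regression_metrics(truth, preds)
print(pc, mae, rmse)
```

A constant offset leaves PC at 1 while MAE and RMSE stay nonzero, which is why correlation and error metrics are usually reported together.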
Furthermore, through Grad-CAM visualizations, we provide interpretability analysis that confirms the complementary feature extraction of the two backbones, offering new insights into the model's decision-making process and presenting a powerful new architectural paradigm for computational aesthetics.
[123] Designing a Convolutional Neural Network for High-Accuracy Oral Cavity Squamous Cell Carcinoma (OCSCC) Detection
Vishal Manikanden,Aniketh Bandlamudi,Daniel Haehn
Main category: cs.CV
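TL;DR: This study develops a convolutional neural network (CNN) paired with image-capture hardware for early detection of oral cavity squamous cell carcinoma (OCSCC). The model is trained on 4293 images and evaluated at different resolutions for accuracy, precision, recall, and mAP. Prediction accuracy grows logarithmically with image resolution, showing diminishing returns. Enhancement hardware and an application were also developed to improve detection efficacy and accessibility.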
Details
Motivation: Because early OCSCC is subtle, develops in deep and hidden locations, and grows slowly, it is often missed, leading to preventable deaths; an efficient and accurate early detection method is urgently needed.
Method: Trains a CNN on 4293 images of benign and malignant tumors plus negative samples, using image segmentation and kernel-matrix processing for pattern recognition; builds a hardware system for image capture and processing; evaluates on a test set rendered at several resolutions using precision, recall, and mAP; and develops a companion application for open access.
Result: Higher-resolution images improve prediction accuracy, but the gain follows a logarithmic trend and shows diminishing returns; the CNN is stable across resolutions, and the hardware system improves image quality.
Conclusion: Pairing high-resolution imaging hardware with a well-trained CNN can improve early OCSCC detection and has clinical potential; future work needs to balance the cost of resolution against the detection benefit.
Abstract: Oral Cavity Squamous Cell Carcinoma (OCSCC) is the most common type of head and neck cancer. Due to the subtle nature of its early stages, deep and hidden areas of development, and slow growth, OCSCC often goes undetected, leading to preventable deaths. However, properly trained Convolutional Neural Networks (CNNs), with their precise image segmentation techniques and ability to apply kernel matrices to modify the RGB values of images for accurate image pattern recognition, would be an effective means for early detection of OCSCC. Pairing this neural network with image capturing and processing hardware would allow increased efficacy in OCSCC detection. The aim of our project is to develop a Convolutional Neural Network trained to recognize OCSCC, as well as to design a physical hardware system to capture and process detailed images, in order to determine the image quality required for accurate predictions. A CNN was trained on 4293 training images consisting of benign and malignant tumors, as well as negative samples, and was evaluated for its precision, recall, and Mean Average Precision (mAP) in its predictions of OCSCC. A testing dataset of randomly assorted images of cancerous, non-cancerous, and negative images was chosen, and each image was altered to represent 5 common resolutions. This test data set was thoroughly analyzed by the CNN and predictions were scored on the basis of accuracy. The designed enhancement hardware was used to capture detailed images, and its impact was scored. An application was developed to facilitate the testing process and bring open access to the CNN.
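As a reminder of how two of the evaluation metrics named above are defined, here is a minimal sketch that computes precision and recall from binary predictions. The labels are invented for illustration and are not data from the study; `precision_recall` is a hypothetical helper name, and mAP is omitted because it additionally requires ranked detection scores.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, given parallel label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # flagged cases that are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # real cases that were flagged
    return precision, recall

# 1 = malignant, 0 = not malignant (made-up labels).
truth = [1, 1, 0, 0, 1]
preds = [1, 0, 1, 0, 1]
precision, recall = precision_recall(truth, preds)
print(precision, recall)
```

In a screening setting, recall is often weighted more heavily, since a missed malignancy is usually costlier than a false alarm.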
Images of increasing resolution resulted in higher-accuracy predictions on a logarithmic scale, demonstrating the diminishing returns of higher pixel counts.
[124] Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset
Claire McLean,Makenzie Meendering,Tristan Swartz,Orri Gabbay,Alexandra Olsen,Rachel Jacobs,Nicholas Rosen,Philippe de Bree,Tony Garcia,Gadsden Merrill,Jake Sandakly,Julia Buffalini,Neham Jain,Steven Krenn,Moneish Kumar,Dejan Markovic,Evonne Ng,Fabian Prada,Andrew Saba,Siwei Zhang,Vasu Agrawal,Tim Godisart,Alexander Richard,Michael Zollhoefer
Main category: cs.CV
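TL;DR: Embody 3D is a large-scale multimodal 3D motion dataset from the Codec Avatars Lab at Meta, containing 500 individual hours of 3D motion data from 439 participants and covering both single-person motion and multi-person interaction.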
Details
Motivation: Progress in areas such as virtual avatars, human-computer interaction, and behavior analysis requires a high-quality, diverse 3D human motion dataset.
Method: Captures 3D motion data from 439 participants with a multi-camera system, including body and hand tracking, text annotations, and separate audio tracks, covering a range of single-person and multi-person interactive behaviors.
Result: Builds a large 3D motion dataset of over 54 million frames, intended to support tasks such as action recognition, conversational behavior modeling, and emotional state analysis.
Conclusion: Embody 3D provides a rich, high-precision data resource for studying complex human behavior and social interaction, with broad application potential.
Abstract: The Codec Avatars Lab at Meta introduces Embody 3D, a multimodal dataset of 500 individual hours of 3D motion data from 439 participants collected in a multi-camera collection stage, amounting to over 54 million frames of tracked 3D motion. The dataset features a wide range of single-person motion data, including prompted motions, hand gestures, and locomotion; as well as multi-person behavioral and conversational data like discussions, conversations in different emotional states, collaborative activities, and co-living scenarios in an apartment-like space. We provide tracked human motion including hand tracking and body shape, text annotations, and a separate audio track for each participant.
[125] Proactive Scene Decomposition and Reconstruction
Baicheng Li,Zike Yan,Dong Wu,Hongbin Zha
Main category: cs.CV
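TL;DR: Proposes a new proactive approach to scene decomposition and reconstruction based on human-object interactions, using Gaussian splatting for accurate dynamic scene modeling and efficient rendering.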
Details
Motivation: Traditional static object-level reconstruction is ambiguous in dynamic environments and struggles to model scene dynamics accurately, so a method is needed that uses cues from human behavior for online, progressive reconstruction.
Method: Analyzes human-object interactions in egocentric video to iteratively decompose and reconstruct the environment, models the dynamic scene with Gaussian splatting, and integrates camera pose estimation, instance decomposition, and online map updating.
Result: The method is validated in multiple real-world scenarios, achieving high-fidelity, efficient dynamic scene rendering with promising advantages over conventional object-level reconstruction.
Conclusion: The proposed method makes effective use of human behavioral cues to achieve accurate, consistent, and progressively refinable scene decomposition and reconstruction in dynamic environments.
Abstract: Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.
[126] Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models
Yue Zheng,Xiufang Shi,Jiming Chen,Yuanchao Shu
Main category: cs.CV
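TL;DR: Cerberus is a two-stage cascaded system for real-time video anomaly detection that combines lightweight filtering with fine-grained reasoning by a vision-language model (VLM), substantially increasing inference speed while maintaining high accuracy.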
Details
Motivation: Existing VLM-based video anomaly detection methods offer strong zero-shot detection but are computationally expensive and have unstable visual grounding, which makes real-time deployment difficult.
Method: Cerberus uses a two-stage cascade: it learns rules of normal behavior offline, then combines lightweight filtering with fine-grained VLM reasoning during online inference. Motion mask prompting directs the VLM to motion-relevant regions, and rule-based deviation detection identifies anomalies as departures from learned normal patterns.
Result: Experiments on four datasets show that Cerberus averages 57.68 fps on an NVIDIA L40S GPU, a 151.79× speedup over existing VLM methods, while reaching 97.2% accuracy, close to the state of the art.
Conclusion: Cerberus greatly improves inference efficiency while preserving high detection accuracy, offering a practical real-time solution for VLM-based video anomaly detection.
Abstract: Video anomaly detection (VAD) has advanced rapidly with the recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79× speedup, and 97.2% accuracy comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.
[127] OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models
Ryoto Miyamoto,Xin Fan,Fuyuko Kido,Tsuneo Matsumoto,Hayato Yamana
Main category: cs.CV
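TL;DR: OpenLVLM-MIA is a new benchmark that exposes fundamental challenges in evaluating membership inference attacks against large vision-language models, showing that previously reported high attack success rates largely stem from distributional bias in dataset construction rather than from identifying true membership status.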
Details
Motivation: Existing MIA evaluation results may be affected by distributional bias in the datasets, leading to overestimated attack performance; a fair, controlled evaluation setting is lacking.
Method: Builds OpenLVLM-MIA, a controlled benchmark of 6,000 images with balanced member and non-member distributions and ground-truth membership labels for three training stages.
Result: Experiments show that, under unbiased conditions, state-of-the-art MIA methods degrade to chance-level performance.
Conclusion: OpenLVLM-MIA provides a transparent, unbiased benchmark that clarifies the limitations of current MIA research on LVLMs and lays a foundation for developing stronger privacy-preserving techniques.
Abstract: OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6,000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods converged to random chance under unbiased conditions. By offering a transparent and unbiased benchmark, OpenLVLM-MIA clarifies the current limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.
[128] Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation
Rui Yang,Huining Li,Yiyi Long,Xiaojun Wu,Shengfeng He
Main category: cs.CV
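TL;DR: Proposes Stroke2Sketch, a training-free framework that uses cross-image stroke attention to transfer stroke attributes precisely from a reference style while preserving the semantic structure of the content.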
Details
Motivation: Sketch generation guided by a reference style must transfer stroke attributes (such as line thickness, deformation, and texture sparsity) precisely while preserving semantic structure and content fidelity.
Method: Proposes Stroke2Sketch, which embeds a cross-image stroke attention mechanism in self-attention layers to establish fine-grained semantic correspondences and transfer stroke attributes accurately; adaptive contrast enhancement and semantic-focused attention strengthen content preservation and foreground emphasis.
Result: Stroke2Sketch generates stylistically faithful sketches that resemble hand-drawn results, outperforming existing methods in expressive stroke control and semantic coherence.
Conclusion: The method requires no training and achieves high-quality stylized sketch generation, balancing style transfer accuracy with preservation of content structure.
Abstract: Generating sketches guided by reference styles requires precise transfer of stroke attributes, such as line thickness, deformation, and texture sparsity, while preserving semantic structure and content fidelity. To this end, we propose Stroke2Sketch, a novel training-free framework that introduces cross-image stroke attention, a mechanism embedded within self-attention layers to establish fine-grained semantic correspondences and enable accurate stroke attribute transfer. This allows our method to adaptively integrate reference stroke characteristics into content images while maintaining structural integrity. Additionally, we develop adaptive contrast enhancement and semantic-focused attention to reinforce content preservation and foreground emphasis. Stroke2Sketch effectively synthesizes stylistically faithful sketches that closely resemble handcrafted results, outperforming existing methods in expressive stroke control and semantic coherence. Code is available at https://github.com/rane7/Stroke2Sketch.
[129] Scaling Laws for Deepfake Detection
Wenhao Wang,Longqi Cai,Taihong Xiao,Yuxiao Wang,Ming-Hsuan Yang
Main category: cs.CV
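TL;DR: Systematically analyzes scaling laws for deepfake detection and builds ScaleDF, the largest dataset to date (51 real image domains and 102 generation methods); detection error decays as a power law as the number of real domains or deepfake methods grows, revealing a predictable effect of data scale on detection performance.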
Details
Motivation: Existing datasets are too small to study scaling laws in deepfake detection, and there has been no systematic analysis of how model performance varies with the number of real image domains, deepfake methods, and training images.
Method: Builds the large-scale ScaleDF dataset, with over 5.8 million real images and over 8.8 million fake images, systematically evaluates detection performance at different numbers of real domains, deepfake methods, and training samples, and fits power-law relationships.
Result: Detection error follows a predictable power-law decay as the number of real image domains or deepfake methods increases, similar to scaling laws in large language models; the study also examines the role of pre-training and data augmentation under scaling and reveals the limitations of scaling itself.
Conclusion: Deepfake detection performance follows predictable scaling laws, and robustness can be improved by increasing data diversity (new domains or new deepfake methods), offering a data-centric way to counter evolving forgery techniques.
Abstract: This paper presents a systematic study of scaling laws for the deepfake detection task. Specifically, we analyze the model performance against the number of real image domains, deepfake generation methods, and training images. Since no existing dataset meets the scale requirements for this research, we construct ScaleDF, the largest dataset to date in this field, which contains over 5.8 million real images from 51 different datasets (domains) and more than 8.8 million fake images generated by 102 deepfake methods. Using ScaleDF, we observe power-law scaling similar to that shown in large language models (LLMs). Specifically, the average detection error follows a predictable power-law decay as either the number of real domains or the number of deepfake methods increases. This key observation not only allows us to forecast the number of additional real domains or deepfake methods required to reach a target performance, but also inspires us to counter the evolving deepfake technology in a data-centric manner. Beyond this, we examine the role of pre-training and data augmentations in deepfake detection under scaling, as well as the limitations of scaling itself.
[130] Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention
Yuyao Zhang,Yu-Wing Tai
Main category: cs.CV
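TL;DR: Proposes Scale-DiT, a diffusion framework that combines hierarchical local attention with low-resolution global guidance for efficient, scalable, and semantically coherent 4K ultra-high-resolution text-to-image generation.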
Details
Motivation: Existing diffusion models are limited by the quadratic complexity of attention and the scarcity of high-quality 4K training data, which makes ultra-high-resolution image generation difficult.
Method: Splits high-resolution latents into local windows to reduce attention complexity to near-linear, and introduces a low-resolution latent with scaled positional anchors to supply global semantics; a lightweight LoRA module bridges the global and local pathways, and Hilbert curve reordering with a fused kernel improves inference efficiency.
Result: Without additional 4K training data, Scale-DiT achieves more than 2× faster inference and lower memory usage, generates 4K images reliably, and matches or outperforms state-of-the-art methods that rely on native 4K training on FID, IS, CLIP Score, and visual quality.
Conclusion: Hierarchical local attention combined with low-resolution global guidance is an efficient and promising approach to ultra-high-resolution image generation.
Abstract: Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-1K × 1K resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native 4K training data. We present Scale-DiT, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we permute the token sequence into Hilbert curve order and implement a fused kernel for skipping masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than 2× faster inference and lower memory usage compared to dense attention baselines, while reliably scaling to 4K × 4K resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training.
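The complexity reduction from fixed-size local windows can be sketched as follows. This is a simplified illustration, not the Scale-DiT implementation: it uses single-head attention with no learned projections (queries, keys, and values are the tokens themselves) and omits the low-resolution global guidance, the LoRA adaptation, and the Hilbert curve reordering. The names `window_partition` and `local_attention` are hypothetical.

```python
import numpy as np

def window_partition(latent, win):
    """Split an (H, W, C) latent into non-overlapping windows of win*win tokens."""
    H, W, C = latent.shape
    assert H % win == 0 and W % win == 0, "H and W must be multiples of win"
    x = latent.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, win, win, C)
    return x.reshape(-1, win * win, C)        # (num_windows, tokens, C)

def local_attention(windows):
    """Self-attention computed independently inside each window."""
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ windows

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8, 4))           # toy latent: 64 tokens, 4 channels
windows = window_partition(latent, win=4)     # 4 windows of 16 tokens each
out = local_attention(windows)
print(windows.shape, out.shape)
```

With N = H·W tokens, dense attention forms an N × N score matrix, whereas windows of w² tokens need N/w² matrices of size w² × w², for a total of N·w² entries. For a fixed window size this grows linearly in N, which is the near-linear behavior the abstract refers to.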
Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach for advancing ultra-high-resolution image generation.
[131] DiffusionX: Efficient Edge-Cloud Collaborative Image Generation with Multi-Round Prompt Evolution
Yi Wei,Shunpu Tang,Liang Zhao,Qiangian Yang
Main category: cs.CV
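TL;DR: DiffusionX is a cloud-edge collaborative framework for multi-round prompt-based generation: a lightweight on-device model quickly produces preview images, a large cloud model performs the final refinement, and a noise level predictor dynamically balances the computation load, reducing latency and cloud workload while maintaining high image quality.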
Details
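Motivation: Diffusion models have made remarkable progress in image generation, but generation is computationally expensive and users often iterate on prompts many times, which increases latency and burdens cloud resources; an efficient solution for multi-round interactive generation is needed.
Method: Proposes the DiffusionX framework: a lightweight on-device model generates previews in real time to support fast interaction, while a large cloud model handles the final high-quality generation; a noise level predictor dynamically splits computation between device and cloud to optimize the trade-off between latency and resource consumption.
Result: Experiments show that DiffusionX reduces average generation time by 15.8% compared with Stable Diffusion v1.5 at comparable image quality, and is only 0.9% slower than Tiny-SD with significantly better image quality, demonstrating efficiency, scalability, and low overhead.
Conclusion: Through cloud-edge collaboration and dynamic load allocation, DiffusionX improves the efficiency and user experience of multi-round prompt-based image generation, offering a scalable and practical solution for generative AI in resource-constrained settings.
Abstract: Recent advances in diffusion models have driven remarkable progress in image generation. However, the generation process remains computationally intensive, and users often need to iteratively refine prompts to achieve the desired results, further increasing latency and placing a heavy burden on cloud resources. To address this challenge, we propose DiffusionX, a cloud-edge collaborative framework for efficient multi-round, prompt-based generation. In this system, a lightweight on-device diffusion model interacts with users by rapidly producing preview images, while a high-capacity cloud model performs final refinements after the prompt is finalized. We further introduce a noise level predictor that dynamically balances the computation load, optimizing the trade-off between latency and cloud workload. Experiments show that DiffusionX reduces average generation time by 15.8% compared with Stable Diffusion v1.5, while maintaining comparable image quality. Moreover, it is only 0.9% slower than Tiny-SD with significantly improved image quality, thereby demonstrating efficiency and scalability with minimal overhead.
[132] TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement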
Haiyue Sun,Qingdong He,Jinlong Peng,Peng Tang,Jiangning Zhang,Junwei Zhu,Xiaobin Hu,Shuicheng Yan
Main category: cs.CV
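TL;DR: Proposes TokenAR, a framework that introduces a token-level enhancement mechanism to resolve identity confusion in multi-reference image generation, significantly improving autoregressive models on conditional image generation.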
Details
Motivation: Existing autoregressive models struggle to decouple different reference identities in multi-reference image generation, which leads to poor identity consistency; this work aims to resolve that identity confusion.
Method: Proposes the TokenAR framework, which has three parts: Token Index Embedding, Instruct Token Injection, and the identity-token disentanglement strategy (ITD), strengthening the representation of distinct identities at the token level.
Result: Surpasses current state-of-the-art models on multi-reference image generation, with good identity consistency and high-quality background reconstruction. The authors also release the InstructAR Dataset, the first large-scale open-source multi-reference image generation dataset (28K training samples).
Conclusion: Token-level enhancement in TokenAR effectively mitigates identity confusion in multi-reference generation, offers a new direction for applying autoregressive models to complex conditional generation tasks, and is practical and extensible.
Abstract: Autoregressive (AR) models have shown remarkable success in conditional image generation. However, these approaches struggle to decouple different reference identities in multiple-reference generation. In this work, we propose the TokenAR framework, which focuses on a simple but effective token-level enhancement mechanism to address the reference identity confusion problem. The token-level enhancement consists of three parts: 1) Token Index Embedding clusters token indices to better represent tokens from the same reference image; 2) Instruct Token Injection acts as an extra visual feature container that injects detailed and complementary priors for reference tokens; 3) the identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity. This token-enhancement framework significantly augments the capabilities of existing AR-based methods in conditional image generation, enabling good identity consistency while preserving high-quality background reconstruction. Driven by the goal of high quality and high diversity in multi-subject generation, we introduce the InstructAR Dataset, the first open-source, large-scale, multi-reference-input, open-domain image generation dataset. It includes 28K training pairs; each example has two reference subjects, a relative prompt, and a background with mask annotation, curated for training and evaluating multiple-reference image generation.
Comprehensive experiments validate that our approach surpasses current state-of-the-art models on the multiple-reference image generation task. The implementation code and datasets will be made publicly available. Code is available at https://github.com/lyrig/TokenAR
[133] RL makes MLLMs see better than SFT
Junha Song,Sangdoo Yun,Dongyoon Han,Jaegul Choo,Byeongho Heo
Main category: cs.CV
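TL;DR: This paper studies how the vision encoder in multimodal language models (MLLMs) behaves under different training strategies (supervised finetuning, SFT, versus reinforcement learning, RL). It finds that RL markedly improves the strength and localization precision of visual representations, and proposes PIVOT, an efficient method that builds strong MLLM vision encoders at less than 1% of the compute cost of standard vision pretraining.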
Details
Motivation: Current MLLM research leans heavily on the LLM backbone and neglects the role of the vision encoder, and there is little understanding of how the emerging RL training paradigm affects visual representations. Method: Conducts a comprehensive experimental analysis of SFT-trained and RL-trained MLLMs across several tasks (such as ImageNet classification, segmentation, and gradient visualization), systematically evaluates how the training strategy affects the vision encoder, and proposes PIVOT based on the findings. Result: RL training outperforms SFT on vision-related VQA tasks and yields stronger, more precisely localized visual representations. A PIVOT-trained vision encoder can outperform larger and more complex models at very low compute cost. Conclusion: The post-training strategy of an MLLM profoundly reshapes its visual representations, and reinforcement learning outperforms supervised finetuning. PIVOT offers an efficient new path for improving the vision backbones of MLLMs. Abstract: A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight, namely the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and more precisely localized visual representations than SFT, boosting the ability of the vision encoder for MLLMs. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). 
When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/
[134] On the Provable Importance of Gradients for Language-Assisted Image Clustering
Bo Peng,Jie Lu,Guangquan Zhang,Zhen Fang
Main category: cs.CV
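TL;DR: This paper proposes GradNorm, a gradient-based framework for language-assisted image clustering (LaIC), and shows through theoretical proof and experiments that it is effective at filtering positive nouns and improving clustering performance.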
Details
Motivation: Without access to true class names, the core challenge of language-assisted image clustering (LaIC) is how to filter positive nouns, i.e., those semantically close to the target images, from unlabeled text data. Existing methods rely on the CLIP feature space but lack theoretical support. Method: Proposes the GradNorm framework, which measures the positiveness of each noun by the magnitude of the gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output, and provides a theoretical error bound that quantifies the separability of positive nouns. Result: Theoretical analysis shows that GradNorm subsumes existing filtering strategies as special cases, and it achieves state-of-the-art clustering performance on multiple benchmarks. Conclusion: GradNorm has a solid theoretical foundation and strong empirical performance, providing an effective and general solution to the noun filtering problem in language-assisted image clustering. Abstract: This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks.
[135] MIRAD - A comprehensive real-world robust anomaly detection dataset for Mass Individualization
Pulin Li,Guocheng Wu,Li Yin,Yuxin Zheng,Wei Zhang,Yanjie Zhou
Main category: cs.CV
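TL;DR: This paper presents MIRAD, the first benchmark dataset for anomaly detection in social manufacturing, aimed at the quality control problems of mass individualized production. The dataset covers three challenges: highly customized products, distributed manufacturing nodes, and imaging heterogeneity. Several state-of-the-art anomaly detection methods are evaluated on it, and the results show that existing models degrade significantly in real individualized production scenarios, highlighting open problems in this area.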
Details
Motivation: Social manufacturing brings new challenges for product quality control. Defect detection in particular must cope with highly customized products, fragmented small-batch orders, and differing imaging environments across distributed sites, while the lack of real-world datasets and tailored algorithms holds back research. Method: Builds MIRAD, a large-scale robust anomaly detection dataset for mass individualization, containing diverse individualized products, data from six geographically dispersed manufacturing nodes, and substantial imaging heterogeneity. State-of-the-art anomaly detection methods, including one-class, multi-class, and zero-shot approaches, are systematically evaluated on it. Result: Experiments on MIRAD show that existing SOTA anomaly detection methods perform substantially worse than on conventional benchmarks, confirming the complexity and difficulty of defect detection in real individualized production environments. Conclusion: MIRAD provides a realistic benchmark platform for robust quality control research in social manufacturing, bridging the gap between industrial requirements and academic research and advancing quality assurance technology for Industry 5.0. Abstract: Social manufacturing leverages community collaboration and scattered resources to realize mass individualization in modern industry. However, this paradigm shift also introduces substantial challenges in quality control, particularly in defect detection. The main difficulties stem from three aspects. First, products often have highly customized configurations. Second, production typically involves fragmented, small-batch orders. Third, imaging environments vary considerably across distributed sites. To overcome the scarcity of real-world datasets and tailored algorithms, we introduce the Mass Individualization Robust Anomaly Detection (MIRAD) dataset. As the first benchmark explicitly designed for anomaly detection in social manufacturing, MIRAD captures three critical dimensions of this domain: (1) diverse individualized products with large intra-class variation, (2) data collected from six geographically dispersed manufacturing nodes, and (3) substantial imaging heterogeneity, including variations in lighting, background, and motion conditions. We then conduct extensive evaluations of state-of-the-art (SOTA) anomaly detection methods on MIRAD, covering one-class, multi-class, and zero-shot approaches. Results show a significant performance drop across all models compared with conventional benchmarks, highlighting the unresolved complexities of defect detection in real-world individualized production. By bridging industrial requirements and academic research, MIRAD provides a realistic foundation for developing robust quality control solutions essential for Industry 5.0. 
The dataset is publicly available at https://github.com/wu33learn/MIRAD.
[136] Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
Mohammad Javad Ahmadi,Iman Gandomi,Parisa Abdi,Seyed-Farzad Mohammadi,Amirhossein Taslimi,Mehdi Khodaparast,Hassan Hashemi,Mahdi Tavakoli,Hamid D. Taghirad
Main category: cs.CV
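TL;DR: This paper introduces a large-scale dataset of 3,000 phacoemulsification cataract surgery videos from two medical centers, performed by surgeons of varying experience levels. It provides four detailed annotation layers and supports benchmarking of AI tasks such as surgical workflow recognition, scene segmentation, and automated skill assessment.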
Details
Motivation: Existing cataract surgery datasets lack the diversity and annotation depth needed to train deep learning models that generalize, so a more comprehensive, multi-center dataset with multi-layer annotation is needed to advance computer-assisted surgery systems. Method: Collects 3,000 phacoemulsification cataract surgery videos from two surgical centers, performed by surgeons with different experience levels, and annotates them in four layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on rubrics such as the ICO-OSCAR. The technical quality of the dataset is checked with several benchmark experiments, and a domain adaptation baseline is established. Result: The dataset supports modeling and evaluation for multiple surgical AI tasks, shows good technical quality for workflow recognition, scene segmentation, and skill assessment, and yields a domain adaptation baseline for cross-center phase recognition. Conclusion: This multi-center, multi-layer annotated cataract surgery video dataset is an important resource for training and evaluating generalizable surgical AI models and helps advance computer-assisted surgery systems. Abstract: The development of computer-assisted surgery systems depends on large-scale, annotated datasets. Current resources for cataract surgery often lack the diversity and annotation depth needed to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos from two surgical centers, performed by surgeons with a range of experience levels. This resource is enriched with four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on established competency rubrics such as the ICO-OSCAR. The technical quality of the dataset is supported by a series of benchmarking experiments for key surgical AI tasks, including workflow recognition, scene segmentation, and automated skill assessment. Furthermore, we establish a domain adaptation baseline for the phase recognition task by training a model on a subset of surgical centers and evaluating its performance on a held-out center. The dataset and annotations are available via a Google Form (https://docs.google.com/forms/d/e/1FAIpQLSfmyMAPSTGrIy2sTnz0-TMw08ZagTimRulbAQcWdaPwDy187A/viewform?usp=dialog).
[137] iWatchRoadv2: Pothole Detection, Geospatial Mapping, and Intelligent Road Governance
Rishi Raj Sahoo,Surbhi Saswati Mohanty,Subhankar Mishra
Main category: cs.CV
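TL;DR: iWatchRoadv2 is an end-to-end automated platform for real-time pothole detection on Indian roads. It combines a YOLO model, GPS geolocation, and OpenStreetMap visualization, and supports automatic alerts, repair accountability, and intelligent governance.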
Details
Motivation: Indian roads are under-maintained and frequent potholes create safety hazards, so an efficient, scalable automated monitoring solution is needed to improve the efficiency and transparency of road management. Method: Builds a dataset of more than 7,000 dashcam images and fine-tunes an Ultralytics YOLO model for pothole detection. Precise geolocation is obtained by synchronizing OCR-extracted timestamps with GPS logs. A backend database manages road segment and contractor information, and a web-based visualization and alert system is developed on top. Result: Achieves accurate real-time pothole detection and geotagging. The system automatically sends road deterioration alerts to contractors and officials and provides an actionable analytics interface that supports repair planning, budget allocation, and quality assessment. Conclusion: Through automated monitoring and intelligent governance mechanisms, iWatchRoadv2 promotes data-driven city management and sustainable road maintenance, and is cost-effective and scalable for wide deployment in urban and rural areas. Abstract: Road potholes pose significant safety hazards and maintenance challenges, particularly on India's diverse and under-maintained road networks. This paper presents iWatchRoadv2, a fully automated end-to-end platform for real-time pothole detection, GPS-based geotagging, and dynamic road health visualization using OpenStreetMap (OSM). We curated a self-annotated dataset of over 7,000 dashcam frames capturing diverse Indian road conditions, weather patterns, and lighting scenarios, which we used to fine-tune the Ultralytics YOLO model for accurate pothole detection. The system synchronizes OCR-extracted video timestamps with external GPS logs to precisely geolocate each detected pothole, enriching detections with comprehensive metadata, including road segment attribution and contractor information managed through an optimized backend database. iWatchRoadv2 introduces intelligent governance features that enable authorities to link road segments with contract metadata through a secure login interface. The system automatically sends alerts to contractors and officials when road health deteriorates, supporting automated accountability and warranty enforcement. The intuitive web interface delivers actionable analytics to stakeholders and the public, facilitating evidence-driven repair planning, budget allocation, and quality assessment. Our cost-effective and scalable solution streamlines frame processing and storage while supporting seamless public engagement for urban and rural deployments. 
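The timestamp-to-GPS synchronization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how a frame timestamp is matched to the GPS log, so linear interpolation between the two nearest fixes is an assumption here, and `interpolate_gps` and the `(timestamp_s, lat, lon)` log layout are hypothetical names chosen for illustration.

```python
from bisect import bisect_left


def interpolate_gps(gps_log, t):
    """Estimate (lat, lon) at time t from a GPS log.

    gps_log: list of (timestamp_s, lat, lon) tuples sorted by timestamp
             (hypothetical layout, not taken from the paper).
    t:       frame timestamp in seconds, e.g. read from the video by OCR.
    Assumption: position changes linearly between consecutive GPS fixes.
    """
    times = [row[0] for row in gps_log]
    i = bisect_left(times, t)
    if i == 0:                      # before the first fix: clamp
        return gps_log[0][1:]
    if i == len(gps_log):           # after the last fix: clamp
        return gps_log[-1][1:]
    (t0, lat0, lon0), (t1, lat1, lon1) = gps_log[i - 1], gps_log[i]
    w = (t - t0) / (t1 - t0)        # fraction of the way from fix 0 to fix 1
    return (lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0))


log = [(0.0, 20.0, 85.0), (10.0, 20.1, 85.2)]
lat, lon = interpolate_gps(log, 5.0)  # halfway between the two fixes
```

Clamping at the ends of the log is one simple choice; a deployed system might instead discard frames that fall outside the logged interval.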
By automating the complete pothole monitoring lifecycle, from detection to repair verification, iWatchRoadv2 enables data-driven smart city management, transparent governance, and sustainable improvements in road infrastructure maintenance. The platform and live demonstration are accessible at https://smlab.niser.ac.in/project/iwatchroad.
[138] Demeter: A Parametric Model of Crop Plant Morphology from the Real World
Tianhang Cheng,Albert J. Zhai,Evan Z. Chen,Rui Zhou,Yawen Deng,Zitong Li,Kejie Zhao,Janice Shiu,Qianyu Zhao,Yide Xu,Xinlei Wang,Yuan Shen,Sheng Wang,Lisa Ainsworth,Kaiyu Guan,Shenlong Wang
Main category: cs.CV
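TL;DR: This paper proposes Demeter, a data-driven 3D parametric model of plant morphology that handles topology changes across species and models three kinds of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. The authors also collect a large-scale real-world soybean plant dataset for validation. Experiments show that the model performs well at shape synthesis, structure reconstruction, and simulation of biophysical processes.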
Details
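Motivation: 3D parametric models are widely used for humans and animals, but equally expressive methods for plants are still lacking. Plants exhibit complex morphological variation and topological diversity, and need a model that captures structure, articulation, shape variation, and deformation together. This calls for a new parametric model suited to plants, and crops in particular. Method: Proposes Demeter, a data-driven parametric model that encodes the key factors of plant morphology (topology, shape, articulation, and deformation) into a compact learned representation. The model handles varying shape topology and models three main sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To support the study, the authors collected a large-scale ground-truthed 3D dataset from a soybean farm as a testbed. Result: Experiments show that Demeter is effective at generating plant shapes, reconstructing structures, and simulating biophysical processes. The model reconstructs complex plant structures and supports downstream uses such as simulation and analysis. The data and code have been publicly released. Conclusion: Demeter is a data-driven parametric model that jointly models multiple kinds of plant morphological variation, filling a gap left by existing 3D shape models in the plant domain, with notable potential for crop modeling. Abstract: Learning 3D parametric shape models of objects has gained popularity in vision and graphics and has shown broad utility in 3D reconstruction, generation, understanding, and simulation. While powerful models exist for humans and animals, equally expressive approaches for modeling plants are lacking. In this work, we present Demeter, a data-driven parametric model that encodes key factors of plant morphology, including topology, shape, articulation, and deformation, into a compact learned representation. Unlike previous parametric models, Demeter handles varying shape topology across various species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To advance crop plant modeling, we collected a large-scale, ground-truthed dataset from a soybean farm as a testbed. Experiments show that Demeter effectively synthesizes shapes, reconstructs structures, and simulates biophysical processes. Code and data are available at https://tianhang-cheng.github.io/Demeter/.
[139] SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation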
Yeh Keng Hao,Hsu Tzu Wei,Sun Min
Main category: cs.CV
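TL;DR: Proposes a lightweight encoder-decoder framework that combines sparse convolution, the SPLite decoder, and quantization-aware training, significantly improving the efficiency of hand pose estimation on AR/VR edge devices while maintaining high accuracy.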
Details
Motivation: AR/VR devices require real-time inference, low power consumption, and low latency, which makes balancing efficiency and performance a challenge when deploying deep learning models on edge devices. Method: Uses a ResNet-18 backbone with sparse convolution to exploit the inherent sparsity of hand pose images, designs the SPLite decoder to raise the decoding frame rate, and applies quantization-aware training to reduce memory usage while preserving accuracy. Result: Achieves a 42% end-to-end efficiency improvement, a 3.1x increase in decoding frame rate, and a 2.98x overall CPU speed-up on a Raspberry Pi 5. On FreiHAND, PA-MPJPE rises only slightly, from 9.0 mm to 9.1 mm, and accuracy on compound benchmarks is comparable to SOTA methods. Conclusion: The proposed method significantly improves computational efficiency while maintaining high accuracy, making it suitable for deployment on resource-constrained AR/VR edge devices. Abstract: With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture boosts the decoding frame rate by 3.1x on the Raspberry Pi 5 while maintaining comparable accuracy. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.
[140] REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
Changyue Shi,Minghao Chen,Yiping Mao,Chuxiao Yang,Xinyuan Hu,Jiajun Ding,Zhou Yu
Main category: cs.CV
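TL;DR: This paper proposes REALM, an MLLM-agent framework that performs open-world reasoning-based segmentation on 3D Gaussian Splatting renderings without extensive 3D-specific post-training. It uses a global-to-local spatial grounding strategy: several global views are first fed in parallel for coarse localization, then close-up novel views are synthesized for fine-grained local segmentation, producing accurate and consistent 3D masks. Experiments show that REALM interprets both explicit and implicit instructions well and supports a range of 3D interaction tasks.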
Details
Motivation: Existing 3D segmentation methods struggle to understand ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. A method that combines the strengths of both is needed. Method: Proposes REALM, an MLLM-agent framework that performs segmentation directly on 3D Gaussian Splatting representations and exploits their ability to render photorealistic novel views. It introduces a global-to-local spatial grounding strategy: multiple global views are first fed in parallel to coarsely localize the target object, then several close-up novel views are synthesized for fine-grained local segmentation. Result: Experiments show that REALM achieves strong performance on both explicit and implicit instructions across LERF, 3D-OVS, and the newly proposed REALM3D benchmark. The framework also supports several 3D interaction tasks, including object removal, replacement, and style transfer. Conclusion: REALM links complex natural language instructions to precise 3D object grounding without relying on extensive 3D-specific post-training, demonstrating practicality and flexibility in handling complex instructions and diverse 3D interaction tasks. Abstract: Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. 
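The coarse localization step, in which responses from several global views are aggregated to identify the target object, can be sketched as a vote. The abstract does not state the aggregation rule, so the majority vote below is an assumption, and `aggregate_view_votes` and the integer object IDs are hypothetical names used only for illustration.

```python
from collections import Counter


def aggregate_view_votes(view_predictions):
    """Pick the object ID that most global views agree on.

    view_predictions: one entry per rendered global view, holding the object
    ID the agent chose for that view, or None if it found no target there.
    Assumption: a plain majority vote; the paper may aggregate differently.
    """
    votes = Counter(p for p in view_predictions if p is not None)
    if not votes:
        return None  # no view produced a usable answer
    return votes.most_common(1)[0][0]


# Five global views; one view misses the object and one disagrees.
target_id = aggregate_view_votes([3, 3, None, 7, 3])
```

Voting across views is one way to reduce the sensitivity to viewpoint selection that the abstract describes, since a single poorly chosen view cannot decide the result on its own.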
Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.
[141] SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
Xiaojun Guo,Runyu Zhou,Yifei Wang,Qi Zhang,Chenheng Zhang,Stefanie Jegelka,Xiaohan Wang,Jiajun Chai,Guojun Yin,Wei Lin,Yisen Wang
Main category: cs.CV
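TL;DR: Proposes SSL4RL, a framework that uses self-supervised learning tasks as verifiable rewards and applies reinforcement learning to improve how vision-language models use visual evidence.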
Details
Motivation: Existing vision-language models rely on linguistic priors in vision-centric tasks and take textual shortcuts during reasoning, so they fail to use visual evidence effectively. Scalable, reliable reward mechanisms for reinforcement learning are also lacking. Method: Converts self-supervised learning tasks (such as image rotation prediction and masked patch reconstruction) into dense, automatic reward signals for reinforcement learning fine-tuning, with no need for human annotation or unreliable AI evaluators. Result: Substantially improves performance on vision-centric and vision-language reasoning benchmarks, identifies the key factors that affect effectiveness through ablation experiments, and demonstrates the generality of the framework on graph learning tasks. Conclusion: SSL4RL establishes a general and effective paradigm for aligning multimodal models with verifiable self-supervised objectives as rewards, offering new design principles for future work. Abstract: Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors (such as task difficulty, model scale, and semantic alignment with the target domain) that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
[142] LightGlueStick: a Fast and Robust Glue for Joint Point-Line Matching
Aidyn Ubingazhibov,Rémi Pautrat,Iago Suárez,Shaohui Liu,Marc Pollefeys,Viktor Larsson
Main category: cs.CV
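TL;DR: Proposes LightGlueStick, a lightweight matcher for points and line segments that achieves efficient matching through an Attentional Line Message Passing (ALMP) mechanism and reaches state-of-the-art performance on multiple benchmarks.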
Details
Motivation: Traditional methods treat point matching and line matching as independent tasks, while existing joint matching methods are too heavy for real-time use or deployment on edge devices. Method: Built on a graph neural network (GNN), the approach introduces an Attentional Line Message Passing (ALMP) module that explicitly models the connectivity between line segments, improving both matching efficiency and quality. Result: Across multiple benchmarks, LightGlueStick achieves state-of-the-art matching accuracy while keeping computational cost low. Conclusion: LightGlueStick is an efficient, lightweight joint matcher for points and line segments, suitable for real-time and edge-device applications. Abstract: Lines and points are complementary local features, whose combination has proven effective for applications such as SLAM and Structure-from-Motion. The backbone of these pipelines is the local feature matcher, which establishes correspondences across images. Traditionally, point and line matching have been treated as independent tasks. Recently, GlueStick proposed a GNN-based network that simultaneously operates on points and lines to establish matches. While running a single joint matching reduced the overall computational complexity, the heavy architecture prevented real-time applications or deployment to edge devices. Inspired by recent progress in point matching, we propose LightGlueStick, a lightweight matcher for points and line segments. The key novel component in our architecture is the Attentional Line Message Passing (ALMP), which explicitly exposes the connectivity of the lines to the network, allowing for efficient communication between nodes. In thorough experiments we show that LightGlueStick establishes a new state-of-the-art across different benchmarks. The code is available at https://github.com/aubingazhib/LightGlueStick.
[143] EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
Haoran Sun,Chen Cai,Huiping Zhuang,Kong Aik Lee,Lap-Pui Chau,Yi Wang
Main category: cs.CV
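TL;DR: This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal reasoning framework built on a large language model. Using spatio-temporal subtle information tokenization and a fine-grained multimodal chain-of-thought mechanism, it delivers accurate detection together with trustworthy, traceable explanations.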
Details
Motivation: Traditional deepfake detection methods lack transparency and generalization ability and struggle to keep up with evolving forgery techniques, so detectors with verifiable reasoning are urgently needed. Method: Proposes the ST-SIT module to extract and fuse global and local cross-frame features, and designs the Fg-MCoT mechanism, which introduces facial feature data as hard constraints to achieve pixel-level spatio-temporal localization and reliable reasoning. The ER-FF++set dataset is built to support dual supervision during training. Result: Experiments show that EDVD-LLaMA delivers outstanding performance and robustness in detection accuracy, explainability, and cross-forgery-method and cross-dataset scenarios. Conclusion: The method provides a more explainable and more reliable solution for deepfake detection and advances the use of trustworthy AI in multimedia security. Abstract: The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. 
Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.
[144] RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
Kunyu Peng,Di Wen,Jia Fu,Jiamin Wu,Kailun Yang,Junwei Zheng,Ruiping Liu,Yufan Chen,Yuqian Fu,Danda Pani Paudel,Luc Van Gool,Rainer Stiefelhagen
Main category: cs.CV
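TL;DR: This paper proposes RefAtomNet++, a new framework for language-guided fine-grained video action recognition that achieves state-of-the-art performance on the RefAVA++ dataset.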
Details
Motivation: Existing methods have limited ability in cross-modal alignment and target-person localization, and struggle to understand the atomic-level actions of a specific person in complex scenes. Method: Proposes RefAtomNet++, which combines a multi-hierarchical semantic-aligned cross-attention mechanism with multi-trajectory Mamba modeling, aggregates cross-modal tokens at the partial-keyword, scene-attribute, and holistic-sentence levels, and dynamically constructs scanning trajectories to strengthen spatio-temporal feature extraction. Result: Experiments on the RefAVA++ dataset show that RefAtomNet++ clearly outperforms existing baselines and sets new state-of-the-art results. Conclusion: Through multi-hierarchical semantic alignment and dynamic trajectory modeling, RefAtomNet++ effectively improves language-guided atomic action recognition. Abstract: Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome these limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. 
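The trajectory construction described above, selecting the nearest visual spatial token at each timestep, can be sketched in a few lines. The abstract does not define "nearest", so cosine similarity to a text query embedding is an assumption here, and `build_trajectory`, `cosine`, and the toy 2-D embeddings are hypothetical and for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_trajectory(query, frames):
    """Return, for each timestep, the index of the spatial token nearest to the query.

    query:  text embedding (e.g. for a keyword or scene attribute).
    frames: list over timesteps; each item is a list of spatial token embeddings.
    Assumption: "nearest" means highest cosine similarity to the query.
    """
    return [
        max(range(len(tokens)), key=lambda i: cosine(query, tokens[i]))
        for tokens in frames
    ]


query = [1.0, 0.0]
frames = [
    [[0.0, 1.0], [1.0, 0.0]],   # timestep 0: token 1 matches the query
    [[2.0, 0.1], [0.0, 3.0]],   # timestep 1: token 0 matches the query
]
trajectory = build_trajectory(query, frames)
```

The resulting index sequence is the order in which a sequence model would scan tokens over time; how the paper feeds such trajectories into Mamba is not covered by this sketch.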
Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.
[145] Enhancing Rotated Object Detection via Anisotropic Gaussian Bounding Box and Bhattacharyya Distance
Chien Thai,Mai Xuan Trang,Huong Ninh,Hoang Hiep Ly,Anh Son Le
Main category: cs.CV
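TL;DR: This paper proposes an improved loss function based on a Gaussian bounding box representation and the Bhattacharyya distance to increase the accuracy and robustness of rotated object detection.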
Details
Motivation: Traditional object detection methods perform poorly on rotated objects because they cannot capture orientation changes effectively, which limits them in applications such as aerial imagery, remote sensing, and autonomous driving. Method: Introduces a rotation-invariant loss function, adopts an anisotropic Gaussian representation to address the isotropic-variance problem for square-like objects, and validates the approach by integrating it into state-of-the-art deep learning rotated object detectors. Result: Experiments show that the proposed method significantly improves mean Average Precision (mAP) over existing methods. Conclusion: The method has the potential to become a new benchmark for rotated object detection and suits a wide range of applications that require precise and reliable localization. Abstract: Detecting rotated objects accurately and efficiently is a significant challenge in computer vision, particularly in applications such as aerial imagery, remote sensing, and autonomous driving. Although traditional object detection frameworks are effective for axis-aligned objects, they often underperform in scenarios involving rotated objects due to their limitations in capturing orientation variations. This paper introduces an improved loss function aimed at enhancing detection accuracy and robustness by leveraging the Gaussian bounding box representation and Bhattacharyya distance. In addition, we advocate for the use of an anisotropic Gaussian representation to address the issues associated with isotropic variance in square-like objects. Our proposed method addresses these challenges by incorporating a rotation-invariant loss function that effectively captures the geometric properties of rotated objects. We integrate this proposed loss function into state-of-the-art deep learning-based rotated object detectors, and extensive experiments demonstrate significant improvements in mean Average Precision metrics compared to existing methods. The results highlight the potential of our approach to establish a new benchmark in rotated object detection, with implications for a wide range of applications requiring precise and reliable object localization irrespective of orientation.
[146] VIPAMIN: Visual Prompt Initialization via Embedding Selection and Subspace Expansion
Jaekyun Park,Hye Won Chung
Main category: cs.CV
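TL;DR: Proposes VIPAMIN, a visual prompt initialization strategy that aligns prompts with semantically informative regions of the embedding space and introduces new representational directions, effectively improving the adaptation of self-supervised models in few-shot and challenging tasks.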
Details
Motivation: When applied to self-supervised backbones, existing visual prompt tuning methods often fail to specialize the prompts or enrich the representation space, and perform especially poorly in data-scarce settings and challenging tasks. Method: Proposes VIPAMIN, which strengthens prompt initialization through two mechanisms: aligning prompts with semantically informative regions of the embedding space, and injecting new representational directions beyond the pretrained subspace. The method needs only a single forward pass and lightweight operations. Result: VIPAMIN consistently improves performance across tasks and dataset sizes, is especially strong in few-shot settings, and sets a new state of the art in visual prompt tuning. Conclusion: VIPAMIN is a simple yet effective visual prompt initialization method that markedly strengthens the adaptation of self-supervised models under prompt tuning and offers a new approach to fine-tuning in resource-constrained scenarios. Abstract: In the era of large-scale foundation models, fully fine-tuning pretrained networks for each downstream task is often prohibitively resource-intensive. Prompt tuning offers a lightweight alternative by introducing tunable prompts while keeping the backbone frozen. However, existing visual prompt tuning methods often fail to specialize the prompts or enrich the representation space, especially when applied to self-supervised backbones. We show that these limitations become especially pronounced in challenging tasks and data-scarce settings, where effective adaptation is most critical. In this work, we introduce VIPAMIN, a visual prompt initialization strategy that enhances adaptation of self-supervised models by (1) aligning prompts with semantically informative regions in the embedding space, and (2) injecting novel representational directions beyond the pretrained subspace. Despite its simplicity (it requires only a single forward pass and lightweight operations), VIPAMIN consistently improves performance across diverse tasks and dataset sizes, setting a new state of the art in visual prompt tuning. Our code is available at https://github.com/iamjaekyun/vipamin.
[147] Instance-Aware Pseudo-Labeling and Class-Focused Contrastive Learning for Weakly Supervised Domain Adaptive Segmentation of Electron Microscopy
Shan Xiong,Jiabao Chen,Ye Wang,Jialin Peng
Main category: cs.CV
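TL;DR: Proposes a multitask learning framework based on weakly supervised domain adaptation that uses a cross-teaching mechanism and an instance-aware pseudo-label strategy to exploit sparse point annotations and unlabeled regions, significantly improving mitochondria segmentation in electron microscopy images.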
Details
Motivation: Existing unsupervised domain adaptation methods perform relatively poorly in practice, and full annotation is costly, so there is a need to reduce reliance on dense annotation while improving cross-domain segmentation. Method: Adopts weakly supervised domain adaptation (WDA) with a multitask learning framework that jointly performs segmentation and center detection, combined with a cross-teaching mechanism, class-focused cross-domain contrastive learning, and self-training with a detection-guided instance-aware pseudo-label (IPL) selection strategy. Result: Validated on several challenging datasets, the method outperforms existing UDA and WDA methods, significantly narrows the gap to the fully supervised upper bound, and also outperforms other UDA techniques under the UDA setting. Conclusion: By making effective use of sparse annotations and unlabeled regions, the proposed method achieves high-performance, low-annotation-cost cross-domain mitochondria instance segmentation, with strong practicality and potential for wider adoption. Abstract: Annotation-efficient segmentation of the numerous mitochondria instances from various electron microscopy (EM) images is highly valuable for biological and neuroscience research. Although unsupervised domain adaptation (UDA) methods can help mitigate domain shifts and reduce the high costs of annotating each domain, they typically have relatively low performance in practical applications. Thus, we investigate weakly supervised domain adaptation (WDA) that utilizes additional sparse point labels on the target domain, which require minimal annotation effort and minimal expert knowledge. To make full use of the incomplete and imprecise point annotations, we introduce a multitask learning framework that jointly conducts segmentation and center detection with a novel cross-teaching mechanism and class-focused cross-domain contrastive learning. Since leveraging unlabeled image regions is also essential, we introduce segmentation self-training with a novel instance-aware pseudo-label (IPL) selection strategy. Unlike existing methods that typically rely on pixel-wise pseudo-label filtering, the IPL semantically selects reliable and diverse pseudo-labels with the help of the detection task. Comprehensive validations and comparisons on challenging datasets demonstrate that our method outperforms existing UDA and WDA methods, significantly narrowing the performance gap with the supervised upper bound. Furthermore, under the UDA setting, our method also achieves substantial improvements over other UDA techniques.
[148] NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
Peiran Xu,Xicheng Gong,Yadong MU
Main category: cs.CV
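TL;DR: Proposes a goal-oriented vision-and-language navigation method that uses Q-learning on large-scale unlabeled trajectory data to train a model that predicts the long-term effects of actions, and combines historical information with future prospects for more effective path search.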
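The idea of ranking candidate actions by a history-based score plus a future-oriented score can be illustrated with a small best-first search. This is only a minimal sketch under stated assumptions, not the authors' implementation: the toy room graph, the two score tables, and the `weight` parameter are all hypothetical, and `future_score` merely stands in for the output of a learned future encoder.

```python
import heapq

def foresighted_search(graph, start, goal, history_score, future_score, weight=1.0):
    """Best-first search ranking frontier nodes by history + weighted future score.

    graph: dict mapping node -> list of neighbour nodes (hypothetical toy map).
    history_score / future_score: dicts node -> float, higher is better.
    """
    # heapq is a min-heap, so the combined score is negated to pop the best node first.
    frontier = [(-(history_score[start] + weight * future_score[start]), start, [start])]
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                combined = history_score[nxt] + weight * future_score[nxt]
                heapq.heappush(frontier, (-combined, nxt, path + [nxt]))
    return None  # goal not reachable from start

# Hypothetical toy scene: history alone favours the kitchen, the future score favours the office.
graph = {"hall": ["kitchen", "office"], "kitchen": ["pantry"], "office": ["bedroom"], "bedroom": []}
history = {"hall": 0.0, "kitchen": 0.6, "office": 0.4, "pantry": 0.1, "bedroom": 0.5}
future = {"hall": 0.0, "kitchen": 0.1, "office": 0.7, "pantry": 0.0, "bedroom": 0.9}
path = foresighted_search(graph, "hall", "bedroom", history, future)
```

In this toy run the future term steers the search toward the office first, even though the history score alone prefers the kitchen.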
Details
Motivation: Existing methods mostly decide based on historical information and overlook the future implications and long-term outcomes of actions, so a foresighted agent is needed to improve navigation performance.
Method: Uses a Q-learning framework to train a Q-model on unlabeled trajectory data, producing Q-features that describe potential future information; a cross-modal future encoder fuses the Q-features with the navigation instruction to produce action scores reflecting future prospects, which are combined with history-based scores in an A*-style search strategy for path planning.
Result: Extensive experiments on widely used goal-oriented VLN datasets show that the proposed method effectively improves navigation performance.
Conclusion: By introducing Q-learning-based foresighted modeling, the method strengthens the agent's long-range navigation in complex indoor environments and offers a new direction for vision-and-language navigation.
Abstract: In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn the general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in a traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
[149] HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
Haocheng Tang,Ruoke Yan,Xinhui Yin,Qi Zhang,Xinfeng Zhang,Siwei Ma,Wen Gao,Chuanmin Jia
Main category: cs.CV
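TL;DR: Proposes HGC-Avatar, a hierarchical Gaussian compression framework for efficient transmission and high-quality rendering of dynamic 3D avatars; by introducing human priors and a facial attention mechanism, it achieves better visual quality and compression efficiency than existing methods at low bitrates.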
Details
Motivation: Existing compression methods based on general 3D Gaussian Splatting lack human priors when encoding and transmitting digital humans, which leads to poor bitrate efficiency and decoding quality and makes them hard to apply to streamable 3D avatar systems.
Method: Proposes the HGC-Avatar framework, which decomposes the Gaussian representation into a structural layer (a StyleUNet generates pose-dependent Gaussians) and a motion layer (the SMPL-X model compactly represents temporal pose variations), and introduces a facial attention mechanism to preserve facial detail at low bitrates.
Result: Experiments show that HGC-Avatar significantly outperforms prior methods in visual quality and compression efficiency, and supports layer-wise compression, progressive decoding, and controllable rendering from diverse inputs such as video or text.
Conclusion: HGC-Avatar offers an efficient, high-quality solution for streaming and fast rendering of dynamic 3D avatars, particularly suited to immersive communication.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.
[150] PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Lukas Selch,Yufang Hou,M. Jehanzeb Mirza,Sivan Doveh,James Glass,Rogerio Feris,Wei Lin
Main category: cs.CV
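TL;DR: PRISMM-Bench is the first benchmark built on cross-modal inconsistencies found in real peer reviews, used to evaluate how reliably large multimodal models understand scientific papers. It introduces three tasks and a structured JSON answer representation to reduce linguistic bias; 21 leading models perform poorly, highlighting the difficulty of scientific reasoning.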
Details
Motivation: Existing benchmarks do not faithfully reflect multimodal inconsistencies across text, figures, tables, and equations in scientific papers; they mostly rely on synthetic errors or a single modality and lack real-world complexity and domain specificity, so they cannot effectively assess trustworthy multimodal reasoning in research settings.
Method: Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, 262 real reviewer-flagged inconsistencies are curated from 242 papers to build PRISMM-Bench; three tasks are designed (inconsistency identification, remedy, and pair matching), and a structured JSON answer format is used to reduce superficial linguistic bias in multiple-choice questions.
Result: Evaluation of 21 leading multimodal models, including open-weight models (such as GLM-4.5V and InternVL3) and proprietary models (such as Gemini 2.5 Pro and GPT-5), shows generally low performance (26.1%-54.2%), indicating clear deficiencies in cross-modal scientific reasoning.
Conclusion: PRISMM-Bench shows that current large multimodal models still face major challenges in understanding and reasoning about multimodal inconsistencies in scientific papers, and stresses the need for multimodal AI systems that can support trustworthy research assistance.
Abstract: Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. To address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues.
We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
[151] OOS-DSD: Improving Out-of-stock Detection in Retail Images using Auxiliary Tasks
Franko Šikić,Sven Lončarić
Main category: cs.CV
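TL;DR: Proposes OOS-DSD, a deep learning-based out-of-stock detection method that improves detection through auxiliary learning (product segmentation and depth estimation). Built on the YOLOv8 architecture with a multi-branch structure, it surpasses the previous state of the art by 1.8% mAP, and ablation studies confirm the effectiveness of auxiliary learning and the depth normalization strategy.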
Details
Motivation: Out-of-stock (OOS) detection is critical in retail, but existing methods leave room for improvement in accuracy and robustness, so more effective contextual information and multitask learning are needed to strengthen detection.
Method: Adds two convolutional branches to the YOLOv8 object detection architecture, one for product segmentation and one for scene depth estimation; the depth branch is trained on pseudo-labels generated by Depth Anything V2, and a depth normalization method is proposed to stabilize training.
Result: The method exceeds the previous state-of-the-art OOS detection methods by 1.8% mAP; ablation studies show that auxiliary learning raises mAP by 3.7% and the proposed depth normalization by a further 4.2%.
Conclusion: By adding product segmentation and relative depth estimation as auxiliary tasks, together with an effective depth normalization strategy, OOS-DSD significantly improves out-of-stock detection and confirms the value of multitask auxiliary learning for retail vision tasks.
Abstract: Out-of-stock (OOS) detection is a very important retail verification process that aims to infer the unavailability of products in their designated areas on the shelf. In this paper, we introduce OOS-DSD, a novel deep learning-based method that advances OOS detection through auxiliary learning. In particular, we extend a well-established YOLOv8 object detection architecture with additional convolutional branches to simultaneously detect OOS, segment products, and estimate scene depth. While OOS detection and product segmentation branches are trained using ground truth data, the depth estimation branch is trained using pseudo-labeled annotations produced by the state-of-the-art (SOTA) depth estimation model Depth Anything V2. Furthermore, since the aforementioned pseudo-labeled depth estimates display relative depth, we propose an appropriate depth normalization procedure that stabilizes the training process. The experimental results show that the proposed method surpassed the performance of the SOTA OOS detection methods by 1.8% of the mean average precision (mAP). In addition, ablation studies confirm the effectiveness of auxiliary learning and the proposed depth normalization procedure, with the former increasing mAP by 3.7% and the latter by 4.2%.
[152] Image Categorization and Search via a GAT Autoencoder and Representative Models
Duygu Sap,Martin Lotz,Connor Mattinson
Main category: cs.CV
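TL;DR: Proposes an image categorization and retrieval method based on a graph attention network (GAT) autoencoder, which builds representative models for images and categories and uses graph structure and context-aware latent representations to improve categorization and retrieval.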
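The representative-centric step (compare a query against category representatives, then retrieve the closest image inside the chosen category) can be sketched in a few lines. This is a rough sketch under stated assumptions rather than the paper's procedure: the embeddings are made-up 2-D vectors standing in for GAT autoencoder outputs, the category representative is assumed to be the mean embedding, and cosine similarity is an assumed comparison measure.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def categorize_and_retrieve(query, embeddings_by_category):
    """Pick the category with the closest representative, then the closest member in it."""
    # Assumption: a category representative is the mean of its members' embeddings.
    reps = {c: mean_vector(vs) for c, vs in embeddings_by_category.items()}
    category = max(reps, key=lambda c: cosine(query, reps[c]))
    members = embeddings_by_category[category]
    best_index = max(range(len(members)), key=lambda i: cosine(query, members[i]))
    return category, best_index

# Hypothetical embeddings; in the paper's setting these would come from the autoencoder.
embeddings = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "car": [[0.1, 1.0], [0.2, 0.8]],
}
result = categorize_and_retrieve([0.8, 0.3], embeddings)
```

Comparing against one representative per category keeps the categorization cost proportional to the number of categories, and the per-image comparison is only run inside the selected category.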
Details
Motivation: To improve the accuracy of image categorization and retrieval, similarity relationships between images need to be modeled better and context-aware feature representations extracted.
Method: Builds a graph with images as nodes and similarity relationships as edges, uses a GAT autoencoder to learn context-aware image embeddings, derives category representatives from them, and categorizes a query image and retrieves within its category by comparing its representative with the category representatives.
Result: Experiments compare the representative-centric approach using GAT autoencoders with standard feature-based techniques and demonstrate its effectiveness for image categorization and retrieval.
Conclusion: The representative-centric framework based on a GAT autoencoder effectively captures relationships between images and improves categorization and retrieval performance.
Abstract: We propose a method for image categorization and retrieval that leverages graphs and a graph attention network (GAT)-based autoencoder. Our approach is representative-centric, that is, we execute the categorization and retrieval process via the representative models we construct for the images and image categories. We utilize a graph where nodes represent images (or their representatives) and edges capture similarity relationships. GAT highlights important features and relationships between images, enabling the autoencoder to construct context-aware latent representations that capture the key features of each image relative to its neighbors. We obtain category representatives from these embeddings and categorize a query image by comparing its representative to the category representatives. We then retrieve the most similar image to the query image within its identified category. We demonstrate the effectiveness of our representative-centric approach through experiments with both the GAT autoencoders and standard feature-based techniques.
[153] Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions
Jihoon Kwon,Kyle Min,Jy-yong Sohn
Main category: cs.CV
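TL;DR: Proposes READ, a fine-tuning method that adds two auxiliary objectives, token-level reconstruction and sentence-level alignment, to improve the performance of vision-language models on compositional reasoning tasks.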
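One way to read "adding two auxiliary objectives to the contrastive learning" is as a weighted sum of three loss terms. The formula below is only an illustrative sketch: the weights \lambda_{rec} and \lambda_{align} are hypothetical symbols, and the entry does not state how the authors actually weight or schedule the terms.

```latex
% Hypothetical combined objective (weights are assumptions, not taken from the paper):
\mathcal{L}_{\text{READ}}
  = \mathcal{L}_{\text{contrastive}}
  + \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}}
  + \lambda_{\text{align}}\,\mathcal{L}_{\text{align}}
```

Here \mathcal{L}_{rec} denotes the token-level reconstruction loss and \mathcal{L}_{align} the sentence-level alignment loss between paraphrases; setting both weights to zero recovers plain contrastive fine-tuning.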
Details
Motivation: Vision-language models trained with contrastive learning perform poorly on compositional reasoning, mainly because the text encoder attends to individual words rather than the relations between them.
Method: Adds two auxiliary objectives on top of contrastive learning: (1) token-level reconstruction, where a frozen pre-trained decoder reconstructs alternative captions; and (2) sentence-level alignment, which explicitly aligns the embeddings of paraphrased sentences.
Result: READ-CLIP achieves state-of-the-art performance on five major compositional reasoning benchmarks, outperforming the strongest baseline by up to 4.1%, and the method is also effective on several CLIP variants.
Conclusion: The reconstruction and alignment objectives complement each other, strengthening the model's grasp of relations between words and its consistent representation of different wordings, which markedly improves compositional reasoning.
Abstract: Despite recent advances, vision-language models trained with standard contrastive objectives still struggle with compositional reasoning -- the ability to understand structured relationships between visual and linguistic elements. This shortcoming is largely due to the tendency of the text encoder to focus on individual words rather than their relations, a limitation reinforced by contrastive training that primarily aligns words with visual objects. In this paper, we introduce REconstruction and Alignment of text Descriptions (READ), a fine-tuning method designed to enhance compositional reasoning by adding two auxiliary objectives to the contrastive learning: (1) a token-level reconstruction objective, where a frozen pre-trained decoder reconstructs alternative captions based on the embedding of the original caption; and (2) a sentence-level alignment objective, which explicitly aligns paraphrased sentences in the embedding space. We show that READ-CLIP, a model derived by applying the READ method to the pre-trained CLIP model, achieves state-of-the-art performance across five major compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%. Furthermore, applying READ to existing CLIP variants (including NegCLIP and FSC-CLIP) also improves performance on these benchmarks.
Quantitative and qualitative analyses reveal that our proposed objectives -- reconstruction and alignment -- offer complementary benefits: the former encourages the encoder to capture relationships between words within a caption, while the latter ensures consistent representations for paraphrases expressed with different wording.
[154] Watch Where You Move: Region-aware Dynamic Aggregation and Excitation for Gait Recognition
Binyuan Huang,Yongdong Luo,Xianda Guo,Xiawu Zheng,Zheng Zhu,Jiahui Pan,Chengju Zhou
Main category: cs.CV
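TL;DR: Proposes GaitRDAE, a region-aware dynamic aggregation and excitation framework that addresses insufficient modeling of motion regions in deep learning-based gait recognition and achieves state-of-the-art performance.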
Details
Motivation: Existing methods use predefined regions and fixed temporal scales for temporal modeling, which makes it hard to adapt to the dynamic changes and specific patterns of different motion regions, especially when covariates affect appearance.
Method: Designs two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches for the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes regions with stable behavior patterns and suppresses attention to static regions that are susceptible to covariates.
Result: Experiments show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.
Conclusion: GaitRDAE adaptively models the temporal characteristics of different motion regions and effectively suppresses covariate interference, improving gait recognition accuracy.
Abstract: Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.
[155] Fit for Purpose? Deepfake Detection in the Real World
Guangyu Lin,Li Lin,Christina P. Walker,Daniel S. Schiff,Shu Hu
Main category: cs.CV
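TL;DR: Introduces the first systematic benchmark based on a database of real-world political deepfake incidents and evaluates state-of-the-art detection tools from academia, government, and industry. Existing models generally perform poorly on real political deepfakes from social media and are especially vulnerable to simple manipulations in the video domain, which underscores the need for politically contextualized detection frameworks.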
Details
Motivation: Advances in generative models have caused a surge in AI-generated content and a proliferation of political deepfakes that threaten public trust in political institutions. Existing detection models are mostly built on laboratory data and generalize poorly to real social media settings, so evaluation and improvement against real-world political deepfakes is urgently needed.
Method: Using a database of real political deepfakes circulated on social media since 2018, systematically evaluates leading deepfake detectors from academia, government, and industry, covering both free and paid tools, and tests their generalization under a range of manipulations.
Result: Detectors from academia and government perform relatively poorly; paid tools do somewhat better, but all models are limited on real political deepfakes, are easily disrupted by simple edits (especially in video), and generalize poorly across settings.
Conclusion: Current deepfake detection faces serious challenges in real political contexts; more robust detection frameworks that incorporate political context are needed to improve protection in real-world settings.
Abstract: The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain.
These results underscore the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.
[156] SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
Yiyang Huang,Liang Shi,Yitian Zhang,Yi Xu,Yun Fu
Main category: cs.CV
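TL;DR: This paper is the first to trace object hallucinations in large vision-language models (LVLMs) to the visual encoder, and proposes SHIELD, a training-free framework whose three strategies mitigate statistical bias, inherent bias, and vulnerability, with strong results on multiple benchmarks.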
Details
Motivation: Large vision-language models excel at cross-modal tasks but suffer from object hallucination. Earlier work mostly focused on the language model component; this paper traces and addresses the problem from the visual encoder side.
Method: Proposes the SHIELD framework with three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and combining adversarial attacks with contrastive decoding to address vulnerability. The method requires no training.
Result: Experiments show that SHIELD effectively mitigates object hallucination across LVLM families and benchmarks while keeping strong performance on the general LVLM benchmark, indicating broad applicability.
Conclusion: The visual encoder is an important source of object hallucination in LVLMs, and SHIELD, as a training-free solution, effectively mitigates the three identified issues and has practical potential.
Abstract: Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.
[157] VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Jiaying Zhu,Yurui Zhu,Xin Lu,Wenrui Yan,Dong Li,Kunlin Liu,Xueyang Fu,Zheng-Jun Zha
Main category: cs.CV
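TL;DR: Proposes VisionSelector, a lightweight, learnable visual token selection framework that uses a differentiable Top-K mechanism and a curriculum annealing strategy for efficient, adaptive visual token compression in multimodal large language models.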
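What a retention budget means at inference time can be shown with plain hard top-k selection over per-token scores. This is a simplified sketch, not the differentiable Top-K used in training: the scores are made-up numbers standing in for a scorer module's output, and `select_tokens` is a hypothetical helper name.

```python
def select_tokens(scores, retention):
    """Return indices of the highest-scoring tokens to keep, in their original order.

    retention is the fraction of tokens to keep, for example 0.30 for a 30% budget.
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    # Keep at least one token; the budget is rounded to the nearest integer count.
    k = max(1, round(len(scores) * retention))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept indices so the surviving tokens preserve their sequence order.
    return sorted(ranked[:k])

# Hypothetical importance scores for ten visual tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6, 0.4, 0.5]
kept_30 = select_tokens(scores, 0.30)
kept_10 = select_tokens(scores, 0.10)
```

Hard selection like this has no useful gradient with respect to the scores, which is why a learnable selector needs a differentiable relaxation during training and a strategy for closing the gap to this discrete behaviour at inference.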
Details
Motivation: Existing visual token compression methods rely on heuristic rules that can discard critical information and suffer from biases such as attention sinks, making it hard to maintain performance at high compression ratios.
Method: Formulates token compression as an end-to-end learnable decision process and designs VisionSelector, a module decoupled from the MLLM backbone that uses a differentiable Top-K mechanism and a curriculum annealing strategy to narrow the training-inference gap.
Result: With only 12.85M trainable parameters, it preserves 100% accuracy on MME at a 30% retention budget, outperforms prior methods by 12.14% at a 10% retention budget, and doubles prefill speed.
Conclusion: VisionSelector generalizes across compression rates and adaptively identifies critical tokens, improving the efficiency and performance of MLLMs on high-resolution or multi-image inputs.
Abstract: Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we propose a lightweight plug-and-play framework that reformulates token compression as an end-to-end learnable decision process. To be specific, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection at arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector generalizes across various compression rates and adaptively identifies critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with 30% retention budget, outperforming prior methods by 12.14% at 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector.
[158] A Deep Learning Framework for Real-Time Image Processing in Medical Diagnostics: Enhancing Accuracy and Speed in Clinical Applications
Melika Filvantorkaman,Maral Filvan Torkaman
Main category: cs.CV
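TL;DR: Proposes a deep learning framework for real-time medical image analysis that combines models such as U-Net, EfficientNet, and Transformers with model pruning, quantization, and GPU acceleration to enable efficient diagnosis across modalities including X-ray, CT, and MRI. It offers high accuracy, low latency, and interpretability, supports flexible deployment from edge devices to the cloud, and integrates with PACS and EHR systems.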
Details
Motivation: Interpreting medical images is time-consuming and varies between clinicians, and traditional image processing methods fall short of real-time clinical needs in precision, robustness, and speed.
Method: Integrates U-Net, EfficientNet, and Transformer-based architectures with real-time optimization strategies such as model pruning, quantization, and GPU acceleration to build a deep learning framework deployable on edge devices, local servers, and cloud platforms, together with visual explanation tools such as Grad-CAM and segmentation overlays.
Result: On public benchmark datasets the framework achieves classification accuracies above 92%, segmentation Dice scores above 91%, and inference times below 80 milliseconds.
Conclusion: The framework can substantially speed up diagnostic workflows, reduce clinician workload, and support trustworthy deployment of AI in time-critical healthcare settings.
Abstract: Medical imaging plays a vital role in modern diagnostics; however, interpreting high-resolution radiological data remains time-consuming and susceptible to variability among clinicians. Traditional image processing techniques often lack the precision, robustness, and speed required for real-time clinical use. To overcome these limitations, this paper introduces a deep learning framework for real-time medical image analysis designed to enhance diagnostic accuracy and computational efficiency across multiple imaging modalities, including X-ray, CT, and MRI. The proposed system integrates advanced neural network architectures such as U-Net, EfficientNet, and Transformer-based models with real-time optimization strategies including model pruning, quantization, and GPU acceleration. The framework enables flexible deployment on edge devices, local servers, and cloud infrastructures, ensuring seamless interoperability with clinical systems such as PACS and EHR. Experimental evaluations on public benchmark datasets demonstrate state-of-the-art performance, achieving classification accuracies above 92%, segmentation Dice scores exceeding 91%, and inference times below 80 milliseconds. Furthermore, visual explanation tools such as Grad-CAM and segmentation overlays enhance transparency and clinical interpretability.
These results indicate that the proposed framework can substantially accelerate diagnostic workflows, reduce clinician workload, and support trustworthy AI integration in time-critical healthcare environments.
[159] Self-Supervised Learning to Fly using Efficient Semantic Segmentation and Metric Depth Estimation for Low-Cost Autonomous UAVs
Sebastian Mocanu,Emil Slusanschi,Marius Leordeanu
Main category: cs.CV
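TL;DR: Presents a vision-only autonomous flight system for small indoor UAVs that combines semantic segmentation with monocular depth estimation for obstacle avoidance, exploration, and safe landing, without GPS or LiDAR. An adaptive scale factor algorithm converts non-metric depth into accurate distances (mean error 14.4 cm), and knowledge distillation trains a lightweight U-Net for real-time semantic segmentation. Tests in real and digital-twin environments show that the method increases distance traveled and shortens mission time while maintaining a 100% success rate; an end-to-end learned policy reaches an 87.5% autonomous mission success rate.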
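The step from relative (non-metric) depth to metres can be illustrated with pinhole geometry on floor pixels. This is a sketch under strong assumptions and not the paper's algorithm: it assumes a level camera at a known height above a flat floor (the height is an assumption introduced here), so a floor pixel at image row v lies at forward depth Z = f_y * h / (v - c_y); the scale is then taken as the median ratio between that metric depth and the predicted relative depth. All numbers are made up.

```python
from statistics import median

def metric_scale(ground_pixels, fy, cy, camera_height_m):
    """Estimate the factor that maps relative depth to metres.

    ground_pixels: list of (row, relative_depth) pairs for pixels labelled as floor.
    fy, cy: vertical focal length and principal point row, in pixels.
    """
    ratios = []
    for row, rel_depth in ground_pixels:
        if row <= cy or rel_depth <= 0:
            continue  # at or above the horizon, or invalid depth: no usable constraint
        # Flat-floor, level-camera geometry (assumption): Z = fy * h / (row - cy).
        metric_depth = fy * camera_height_m / (row - cy)
        ratios.append(metric_depth / rel_depth)
    if not ratios:
        raise ValueError("no usable ground pixels")
    # The median keeps the estimate robust to a few mislabelled floor pixels.
    return median(ratios)

# Toy values: fy = 500 px, cy = 240 px, camera 1.0 m above the floor.
ground = [(340, 2.5), (440, 1.25), (490, 1.0)]
scale = metric_scale(ground, fy=500.0, cy=240.0, camera_height_m=1.0)
obstacle_distance_m = scale * 1.5  # relative depth 1.5 predicted at some obstacle pixel
```

Recomputing the factor per frame from the currently visible floor is what would make such a scale adaptive; a tilted camera would need the pitch angle folded into the geometry.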
Details
Motivation: To remove the dependence of small UAVs on expensive sensors such as LiDAR in GPS-denied, resource-constrained indoor environments and enable efficient, low-cost autonomous navigation.
Method: Combines semantic segmentation with monocular depth estimation and proposes an adaptive scale factor algorithm that uses semantic ground detection and camera intrinsics to convert monocular depth into metric distance; uses a knowledge distillation framework in which an SVM or a state-of-the-art model serves as the teacher for a lightweight U-Net student that performs real-time segmentation; generates demonstration data with the best-performing method and trains a compact neural network end to end to execute flight policies.
Result: In a 5x4 meter laboratory environment with eight cardboard obstacles, 30 real flights and 100 digital-twin flights were completed; the combined method significantly increases distance traveled and reduces mission time while maintaining a 100% mission success rate; the end-to-end learned version reaches an 87.5% autonomous mission success rate; the mean distance error of the monocular depth conversion is 14.4 cm.
Conclusion: The work advances vision-based UAV navigation in structured environments, addresses metric depth estimation and computational efficiency, and shows that high-performance autonomous flight can be deployed on resource-constrained platforms.
Abstract: This paper presents a vision-only autonomous flight system for small UAVs operating in controlled indoor environments. The system combines semantic segmentation with monocular depth estimation to enable obstacle avoidance, scene exploration, and autonomous safe landing operations without requiring GPS or expensive sensors such as LiDAR. A key innovation is an adaptive scale factor algorithm that converts non-metric monocular depth predictions into accurate metric distance measurements by leveraging semantic ground plane detection and camera intrinsic parameters, achieving a mean distance error of 14.4 cm. The approach uses a knowledge distillation framework where a color-based Support Vector Machine (SVM) teacher generates training data for a lightweight U-Net student network (1.6M parameters) capable of real-time semantic segmentation. For more complex environments, the SVM teacher can be replaced with a state-of-the-art segmentation model. Testing was conducted in a controlled 5x4 meter laboratory environment with eight cardboard obstacles simulating urban structures. Extensive validation across 30 flight tests in a real-world environment and 100 flight tests in a digital-twin environment demonstrates that the combined segmentation and depth approach increases the distance traveled during surveillance and reduces mission time while maintaining 100% success rates.
The system is further optimized through end-to-end learning, where a compact student neural network learns complete flight policies from demonstration data generated by our best-performing method, achieving an 87.5% autonomous mission success rate. This work advances practical vision-based drone navigation in structured environments, demonstrating solutions for metric depth estimation and computational efficiency challenges that enable deployment on resource-constrained platforms.
[160] MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
Young-Jun Lee,Byung-Kwan Lee,Jianshu Zhang,Yechan Hwang,Byungsoo Ko,Han-Gyu Kim,Dongyu Yao,Xuankun Rong,Eojin Joo,Seung-Ho Han,Bowon Ko,Ho-Jin Choi
Main category: cs.CV
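TL;DR: MultiVerse is a new multi-turn conversation benchmark containing 647 dialogues averaging four turns each, covering 484 tasks and interaction goals, for evaluating vision-language models in complex conversational scenarios.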
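Checklist-based evaluation can be pictured as turning per-item verdicts into a per-turn score and a per-dialogue outcome. The aggregation below is a hypothetical sketch: the entry does not specify how checklist verdicts are combined, so the "fraction satisfied" score, the success threshold, and the item names used here are all assumptions.

```python
def checklist_score(verdicts):
    """Fraction of checklist items judged as satisfied for one dialogue turn."""
    if not verdicts:
        raise ValueError("empty checklist")
    return sum(1 for v in verdicts.values() if v) / len(verdicts)

def dialogue_success(turn_verdicts, threshold=1.0):
    """Assumption: a dialogue succeeds only if every turn reaches the threshold."""
    return all(checklist_score(v) >= threshold for v in turn_verdicts)

# Hypothetical verdicts, as an automated evaluator might return them for two turns.
turns = [
    {"perceptual accuracy": True, "factual correctness": True},
    {"perceptual accuracy": True, "linguistic clarity": False},
]
scores = [checklist_score(t) for t in turns]
```

Requiring every turn to pass makes multi-turn success stricter than averaging, since a single weak turn fails the whole dialogue.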
Details
Motivation: Existing datasets do not adequately cover the complex multi-turn conversations found in real use, so a more comprehensive benchmark is needed to assess the multi-turn interaction abilities of VLMs.
Method: Builds MultiVerse from 12 popular VLM evaluation benchmarks and proposes a checklist-based automated evaluation method that uses GPT-4o as the evaluator and measures performance on 37 key aspects.
Result: Experiments on 18 VLMs show that even the strongest models (such as GPT-4o) reach only a 50% success rate in complex multi-turn conversations; providing the full dialogue context significantly improves smaller models.
Conclusion: MultiVerse is a challenging multi-turn conversation benchmark that exposes the limits of current VLMs in complex dialogue and highlights the importance of in-context learning.
Abstract: Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues - each averaging four turns - derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse provides a landscape for evaluating the multi-turn interaction abilities of VLMs.
[161] Structured Interfaces for Automated Reasoning with 3D Scene Graphs
Aaron Ray,Jacob Arkin,Harel Biggie,Chuchu Fan,Luca Carlone,Nicholas Roy
Main category: cs.CV
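TL;DR: Proposes a retrieval-augmented generation approach that uses a graph database and the Cypher query language as the interface between large language models and 3D scene graphs, grounding natural language more efficiently; it markedly improves task performance on large, rich scenes and reduces token consumption.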
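To make the "query language as a tool" idea concrete, the sketch below builds the kind of Cypher query an LLM might issue to fetch only the relevant part of a scene graph. The graph schema is entirely hypothetical (the `Room` and `Object` labels, the `CONTAINS` relationship, and the `name`, `category`, and `position` properties are assumptions, not taken from the paper), and the helper only constructs the query text and parameters; it does not talk to a database.

```python
def objects_in_room_query(room_name, category):
    """Build a parameterised Cypher query and its parameters (schema is hypothetical)."""
    query = (
        "MATCH (r:Room {name: $room})-[:CONTAINS]->(o:Object) "
        "WHERE o.category = $category "
        "RETURN o.name AS name, o.position AS position"
    )
    # Parameters are passed separately rather than interpolated into the query string.
    return query, {"room": room_name, "category": category}

query, params = objects_in_room_query("kitchen", "mug")
```

Only the rows returned by such a query would need to enter the LLM's context, instead of a serialized copy of the whole scene graph.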
Details
Motivation: Existing methods that serialize a 3D scene graph as text in the LLM input do not scale to large or complex scene graphs, which limits the efficiency and scalability of natural language understanding.
Method: Uses a retrieval-augmented generation (RAG) strategy: the 3D scene graph is stored in a graph database, and the Cypher query language is provided as a tool that the LLM can call to dynamically retrieve the subgraph relevant to the current task.
Result: On instruction following and scene question-answering tasks, compared with context-window and code-generation baselines, the approach scales better, achieves higher task performance, and uses substantially fewer tokens on both local and cloud-based models.
Conclusion: Using Cypher as the query interface to 3D scene graphs effectively improves language grounding by LLMs in complex scenes and is practical and scalable.
Abstract: In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content. A video supplement is available at https://www.youtube.com/watch?v=zY_YI9giZSA.
[162] Universal and Transferable Attacks on Pathology Foundation Models
Yuntian Wang,Xilin Yang,Che-Yung Shen,Nir Pillar,Aydogan Ozcan
Main category: cs.CV
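TL;DR: Proposes UTAP, a universal and transferable adversarial perturbation that exposes critical vulnerabilities in pathology foundation models and causes significant performance drops across multiple models and datasets.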
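The defining property of a universal perturbation is that one fixed noise pattern is added to every input. The sketch below shows only that application step, not how the pattern is optimized. It assumes pixel values in [0, 1] and a per-pixel bound `epsilon` that keeps the noise weak; the bound is an assumption for illustration, since the entry describes the noise only as weak and visually imperceptible.

```python
def apply_universal_perturbation(image, perturbation, epsilon):
    """Add one fixed perturbation to an image, bounded by epsilon and clipped to [0, 1].

    image, perturbation: equally sized 2-D lists of floats (single-channel toy images).
    """
    out = []
    for img_row, pert_row in zip(image, perturbation):
        row = []
        for pixel, delta in zip(img_row, pert_row):
            # Bound each perturbation entry so the change stays small (assumed constraint).
            bounded = max(-epsilon, min(epsilon, delta))
            # Clip the result back to the valid pixel range.
            row.append(max(0.0, min(1.0, pixel + bounded)))
        out.append(row)
    return out

# The same hypothetical pattern is reused for two different toy images.
delta = [[0.5, -0.5], [0.25, -0.25]]
image_a = [[0.0, 1.0], [0.5, 0.5]]
image_b = [[1.0, 0.0], [0.0, 1.0]]
adv_a = apply_universal_perturbation(image_a, delta, epsilon=0.25)
adv_b = apply_universal_perturbation(image_b, delta, epsilon=0.25)
```

Because the pattern does not depend on the input, it can be computed once and reused, which is what makes such an attack cheap to apply to unseen images and models.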
Details
Motivation: To expose the vulnerability of pathology foundation models to adversarial perturbations and encourage the development of more robust models.
Method: Uses deep learning optimization to generate a fixed, weak noise pattern that can be applied broadly to pathology images with different fields of view and data distributions, disrupting the models' feature representations.
Result: UTAP causes significant performance drops across several state-of-the-art pathology foundation models and datasets, and shows both transferability across models and universality across datasets.
Conclusion: UTAP is a broad threat to many pathology foundation models, sets a demanding benchmark for robustness evaluation, and highlights the importance of defense mechanisms and adversarial training.
Abstract: We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse field-of-views independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models - never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern.
The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.
[163] HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications
Christopher Thirgood,Oscar Mendez,Erin Ling,Jon Storey,Simon Hadfield
Main category: cs.CV
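TL;DR: Proposes HYDRA, a new spectral reconstruction method that combines knowledge distillation with multi-scale attention to recover hyperspectral images from natural images in unseen scenes, achieving state-of-the-art performance gains and faster inference.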
Details
Motivation: Existing multi-scale attention methods generalize poorly to modern hyperspectral images with hundreds of channels and struggle to deliver high-quality spectral reconstruction.
Method: Proposes the HYDRA architecture with a teacher-student framework: the teacher model encodes latent hyperspectral information, the student model learns a mapping from natural images to that encoded space, and a novel training method is introduced.
Result: Achieves SOTA performance across all metrics, with an 18% boost in accuracy and faster inference than current SOTA models at various channel depths.
Conclusion: HYDRA addresses the generalization and efficiency limits of existing spectral reconstruction models and makes hyperspectral imaging more practical for computer vision applications.
Abstract: Hyperspectral images (HSI) promise to support a range of new applications in computer vision. Recent research has explored the feasibility of generalizable Spectral Reconstruction (SR), the problem of recovering a HSI from a natural three-channel color image in unseen scenarios. However, previous Multi-Scale Attention (MSA) works have only demonstrated sufficient generalizable results for very sparse spectra, while modern HSI sensors contain hundreds of channels. This paper introduces a novel approach to spectral reconstruction via our HYbrid knowledge Distillation and spectral Reconstruction Architecture (HYDRA). Using a Teacher model that encapsulates latent hyperspectral image data and a Student model that learns mappings from natural images to the Teacher's encoded domain, alongside a novel training method, we achieve high-quality spectral reconstruction. This addresses key limitations of prior SR models, providing SOTA performance across all metrics, including an 18% boost in accuracy, and faster inference times than current SOTA models at various channel depths.
[164] Pursuing Minimal Sufficiency in Spatial Reasoning
Yejie Guo,Yunzhong Hou,Wufei Ma,Meng Tang,Ming-Hsuan Yang
Main category: cs.CV
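TL;DR: This paper proposes MSSR, a dual-agent framework that improves vision-language models on 3D spatial reasoning by constructing a Minimal Sufficient Set (MSS) of information, achieving state-of-the-art performance and producing interpretable reasoning paths.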
Details
Motivation: Existing vision-language models face two bottlenecks in spatial reasoning: inadequate 3D understanding caused by 2D-centric pre-training, and reasoning failures induced by redundant 3D information. Method: Constructs a Minimal Sufficient Set (MSS): a Perception Agent extracts the necessary 3D perception results from expert models, using an SOG module for orientation grounding; a Reasoning Agent then refines the information iteratively in a closed loop, pruning redundancy and requesting what is missing, until the MSS is formed. Result: Experiments show that the method significantly improves accuracy on two challenging benchmarks, reaching state-of-the-art performance while producing interpretable reasoning paths. Conclusion: By explicitly pursuing both sufficiency and minimality of information, MSSR effectively improves the spatial reasoning of VLMs and offers a source of high-quality training data for future models. Abstract: Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.
[165] SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation
Huy Minh Nhat Nguyen,Triet Hoang Minh Dao,Chau Vinh Hoang Truong,Cuong Tuan Nguyen
Main category: cs.CV
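TL;DR: Proposes SDPA++, a self-supervised denoising framework that generates pseudo-ground-truth labels from noisy-only OCT images and trains denoising models with a patch-based strategy, effectively improving image quality without paired clean images.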
Details
Motivation: Because of the speckle noise inherent in Optical Coherence Tomography (OCT) images and the difficulty of acquiring paired clean and noisy images in clinical settings, conventional supervised denoising faces a data bottleneck, so an efficient self-supervised method that needs no clean reference images is required. Method: SDPA++ first generates pseudo-ground-truth images through self-fusion and self-supervised denoising of noisy OCT images, then uses a patch-based aggregation strategy to train multiple denoising models on these pseudo-labels. Result: The method is validated on the real-world OCT dataset from the IEEE SPS VIP Cup, showing improvements in Contrast-to-Noise Ratio (CNR), Mean Square Ratio (MSR), Texture Preservation (TP), and Edge Preservation (EP); the dataset has no clean reference images, which demonstrates the method's practicality. Conclusion: SDPA++ is a general and effective self-supervised denoising framework that can significantly improve OCT image quality without clean images, with potential to improve diagnostic outcomes in clinical practice. Abstract: Optical Coherence Tomography (OCT) is a widely used non-invasive imaging technique that provides detailed three-dimensional views of the retina, which are essential for the early and accurate diagnosis of ocular diseases. Consequently, OCT image analysis and processing have emerged as key research areas in biomedical imaging. However, acquiring paired datasets of clean and real-world noisy OCT images for supervised denoising models remains a formidable challenge due to intrinsic speckle noise and practical constraints in clinical imaging environments. To address these issues, we propose SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation. Our novel approach leverages only noisy OCT images by first generating pseudo-ground-truth images through self-fusion and self-supervised denoising. These refined images then serve as targets to train an ensemble of denoising models using a patch-based strategy that effectively enhances image clarity. Performance improvements are validated via metrics such as Contrast-to-Noise Ratio (CNR), Mean Square Ratio (MSR), Texture Preservation (TP), and Edge Preservation (EP) on the real-world dataset from the IEEE SPS Video and Image Processing Cup. Notably, the VIP Cup dataset contains only real-world noisy OCT images without clean references, highlighting our method's potential for improving image quality and diagnostic outcomes in clinical practice.
[166] Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization
Tianxin Wei,Yifan Chen,Xinrui He,Wenxuan Bao,Jingrui He
Main category: cs.CV
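TL;DR: This paper proposes a new contrastive learning paradigm, domain-connecting contrastive learning (DCCL), which improves domain generalization without domain supervision by strengthening intra-class connectivity across domains, and outperforms existing methods on five standard benchmarks.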
Details
Motivation: Applying contrastive learning directly to domain generalization performs poorly because intra-class connectivity is lacking, especially when training and test samples are subject to distribution shift. Method: Proposes DCCL: on the data side, more aggressive data augmentation and cross-domain positive samples; on the model side, model anchoring and a generative transformation loss, which exploit the intra-class connectivity in pre-trained representations to better embed unseen domains. Result: Experiments on five standard domain generalization benchmarks show that DCCL outperforms state-of-the-art methods even without domain supervision. Conclusion: By strengthening intra-class connectivity across domains, DCCL effectively improves contrastive learning for domain generalization and confirms the importance of connectivity for learning generalizable representations. Abstract: Distribution shifts between training and testing samples frequently occur in practice and impede model generalization performance. This crucial challenge thereby motivates studies on domain generalization (DG), which aim to predict the label on unseen target domain data by solely using data from source domains. It is intuitive to expect that the class-separated representations learned in contrastive learning (CL) would improve DG, while the reality is quite the opposite: users observe that directly applying CL degrades performance. We analyze the phenomenon with insights from CL theory and discover that a lack of intra-class connectivity in the DG setting causes the deficiency. We thus propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance the conceptual connectivity across domains and obtain generalizable representations for DG. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. On the model side, to better embed the unseen test domains, we propose model anchoring to exploit the intra-class connectivity in pre-trained representations and complement the anchoring with generative transformation loss. Extensive experiments on five standard DG benchmarks are performed. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The detailed model implementation and the code are provided through https://github.com/weitianxin/DCCL
[167] HumanCM: One Step Human Motion Prediction
Liu Haojie,Gao Suixiang
Main category: cs.CV
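TL;DR: HumanCM is a one-step human motion prediction framework based on consistency models. By learning a self-consistent mapping between noisy and clean motion states, it generates motion efficiently, sharply reducing inference steps while maintaining or even improving prediction accuracy.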
Details
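Motivation: Existing diffusion models rely on multi-step denoising, which makes inference slow and hard to use in real-time applications, so a more efficient one-step generation method for human motion prediction is needed. Method: Proposes the HumanCM framework, which uses a Transformer-based spatiotemporal architecture with temporal embeddings to model long-range dependencies and preserve motion coherence, and trains a consistency model to map noise to clean motion in a single step. Result: Experiments on Human3.6M and HumanEva-I show that HumanCM matches or exceeds state-of-the-art diffusion models in prediction accuracy while reducing inference steps by up to two orders of magnitude. Conclusion: HumanCM achieves efficient and accurate human motion prediction through consistency models, offering a viable route to real-time motion generation in practical applications. Abstract: We present HumanCM, a one-step human motion prediction framework built upon consistency models. Instead of relying on multi-step denoising as in diffusion-based methods, HumanCM performs efficient single-step generation by learning a self-consistent mapping between noisy and clean motion states. The framework adopts a Transformer-based spatiotemporal architecture with temporal embeddings to model long-range dependencies and preserve motion coherence. Experiments on Human3.6M and HumanEva-I demonstrate that HumanCM achieves comparable or superior accuracy to state-of-the-art diffusion models while reducing inference steps by up to two orders of magnitude.
[168] Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes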
Xiongkun Linghu,Jiangyong Huang,Ziyu Zhu,Baoxiong Jia,Siyuan Huang
Main category: cs.CV
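TL;DR: This paper proposes SCENECOT, a new chain-of-thought reasoning framework for 3D scene understanding, and builds SCENECOT-185K, the first large-scale 3D scene chain-of-thought reasoning dataset, achieving high grounding-QA coherence and strong performance on complex 3D scene question answering.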
Details
Motivation: Existing 3D large language models still struggle with question answering grounded in scenes and objects, mainly because human-like scene-object grounded reasoning has not been explored in depth. Method: Proposes SCENECOT, which decomposes a complex reasoning task into simpler problems and uses multimodal expert modules to build the corresponding visual clues; also builds SCENECOT-185K, a large-scale dataset of 185K high-quality instances, to support the method. Result: Experiments on several complex 3D scene reasoning benchmarks show that the framework achieves strong overall performance with high grounding-QA coherence, the first successful application of chain-of-thought reasoning to 3D scene understanding. Conclusion: SCENECOT enables human-like step-by-step reasoning, offers a new paradigm for grounded reasoning in 3D large language models, and has potential to extend to broader 3D scene understanding tasks. Abstract: Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
[169] Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
Jianbiao Mei,Yu Yang,Xuemeng Yang,Licheng Wen,Jiajun Lv,Botian Shi,Yong Liu
Main category: cs.CV
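TL;DR: Proposes IR-WM, an implicit residual world model that focuses on modeling the current state of the environment and its evolution. By predicting only the changes (residuals) conditioned on ego-vehicle actions and scene context, it achieves top performance in 4D occupancy forecasting and trajectory planning on the nuScenes benchmark.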
Details
Motivation: Existing vision-centric world models often reconstruct future scenes in full, wasting computation on static backgrounds and lowering efficiency. Method: IR-WM first builds a bird's-eye-view (BEV) representation of the current state from visual observations, uses the BEV features of the previous timestep as a temporal prior, and predicts only the "residual" changes driven by ego-vehicle actions and scene context; an alignment module reduces semantic and dynamic misalignment and limits error accumulation. Different forecasting-planning coupling schemes are also explored. Result: On the nuScenes dataset, IR-WM achieves leading performance in both 4D occupancy forecasting and trajectory planning. Conclusion: By focusing on dynamic changes and adding a residual mechanism and an alignment strategy, IR-WM improves world-model efficiency and planning accuracy, and suits end-to-end autonomous driving systems. Abstract: End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common inefficiency in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
[170] UKANFormer: Noise-Robust Semantic Segmentation for Coral Reef Mapping via a Kolmogorov-Arnold Network-Transformer Hybrid
Tianyang Dou,Ming Li,Jiangying Qin,Xuan Liao,Jiageng Zhong,Armin Gruen,Mengyi Deng
Main category: cs.CV
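TL;DR: Proposes UKANFormer, a new semantic segmentation model for high-precision coral reef mapping under noisy-label supervision. Experiments show that it outperforms baseline methods and produces predictions more accurate than its training labels.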
Details
Motivation: Existing global coral reef distribution products, such as the Allen Coral Atlas, are limited in spatial precision and semantic consistency, particularly for fine-grained boundary delineation, so a method that achieves high-precision mapping under noisy supervision is needed. Method: Building on the UKAN architecture, a Global-Local Transformer (GL-Trans) block is added to the decoder to capture both global semantic structure and local boundary detail, improving segmentation under noisy labels. Result: UKANFormer reaches a coral-class IoU of 67.00% and a pixel accuracy of 83.98%, and its predictions are visually and structurally more accurate than the noisy labels used for training. Conclusion: Architectural design can partly overcome the limits of low-quality annotations, offering a workable solution for ecological monitoring tasks that lack reliable labels. Abstract: Coral reefs are vital yet fragile ecosystems that require accurate large-scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distribution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine-grained boundary delineation. To address these challenges, we propose UKANFormer, a novel semantic segmentation model designed to achieve high-precision mapping under noisy supervision derived from Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global-Local Transformer (GL-Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and support scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.
[171] A Comprehensive Survey on World Models for Embodied AI
Xinqing Li,Xin He,Le Zhang,Yun Liu
Main category: cs.CV
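TL;DR: This survey presents a unified framework for world models in embodied AI. It formalizes the problem setting and learning objectives, proposes a three-axis taxonomy covering functionality, temporal modeling, and spatial representation, systematizes data resources and evaluation metrics, compares state-of-the-art models quantitatively, and summarizes key challenges.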
Details
Motivation: To support perception, prediction, and decision making, embodied AI needs internal world models that can simulate environment dynamics, but existing approaches lack a unified framework and evaluation standards. Method: Proposes a three-axis taxonomy: functionality (decision-coupled vs. general-purpose), temporal modeling (sequential simulation and inference vs. global difference prediction), and spatial representation (global latent vector, token feature sequence, spatial latent grid, decomposed rendering representation), and systematically organizes datasets and evaluation metrics. Result: Establishes a unified world-model framework, systematizes data resources and metrics across robotics, autonomous driving, and video, compares SOTA models quantitatively, and identifies current challenges in dataset unification, physical-consistency evaluation, computational efficiency, and long-horizon consistency. Conclusion: The survey provides a systematic framework and taxonomy for world-model research in embodied AI and points to future directions, including building unified datasets, improving metrics to emphasize physical plausibility, and improving the stability and efficiency of long-horizon prediction. Abstract: Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. World models serve as internal simulators that capture environment dynamics, enabling forward and counterfactual rollouts to support perception, prediction, and decision making. This survey presents a unified framework for world models in embodied AI. Specifically, we formalize the problem setting and learning objectives, and propose a three-axis taxonomy encompassing: (1) Functionality, Decision-Coupled vs. General-Purpose; (2) Temporal Modeling, Sequential Simulation and Inference vs. Global Difference Prediction; (3) Spatial Representation, Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. We systematize data resources and metrics across robotics, autonomous driving, and general video settings, covering pixel prediction quality, state-level understanding, and task performance. Furthermore, we offer a quantitative comparison of state-of-the-art models and distill key open challenges, including the scarcity of unified datasets and the need for evaluation metrics that assess physical consistency over pixel fidelity, the trade-off between model performance and the computational efficiency required for real-time control, and the core modeling difficulty of achieving long-horizon temporal consistency while mitigating error accumulation. Finally, we maintain a curated bibliography at https://github.com/Li-Zn-H/AwesomeWorldModels.
[172] Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
Erik Riise,Mehmet Onurcan Kaya,Dim P. Papadopoulos
Main category: cs.CV
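TL;DR: This paper studies the challenge of applying inference-time search to image generation and finds that the discrete, sequential nature of visual autoregressive models enables effective search. Beam search substantially improves text-to-image generation, allowing a 2B-parameter autoregressive model to outperform a 12B-parameter diffusion model. The results indicate that model architecture is critical for inference-time optimization.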
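The beam search named in this entry can be sketched generically as follows. This is a minimal illustration over a toy token model, not the authors' implementation: `toy_model`, the three-token vocabulary, and the optional `verifier` hook are all hypothetical names introduced here. It shows why a discrete, sequential token space suits search: weak partial sequences are pruned early, and surviving prefixes are reused at the next step.

```python
import math


def beam_search(step_log_probs, seq_len, beam_width, verifier=None):
    """Keep the `beam_width` best partial sequences at every step.

    step_log_probs(prefix) returns {token: log-prob of that next token}.
    verifier(seq), if given, returns a bonus added when ranking (assumption).
    """
    beams = [((), 0.0)]  # (token tuple, cumulative model log-prob)

    def rank(item):
        seq, score = item
        return score + (verifier(seq) if verifier else 0.0)

    for _ in range(seq_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Early pruning: only the top-ranked prefixes are extended further.
        candidates.sort(key=rank, reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(prefix):
    """Hypothetical next-token distribution over a 3-token vocabulary."""
    if prefix and prefix[-1] == "a":
        return {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)}
    return {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}


if __name__ == "__main__":
    for seq, score in beam_search(toy_model, seq_len=2, beam_width=2):
        print(seq, round(math.exp(score), 3))
```

In a real text-to-image setting the tokens would be image tokens and the verifier an image-text scoring model; both are outside the scope of this sketch.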
Details
Motivation: Inference-time scaling through search has transformed large language models, but similar progress has been hard to achieve in image generation. Recent attempts to apply search strategies to continuous diffusion models show limited benefit, so more effective approaches are needed. Method: Exploits the discrete, sequential nature of visual autoregressive models, applies beam search to text-to-image generation, and evaluates it through systematic ablations and verifier analysis. Result: Beam search substantially improves text-to-image generation quality, with a 2B-parameter autoregressive model outperforming a 12B-parameter diffusion model across benchmarks; the discrete token space allows early pruning and computational reuse, which drives the advantage. Conclusion: Model architecture (for example, discrete sequence modeling), not just model scale, is critical for inference-time optimization; visual autoregressive models provide favorable conditions for efficient search and may advance image generation. Abstract: While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
[173] Prominence-Aware Artifact Detection and Dataset for Image Super-Resolution
Ivan Molodetskikh,Kirill Malyshev,Mark Mirgaleev,Nikita Zagainov,Evgeney Bogatyrev,Dmitriy Vatolin
Main category: cs.CV
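TL;DR: This paper proposes a new way to assess image super-resolution (SR) artifacts based on their prominence to human observers, builds a dataset of 1302 artifact examples, and trains a lightweight regressor that produces artifact prominence heatmaps.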
Details
Motivation: As SR models grow more capable, the artifacts they produce become a larger problem, yet existing methods treat artifacts as uniform binary defects and ignore how much their prominence to human vision varies, so an assessment that better matches human perception is needed. Method: Collects 1302 artifact examples from 11 modern SR methods, obtains a crowdsourced human prominence score for each artifact, and trains a lightweight regressor on this dataset to produce spatial prominence heatmaps. Result: The trained regressor outperforms existing methods at detecting prominent artifacts and outputs spatial heatmaps that show where artifacts occur in the image and how prominent they are. Conclusion: Artifacts should be assessed by their prominence to human observers rather than treated as simple binary defects; the work provides an effective tool and public data for prominence-aware evaluation and mitigation of SR artifacts. Abstract: Generative image super-resolution (SR) is rapidly advancing in visual quality and detail restoration. As the capacity of SR models expands, however, so does their tendency to produce artifacts: incorrect, visually disturbing details that reduce perceived quality. Crucially, their perceptual impact varies: some artifacts are barely noticeable while others strongly degrade the image. We argue that artifacts should be characterized by their prominence to human observers rather than treated as uniform binary defects. Motivated by this, we present a novel dataset of 1302 artifact examples from 11 contemporary image-SR methods, where each artifact is paired with a crowdsourced prominence score. Building on this dataset, we train a lightweight regressor that produces spatial prominence heatmaps and outperforms existing methods at detecting prominent artifacts. We release the dataset and code to facilitate prominence-aware evaluation and mitigation of SR artifacts.
[174] WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement
Shengyu Zhu,Fan,Fuxuan Zhang
Main category: cs.CV
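TL;DR: This paper proposes WaMaIR, a new framework that combines Global Multiscale Wavelet Transform Convolutions (GMWTConvs), a Mamba-Based Channel-Aware Module (MCAM), and a Multiscale Texture Enhancement Loss (MTELoss) to improve the reconstruction of texture details in image restoration, outperforming existing methods in both performance and computational efficiency.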
Details
Motivation: Existing CNN-based methods are limited by small receptive fields and a lack of channel feature modeling, so they struggle to restore fine texture details. Method: Introduces GMWTConvs to enlarge the receptive field and extract multiscale features; designs the MCAM module to capture long-range dependencies across channels; and proposes the MTELoss loss function to better preserve texture structure. Result: Experiments show that WaMaIR outperforms current state-of-the-art methods on several image restoration tasks, markedly improving texture restoration quality while remaining computationally efficient. Conclusion: By combining a large-receptive-field design, channel-aware modeling, and a dedicated loss function, WaMaIR addresses the loss of texture detail in CNN-based image restoration and offers a new approach to efficient, high-quality restoration. Abstract: Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, a novel framework that has a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expanding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhances the model's sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.
[175] Region in Context: Text-condition Image editing with Human-like semantic reasoning
Thuy Phuong Vu,Dinh-Cuong Hoang,Minhhuy Le,Phan Xuan Tan
Main category: cs.CV
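TL;DR: Proposes Region in Context, a new framework that performs text-conditioned image editing through multilevel semantic alignment, improving the coherence of edits and their consistency with the instruction.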
Details
Motivation: Existing text-based region editing methods ignore the relationship between a local region and the overall scene, which leads to inconsistent edits or a lack of global harmony. Method: Introduces a dual-level guidance mechanism: region-level descriptions are aligned with region representations that carry full-image context, while the whole image is aligned with a scene-level description generated by a large vision-language model, so that local modifications and global structure are optimized together. Result: Experiments show that the method outperforms existing approaches in coherence and instruction alignment, producing more natural and consistent edits. Conclusion: Through explicit verbal references and multilevel alignment, Region in Context improves local-global consistency in text-driven image editing. Abstract: Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: https://github.com/thuyvuphuong/Region-in-Context.git
[176] EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation
Mingzheng Zhang,Jinfeng Gao,Dan Xu,Jiangrui Yu,Yuhan Qiao,Lan Chen,Jin Tang,Xiao Wang
Main category: cs.CV
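TL;DR: This paper proposes EMRRG, an X-ray medical report generation framework based on Mamba networks and parameter-efficient fine-tuning. With an improved vision backbone and decoder structure, it achieves strong performance on multiple benchmark datasets.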
Details
Motivation: Existing medical report generation models rely mostly on large language models and rarely explore pre-trained vision foundation models or advanced fine-tuning techniques, in particular stronger cross-attention mechanisms and non-Transformer architectures. Method: Uses an SSM-based Mamba network as the vision backbone, dividing X-ray images into patches, tokenizing them, and extracting features, with Partial LoRA for parameter-efficient fine-tuning; an LLM with a hybrid decoder generates the report end to end. Result: Extensive experiments on three widely used X-ray report generation benchmarks validate the effectiveness of the proposed method and show strong generation performance. Conclusion: The EMRRG framework demonstrates the potential of the Mamba architecture for medical report generation and the benefit of combining parameter-efficient fine-tuning with a non-Transformer vision backbone, opening a new research direction for the field. Abstract: X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens for clinicians and patient wait times. Existing MRG models predominantly rely on Large Language Models (LLMs) to improve report generation, with limited exploration of pre-trained vision foundation models or advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or utilize simplistic methods like LoRA, often neglecting the potential of enhancing cross-attention mechanisms. Additionally, while Transformer-based models dominate vision-language tasks, non-Transformer architectures, such as the Mamba network, remain underexplored for medical report generation, presenting a promising avenue for future research. In this paper, we propose EMRRG, a novel X-ray report generation framework that fine-tunes pre-trained Mamba networks using parameter-efficient methods. Specifically, X-ray images are divided into patches, tokenized, and processed by an SSM-based vision backbone for feature extraction, with Partial LoRA yielding optimal performance. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training and achieving strong results on benchmark datasets. Extensive experiments on three widely used benchmark datasets fully validated the effectiveness of our proposed strategies for the X-ray MRG. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.
[177] GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation
Junbo Li,Weimin Yuan,Yinuo Wang,Yue Zeng,Shihao Shu,Cai Meng,Xiangzhi Bai
Main category: cs.CV
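TL;DR: Proposes GS2POSE, a new 6D object pose estimation method that extends the capabilities of 3DGS and uses Lie algebra to build a differentiable rendering pipeline, iteratively optimizing the pose and improving accuracy on multiple datasets.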
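As background for the Lie algebra formulation this entry mentions, a common way to make pose optimization differentiable is to represent a pose update as a 6-vector twist in se(3) and map it to a rigid transform with the exponential map. The sketch below shows that standard map with NumPy; it is an illustration under that assumption, and the entry does not confirm that GS2POSE uses exactly this parameterization. The names `hat` and `se3_exp` are introduced here.

```python
import numpy as np


def hat(w):
    """Skew-symmetric matrix such that hat(w) @ v == np.cross(w, v)."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def se3_exp(xi):
    """Map a twist xi = (rho, omega) in se(3) to a 4x4 transform in SE(3)."""
    xi = np.asarray(xi, dtype=float)
    rho, omega = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    W = hat(omega)
    if theta < 1e-8:
        # First-order approximation near the identity.
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)  # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)  # couples rotation and translation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T


if __name__ == "__main__":
    # Quarter turn about the z-axis with no translation.
    print(np.round(se3_exp([0, 0, 0, 0, 0, np.pi / 2]), 3))
```

Because the twist is an unconstrained 6-vector, a gradient step on it always yields a valid rotation after the exponential map, which is why this parameterization is popular for iterative pose refinement.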
Details
Motivation: Existing methods have difficulty with textureless objects and changing illumination, so a more robust pose estimation method is needed. Method: Designs a pose regression algorithm based on the principles of Bundle Adjustment, uses Lie algebra to build a pose-differentiable rendering pipeline for 3DGS, iteratively optimizes the pose by comparing the input image with the rendered image, and updates the color parameters in the 3DGS model to adapt to illumination changes. Result: Improves accuracy over previous methods by 1.4%, 2.8%, and 2.5% on the T-LESS, LineMod-Occlusion, and LineMod datasets, respectively. Conclusion: GS2POSE improves 6D pose estimation accuracy in challenging conditions, with greater robustness and adaptability. Abstract: Accurate 6D pose estimation of 3D objects is a fundamental task in computer vision, and current research typically predicts the 6D pose by establishing correspondences between 2D image features and 3D model features. However, these methods often face difficulties with textureless objects and varying illumination conditions. To overcome these limitations, we propose GS2POSE, a novel approach for 6D object pose estimation. GS2POSE formulates a pose regression algorithm inspired by the principles of Bundle Adjustment (BA). By leveraging Lie algebra, we extend the capabilities of 3DGS to develop a pose-differentiable rendering pipeline, which iteratively optimizes the pose by comparing the input image to the rendered image. Additionally, GS2POSE updates color parameters within the 3DGS model, enhancing its adaptability to changes in illumination. Compared to previous models, GS2POSE demonstrates accuracy improvements of 1.4%, 2.8% and 2.5% on the T-LESS, LineMod-Occlusion and LineMod datasets, respectively.
[178] Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
Shihao Ji,Zihui Song
Main category: cs.CV
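TL;DR: Proposes a training-free video understanding framework that combines pre-trained vision-language models with classic machine learning algorithms to perform zero-shot structural analysis of video content and generate multi-modal summaries.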
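The event-segmentation step in this entry relies on Kernel Temporal Segmentation (KTS). The sketch below is a simplified version under stated assumptions: it uses a linear kernel and a fixed number of segments, whereas full KTS also selects the segment count with a penalty term, and the features here are toy vectors rather than VLM embeddings. The names `segment_costs` and `kts_fixed` are introduced for illustration. The core idea is dynamic programming over change points that minimizes the within-segment scatter of the feature trajectory.

```python
import numpy as np


def segment_costs(K):
    """cost[i, j] = within-segment scatter of frames i..j-1 under kernel K."""
    n = K.shape[0]
    diag_cum = np.concatenate([[0.0], np.cumsum(np.diag(K))])
    S = np.zeros((n + 1, n + 1))  # 2D prefix sums for O(1) block sums of K
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            block = S[j, j] - S[i, j] - S[j, i] + S[i, i]
            cost[i, j] = (diag_cum[j] - diag_cum[i]) - block / (j - i)
    return cost


def kts_fixed(features, num_segments):
    """Return the start indices of segments 2..num_segments (change points)."""
    X = np.asarray(features, dtype=float)
    n = len(X)
    cost = segment_costs(X @ X.T)  # linear kernel (assumption)
    dp = np.full((num_segments + 1, n + 1), np.inf)
    back = np.zeros((num_segments + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for m in range(1, num_segments + 1):
        for j in range(m, n + 1):
            # Last segment is frames i..j-1, with i ranging over m-1..j-1.
            options = dp[m - 1, m - 1:j] + cost[m - 1:j, j]
            best = int(np.argmin(options))
            dp[m, j] = options[best]
            back[m, j] = best + (m - 1)
    bounds, j = [], n
    for m in range(num_segments, 0, -1):  # walk the back-pointers
        j = int(back[m, j])
        bounds.append(j)
    return sorted(bounds)[1:]  # drop the leading 0


if __name__ == "__main__":
    toy = [[0.0, 0.0]] * 3 + [[5.0, 5.0]] * 4
    print(kts_fixed(toy, num_segments=2))
```

In the pipeline this entry describes, the resulting segments would then be grouped by density-based clustering; that step is not shown here.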
Details
Motivation: Existing video understanding models depend on task-specific training with large annotated datasets, which is costly and scales poorly, while the zero-shot reasoning ability of large models on static images has not yet transferred effectively to video. Method: Reframes video understanding as a self-supervised spatio-temporal clustering problem in a high-dimensional semantic feature space: a frozen pre-trained VLM visual encoder extracts a semantic feature trajectory, Kernel Temporal Segmentation (KTS) divides it into events, unsupervised density-based clustering identifies recurring macroscopic scenes, and the VLM generates textual descriptions. Result: Achieves training-free structural analysis of video, automatically dividing it into semantically coherent event segments and producing multi-modal summaries; effectiveness and interpretability are demonstrated on a variety of videos. Conclusion: The framework offers an efficient, interpretable, and model-agnostic path to zero-shot video understanding, combining the semantic priors of pre-trained VLMs with the pattern-discovery ability of classic machine learning. Abstract: The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content.
This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
[179] Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs
Jiazhen Liu,Long Chen
Main category: cs.CV
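TL;DR: LENS is a plug-and-play method that gives a frozen multimodal large language model segmentation capability by attaching a lightweight module, with no fine-tuning, preserving the model's generalization while achieving strong performance.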
Details
Motivation: Adding segmentation to existing multimodal large language models usually requires fine-tuning, which alters the output space and harms the model's unity and generalization, so a solution that does not depend on fine-tuning is needed. Method: Proposes LENS, which attaches a lightweight, trainable head to a completely frozen MLLM, uses the spatial cues in attention maps to extract keypoints, and produces point-wise features compatible with the mask decoder. Result: Experiments show that LENS matches or exceeds retraining-based methods on segmentation while fully preserving the MLLM's original generalization ability. Conclusion: LENS offers an efficient and powerful paradigm for extending the pixel-level understanding of multimodal large models without breaking their unity, advancing the development of truly versatile unified models. Abstract: Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.
[180] Unsupervised Monocular Road Segmentation for Autonomous Driving via Scene Geometry
Sara Hatami Rostami,Behrooz Nasihatkon
Main category: cs.CV
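TL;DR: Proposes a fully unsupervised binary road segmentation method that uses scene geometry and temporal cues to generate weak labels and refine the segmentation, reaching an IoU of 0.82 on the Cityscapes dataset.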
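The two concrete pieces of this entry, geometric weak labels and the IoU metric, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the horizon row, the quadrilateral corners, and the function names `weak_road_labels` and `binary_iou` are assumptions made for the example, and the temporal refinement stage is not shown.

```python
import numpy as np


def weak_road_labels(height, width, horizon_row, quad):
    """Label map: 1 = road, 0 = non-road, -1 = unlabeled.

    quad is a convex quadrilateral given as four (x, y) pixel corners in order.
    """
    labels = np.full((height, width), -1, dtype=np.int8)
    labels[:horizon_row, :] = 0  # pixels above the horizon are non-road
    ys, xs = np.mgrid[0:height, 0:width]
    pos = np.ones((height, width), dtype=bool)
    neg = np.ones((height, width), dtype=bool)
    for k in range(len(quad)):
        x0, y0 = quad[k]
        x1, y1 = quad[(k + 1) % len(quad)]
        cross = (x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0)
        pos &= cross >= 0
        neg &= cross <= 0
    # Inside a convex polygon all edge tests share one sign, whatever the winding.
    labels[pos | neg] = 1
    return labels


def binary_iou(pred, target):
    """Intersection-over-Union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)


if __name__ == "__main__":
    lab = weak_road_labels(6, 8, horizon_row=2, quad=[(2, 5), (5, 5), (4, 3), (3, 3)])
    print(lab)
    print(binary_iou(lab == 1, lab == 1))
```

Pixels left at -1 are the ones a refinement stage would have to resolve; the reported 0.82 is the IoU between the final road mask and the ground-truth road mask.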
Details
Motivation: To reduce reliance on costly manually labeled data and enable scalable road segmentation for autonomous driving. Method: First generates weak labels from geometric priors (pixels above the horizon are non-road; a predefined quadrilateral in front of the vehicle is road), then refines the labels by tracking feature points across frames and enforcing temporal consistency through mutual information maximization. Result: Achieves an IoU of 0.82 on the Cityscapes dataset, with high accuracy and good temporal stability. Conclusion: Combining geometric constraints with temporal consistency shows strong potential for unsupervised road segmentation and suits large-scale autonomous driving applications. Abstract: This paper presents a fully unsupervised approach for binary road segmentation (road vs. non-road), eliminating the reliance on costly manually labeled datasets. The method leverages scene geometry and temporal cues to distinguish road from non-road regions. Weak labels are first generated from geometric priors, marking pixels above the horizon as non-road and a predefined quadrilateral in front of the vehicle as road. In a refinement stage, temporal consistency is enforced by tracking local feature points across frames and penalizing inconsistent label assignments using mutual information maximization. This enhances both precision and temporal stability. On the Cityscapes dataset, the model achieves an Intersection-over-Union (IoU) of 0.82, demonstrating high accuracy with a simple design. These findings demonstrate the potential of combining geometric constraints and temporal consistency for scalable unsupervised road segmentation in autonomous driving.
[181] Personalized Image Filter: Mastering Your Photographic Style
Chengxuan Zhu,Shuchen Weng,Jiacong Fang,Peixuan Zhang,Si Li,Chao Xu,Boxin Shi
Main category: cs.CV
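TL;DR: This paper proposes the Personalized Image Filter (PIF), built on a pre-trained text-to-image diffusion model. Using the generative prior and textual inversion, it learns and transfers photographic style, effectively extracting and converting diverse styles while preserving the content of the content image.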
Details
Motivation: Existing methods either fail to learn meaningful photographic concepts from reference images or fail to preserve the content of the content image, so a method that better understands and transfers photographic style is needed. Method: Built on a pre-trained text-to-image diffusion model, PIF uses the generative prior and textual inversion, learning the photographic style of reference images by optimizing the text prompts for photographic concepts. Result: PIF performs very well at extracting and transferring various photographic styles while preserving the content of the content image. Conclusion: PIF learns and transfers photographic style efficiently, offering a new solution for personalized image filtering. Abstract: Photographic style, as a composition of certain photographic concepts, is the charm behind renowned photographers. But learning and transferring photographic style requires a profound understanding of how the photo is edited from the unknown original appearance. Previous works either fail to learn meaningful photographic concepts from reference images, or cannot preserve the content of the content image. To tackle these issues, we propose a Personalized Image Filter (PIF). Based on a pretrained text-to-image diffusion model, the generative prior enables PIF to learn the average appearance of photographic concepts, as well as how to adjust them according to text prompts. PIF then learns the photographic style of reference images with the textual inversion technique, by optimizing the prompts for the photographic concepts. PIF shows outstanding performance in extracting and transferring various kinds of photographic style. Project page: https://pif.pages.dev/
[182] An RGB-D Image Dataset for Lychee Detection and Maturity Classification for Robotic Harvesting
Zhenpeng Zhang,Yi Wang,Shanglei Chai,Yingying Liu,Zekai Xie,Wenhao Huang,Pengyu Li,Zipei Luo,Dajiang Lu,Yibin Tian
Main category: cs.CV
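TL;DR: This paper builds a high-quality, publicly available lychee image dataset to support the development of vision-based harvesting robots. It covers multiple varieties and ripeness stages and comes with detailed annotations and statistical analysis.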
Details
Motivation: There is currently no open-source lychee dataset captured in natural growing environments with consistent and comprehensive annotations, which limits the development of vision-based harvesting robots.
Method: RGB images were collected under different weather conditions, times of day, and varieties (such as Nuomici and Feizixiao), and the dataset was extended with data augmentation and depth images. Three annotators labeled the data independently and a fourth reviewer verified the results to improve consistency.
Result: The dataset contains 11,414 images across three ripeness stages: 878 raw RGB images, 8,780 augmented RGB images, and 1,756 depth images, with 9,658 pairs of detection and maturity-classification labels. Three deep learning models were used to validate the dataset.
Conclusion: The dataset fills an existing gap, provides a reliable resource for lychee detection and maturity classification research, and is publicly available for academic use.
Abstract: Lychee is a high-value subtropical fruit. The adoption of vision-based harvesting robots can significantly improve productivity while reducing reliance on labor. High-quality data are essential for developing such harvesting robots. However, there are currently no consistently and comprehensively annotated open-source lychee datasets featuring fruits in natural growing environments. To address this, we constructed a dataset to facilitate lychee detection and maturity classification. Color (RGB) images were acquired under diverse weather conditions, and at different times of the day, across multiple lychee varieties, such as Nuomici, Feizixiao, Heiye, and Huaizhi. The dataset encompasses three different ripeness stages and contains 11,414 images, consisting of 878 raw RGB images, 8,780 augmented RGB images, and 1,756 depth images. The images are annotated with 9,658 pairs of labels for lychee detection and maturity classification. To improve annotation consistency, three individuals independently labeled the data, and their results were then aggregated and verified by a fourth reviewer. Detailed statistical analyses were done to examine the dataset. Finally, we performed experiments using three representative deep learning models to evaluate the dataset. It is publicly available for academic
[183] ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification
Yahia Battach,Abdulwahab Felemban,Faizan Farooq Khan,Yousef A. Radwan,Xiang Li,Fabio Marchese,Sara Beery,Burton H. Jones,Francesca Benzoni,Mohamed Elhoseiny
Main category: cs.CV
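TL;DR: ReefNet is a large public coral reef image dataset with approximately 925,000 expert-verified, genus-level hard coral point labels mapped to the World Register of Marine Species (WoRMS), supporting fine-grained, global-scale coral classification and domain-generalization evaluation.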
Details
Motivation: Coral reefs are degrading rapidly because of climate change and other human activity, so scalable, automated monitoring is urgently needed. Existing datasets are limited in size, geographic coverage, or label granularity, and lack ML-ready annotations.
Method: Imagery from 76 CoralNet sources and the Al Wajh site in the Red Sea is aggregated into the ReefNet dataset with fine-grained taxonomic labels. Two evaluation settings are proposed: a within-source split for localized evaluation and a cross-source split for testing domain generalization. Both supervised and zero-shot classification are evaluated.
Result: Supervised models perform well within source but degrade sharply across sources. All zero-shot models perform poorly, especially on rare and visually similar coral genera. This exposes the challenges current methods face in domain generalization and fine-grained recognition.
Conclusion: ReefNet provides a challenging benchmark intended to advance domain adaptation and fine-grained coral classification and to support global coral reef monitoring and conservation.
Abstract: Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925,000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained labels at a global scale, taxonomically mapped to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
[184] Robust Cross-Domain Adaptation in Texture Features Transferring for Wood Chip Moisture Content Prediction
Abdur Rahman,Mohammad Marufuzzaman,Jason Street,Haifeng Wang,Veera G. Gude,Randy Buchanan
Main category: cs.CV
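TL;DR: This study proposes AdaptMoist, a domain adaptation method that uses five types of texture features to predict wood chip moisture content and substantially improves cross-domain prediction accuracy.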
Details
Motivation: Existing moisture content prediction methods lose accuracy when wood chips come from different sources, because the data distribution shifts, and so lack robustness.
Method: Five distinct texture feature types are extracted from wood chip images and combined. The AdaptMoist domain adaptation method is proposed, with a model-saving criterion based on adjusted mutual information.
Result: The combined texture features reach 95% accuracy. AdaptMoist raises average cross-domain accuracy to 80%, compared with 57% for non-adapted models, a gain of 23 percentage points.
Conclusion: AdaptMoist is an effective solution for cross-domain wood chip moisture content prediction, with potential for use in wood chip-reliant industries.
Abstract: Accurate and quick prediction of wood chip moisture content is critical for optimizing biofuel production and ensuring energy efficiency. The current widely used direct method (oven drying) is limited by its longer processing time and sample destructiveness. On the other hand, existing indirect methods, including near-infrared spectroscopy-based, electrical capacitance-based, and image-based approaches, are quick but not accurate when wood chips come from various sources. Variability in the source material can alter data distributions, undermining the performance of data-driven models. Therefore, there is a need for a robust approach that effectively mitigates the impact of source variability. Previous studies show that manually extracted texture features have the potential to predict wood chip moisture class. Building on this, in this study, we conduct a comprehensive analysis of five distinct texture feature types extracted from wood chip images to predict moisture content. Our findings reveal that a combined feature set incorporating all five texture features achieves an accuracy of 95% and consistently outperforms individual texture features in predicting moisture content. To ensure robust moisture prediction, we propose a domain adaptation method named AdaptMoist that utilizes the texture features to transfer knowledge from one source of wood chip data to another, addressing variability across different domains. We also propose a criterion for model saving based on adjusted mutual information. The AdaptMoist method improves prediction accuracy across domains by 23 percentage points, achieving an average accuracy of 80%, compared to 57% for non-adapted models.
These results highlight the effectiveness of AdaptMoist as a robust solution for wood chip moisture content estimation across domains, making it a potential solution for wood chip-reliant industries.
[185] From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display
Xiangyu Mu,Dongliang Zhou,Jie Hou,Haijun Zhang,Weili Guan
Main category: cs.CV
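TL;DR: This paper proposes M2HVideo, a framework that generates identity-controllable, photorealistic videos of humans wearing clothing from mannequin footage, addressing head-body motion misalignment and identity drift.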
Details
Motivation: Mannequin-based displays lack realism and expressive detail and cannot meet the demand for high-quality video in online fashion presentation, so a method is needed to convert mannequin footage into photorealistic videos of humans wearing the clothing.
Method: The M2HVideo framework comprises a dynamic pose-aware head encoder, a mirror loss applied in pixel space (via DDIM-based denoising), and a distribution-aware adapter, which together preserve identity and temporal consistency.
Result: Experiments on the UBC, ASOS, and self-collected MannequinVideos datasets show that the method outperforms existing state-of-the-art methods in clothing consistency, identity preservation, and video quality.
Conclusion: M2HVideo achieves mannequin-to-human video generation with strong identity control, detail preservation, and temporal coherence, and has potential for online fashion presentation.
Abstract: Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose-aware and identity-preserving video generation framework that addresses two key challenges: the misalignment between head and body motion, and identity drift caused by temporal modeling. In particular, M2HVideo incorporates a dynamic pose-aware head encoder that fuses facial semantics with body pose to produce consistent identity embeddings across frames. To address the loss of fine facial details due to latent space compression, we introduce a mirror loss applied in pixel space through a denoising diffusion implicit model (DDIM)-based one-step denoising. Additionally, we design a distribution-aware adapter that aligns statistical distributions of identity and clothing features to enhance temporal coherence. Extensive experiments on the UBC fashion dataset, our self-constructed ASOS dataset, and the newly collected MannequinVideos dataset captured on-site demonstrate that M2HVideo achieves superior performance in terms of clothing consistency, identity preservation, and video fidelity in comparison to state-of-the-art methods.
[186] 2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting
Haofan Ren,Qingsong Yan,Ming Lu,Rongfeng Lu,Zunjie Zhu
Main category: cs.CV
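TL;DR: Proposes 2DGS-R, which uses a hierarchical training strategy to substantially improve rendering quality while preserving geometric accuracy, with almost no additional storage or training time.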
Details
Motivation: 3D Gaussian Splatting struggles to represent surfaces accurately. 2DGS improves geometric fidelity but at the cost of rendering quality, and a single training stage cannot satisfy both.
Method: A hierarchical training strategy is used: first train the original 2D Gaussians with normal consistency regularization; then select the regions with inadequate rendering quality and enhance them with in-place cloning; finally fine-tune the model with opacity frozen.
Result: Compared with the original 2DGS, the method needs only 1% more storage and minimal extra training time, while achieving high-quality rendering and preserving fine geometric structure.
Conclusion: 2DGS-R balances efficiency and performance, improving both visual fidelity and geometric reconstruction accuracy.
Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have greatly influenced neural fields, as it enables high-fidelity rendering with impressive visual quality. However, 3DGS has difficulty accurately representing surfaces. In contrast, 2DGS transforms the 3D volume into a collection of 2D planar Gaussian disks. Despite advancements in geometric fidelity, rendering quality remains compromised, highlighting the challenge of achieving both high-quality rendering and precise geometric structures. This indicates that optimizing both geometric and rendering quality in a single training stage is currently unfeasible. To overcome this limitation, we present 2DGS-R, a new method that uses a hierarchical training approach to improve rendering quality while maintaining geometric accuracy. 2DGS-R first trains the original 2D Gaussians with the normal consistency regularization. Then 2DGS-R selects the 2D Gaussians with inadequate rendering quality and applies a novel in-place cloning operation to enhance the 2D Gaussians. Finally, we fine-tune the 2DGS-R model with opacity frozen. Experimental results show that compared to the original 2DGS, our method requires only 1% more storage and minimal additional training time. Despite this negligible overhead, it achieves high-quality rendering results while preserving fine geometric structures. These findings indicate that our approach effectively balances efficiency with performance, leading to improvements in both visual fidelity and geometric reconstruction accuracy.
[187] ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
Akhila Kambhatla,Taminul Islam,Khaled R Ahmed
Main category: cs.CV
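TL;DR: This paper proposes ArmFormer, a lightweight transformer-based semantic segmentation framework for pixel-level weapon detection. It combines CBAM with the MixVisionTransformer architecture to reach state-of-the-art segmentation performance while remaining computationally efficient.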
Details
Motivation: Traditional weapon detection methods provide only coarse bounding-box localization and lack fine-grained segmentation. Existing semantic segmentation models struggle to balance accuracy and efficiency and are hard to deploy on edge devices.
Method: The ArmFormer framework uses a CBAM-enhanced encoder backbone and an attention-integrated hamburger decoder, combining the Convolutional Block Attention Module with MixVisionTransformer to segment five categories: handgun, rifle, knife, revolver, and human.
Result: ArmFormer reaches 80.64% mIoU and 89.13% mFscore on the five-category segmentation task, with real-time inference at 82.26 FPS, using only 4.886G FLOPs and 3.66M parameters. It outperforms heavyweight models that require up to 48x more computation.
Conclusion: ArmFormer balances accuracy and efficiency well, making it a strong weapon segmentation solution for edge devices such as portable security cameras, surveillance drones, and embedded AI accelerators.
Abstract: The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines a CBAM-enhanced encoder backbone with an attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS.
With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.
[188] BARL: Bilateral Alignment in Representation and Label Spaces for Semi-Supervised Volumetric Medical Image Segmentation
Shujian Gao,Yuan Wang,Zekuan Yu
Main category: cs.CV
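TL;DR: This paper proposes BARL, a semi-supervised medical image segmentation framework that improves performance by enforcing bilateral alignment in both label space and representation space.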
Details
Motivation: Existing semi-supervised medical image segmentation methods focus on label-space consistency and neglect representation-space alignment, so the learned features are insufficiently discriminative and spatially coherent.
Method: The BARL framework uses Dual-Path Regularization (DPR) and Progressively Cognitive Bias Correction (PCBC) for label-space alignment, and region-level and lesion-instance matching for representation-space alignment.
Result: Experiments on four public datasets and one private CBCT dataset show that BARL consistently surpasses existing state-of-the-art methods. Ablation studies validate each component.
Conclusion: By jointly optimizing alignment in label and representation spaces, BARL substantially improves semi-supervised medical image segmentation and shows good application prospects.
Abstract: Semi-supervised medical image segmentation (SSMIS) seeks to match fully supervised performance while sharply reducing annotation cost. Mainstream SSMIS methods rely on label-space consistency, yet they overlook the equally critical representation-space alignment. Without harmonizing latent features, models struggle to learn representations that are both discriminative and spatially coherent. To this end, we introduce Bilateral Alignment in Representation and Label spaces (BARL), a unified framework that couples two collaborative branches and enforces alignment in both spaces. For label-space alignment, inspired by co-training and multi-scale decoding, we devise Dual-Path Regularization (DPR) and Progressively Cognitive Bias Correction (PCBC) to impose fine-grained cross-branch consistency while mitigating error accumulation from coarse to fine scales. For representation-space alignment, we conduct region-level and lesion-instance matching between branches, explicitly capturing the fragmented, complex pathological patterns common in medical imagery. Extensive experiments on four public benchmarks and a proprietary CBCT dataset demonstrate that BARL consistently surpasses state-of-the-art SSMIS methods. Ablative studies further validate the contribution of each component. Code will be released soon.
[189] Registration is a Powerful Rotation-Invariance Learner for 3D Anomaly Detection
Yuyang Yu,Zhengwei Chen,Xuemiao Xu,Lei Zhang,Haoxin Yang,Yongwei Nie,Shengfeng He
Main category: cs.CV
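TL;DR: Proposes a registration-induced, rotation-invariant feature extraction framework that couples point-cloud registration with memory-based anomaly detection. Jointly optimizing alignment and representation learning improves the robustness and discriminative power of 3D point-cloud anomaly detection.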
Details
Motivation: Existing memory bank-based methods fall short in feature-transformation consistency, local geometric detail, and rotation invariance, and their detection results become unreliable when registration fails.
Method: Feature extraction is embedded into the registration learning process, and the point-cloud registration and anomaly detection objectives are optimized jointly to obtain rotation-invariant, locally discriminative features.
Result: Experiments on the Anomaly-ShapeNet and Real3D-AD datasets show that the method outperforms existing approaches in both detection effectiveness and generalizability.
Conclusion: Registration-induced feature extraction improves 3D point-cloud anomaly detection, confirming the value of combining registration with representation learning.
Abstract: 3D anomaly detection in point-cloud data is critical for industrial quality control, aiming to identify structural defects with high reliability. However, current memory bank-based methods often suffer from inconsistent feature transformations and limited discriminative capacity, particularly in capturing local geometric details and achieving rotation invariance. These limitations become more pronounced when registration fails, leading to unreliable detection results. We argue that point-cloud registration plays an essential role not only in aligning geometric structures but also in guiding feature extraction toward rotation-invariant and locally discriminative representations. To this end, we propose a registration-induced, rotation-invariant feature extraction framework that integrates the objectives of point-cloud registration and memory-based anomaly detection. Our key insight is that both tasks rely on modeling local geometric structures and leveraging feature similarity across samples. By embedding feature extraction into the registration learning process, our framework jointly optimizes alignment and representation learning. This integration enables the network to acquire features that are both robust to rotations and highly effective for anomaly detection. Extensive experiments on the Anomaly-ShapeNet and Real3D-AD datasets demonstrate that our method consistently outperforms existing approaches in effectiveness and generalizability.
[190] Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding
Yudan Ren,Xinlong Wang,Kexin Wang,Tian Xia,Zihan Ma,Zhaowei Li,Xiangrong Bi,Xiao Li,Xiaowei He
Main category: cs.CV
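TL;DR: Proposes a neuron-level analysis framework that combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to study multimodal information processing in vision-language models (VLMs), revealing similarities between ANs and biological neurons (BNs) in functional networks, redundancy, polarity patterns, and architectural influence.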
Details
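Motivation: Current understanding of the parallels between artificial neural networks (ANNs) and human brain processing is limited: unimodal ANN studies fail to capture the brain's inherently multimodal processing, while multimodal ANN studies focus on high-level model outputs and neglect the role of individual neurons. A new framework is needed to close these gaps.
Method: Fine-grained artificial neuron (AN) analysis is combined with fMRI-based voxel encoding to analyze two architecturally distinct vision-language models (CLIP and METER) and examine their multimodal information processing at the neuron level.
Result: (1) ANs successfully predict BN activity across multiple functional networks; (2) both ANs and BNs exhibit functional redundancy; (3) ANs show polarity patterns that parallel those of BNs; (4) different architectures yield different brain-like properties, with CLIP showing modality-specific specialization and METER producing unified cross-modal activation.
Conclusion: The study provides strong evidence for brain-like hierarchical processing in vision-language models at the neuronal level and highlights the influence of model architecture on brain-like properties.
Abstract: While brain-inspired artificial intelligence (AI) has demonstrated promising results, current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited: (1) unimodal ANN studies fail to capture the brain's inherent multimodal processing capabilities, and (2) multimodal ANN research primarily focuses on high-level model outputs, neglecting the crucial role of individual neurons. To address these limitations, we propose a novel neuron-level analysis framework that investigates the multimodal information processing mechanisms in vision-language models (VLMs) through the lens of human brain activity. Our approach uniquely combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to examine two architecturally distinct VLMs: CLIP and METER.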
Our analysis reveals four key findings: (1) ANs successfully predict the activity of biological neurons (BNs) across multiple functional networks (including language, vision, attention, and default mode), demonstrating shared representational mechanisms; (2) both ANs and BNs demonstrate functional redundancy through overlapping neural representations, mirroring the brain's fault-tolerant and collaborative information processing mechanisms; (3) ANs exhibit polarity patterns that parallel those of BNs, with oppositely activated BNs showing mirrored activation trends across VLM layers, reflecting the complexity and bidirectional nature of neural information processing; (4) the architectures of CLIP and METER drive distinct BNs: CLIP's independent branches show modality-specific specialization, whereas METER's cross-modal design yields unified cross-modal activation, highlighting the architecture's influence on ANN brain-like properties. These results provide compelling evidence for brain-like hierarchical processing in VLMs at the neuronal level.
[191] Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis
Nusrat Munia,Abdullah Imran
Main category: cs.CV
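TL;DR: Proposes Class-N-Diff, a classification-induced diffusion model that simultaneously generates and classifies dermoscopic images. Integrating a classifier into the diffusion model guides class-conditioned generation, yielding more realistic and diverse images and better performance on downstream diagnostic tasks.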
Details
Motivation: Traditional class-conditioned generative models struggle to generate images that accurately represent specific medical categories, which limits their usefulness in applications such as skin cancer diagnosis.
Method: A classifier is integrated into a diffusion model to guide image generation based on class conditions, giving more precise class-conditioned image synthesis.
Result: Class-N-Diff generates more realistic and diverse dermoscopic images, and the integrated classifier performs better on downstream diagnostic tasks.
Conclusion: By combining a classifier with a diffusion model, Class-N-Diff improves the quality and utility of generated medical images and is an effective way to make diffusion-based dermoscopic image generation more useful.
Abstract: Generative models, especially Diffusion Models, have demonstrated remarkable capability in generating high-quality synthetic data, including medical images. However, traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, limiting their usefulness for applications such as skin cancer diagnosis. To address this problem, we propose a classification-induced diffusion model, namely, Class-N-Diff, to simultaneously generate and classify dermoscopic images. Our Class-N-Diff model integrates a classifier within a diffusion model to guide image generation based on its class conditions. Thus, the model has better control over class-conditioned image synthesis, resulting in more realistic and diverse images. Additionally, the classifier demonstrates improved performance, highlighting its effectiveness for downstream diagnostic tasks. This unique integration in our Class-N-Diff makes it a robust tool for enhancing the quality and utility of diffusion model-based synthetic dermoscopic image generation. Our code is available at https://github.com/Munia03/Class-N-Diff.
[192] Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Zongjian Li,Zheyuan Liu,Qihui Zhang,Bin Lin,Shenghai Yuan,Zhiyuan Yan,Yang Ye,Wangbo Yu,Yuwei Niu,Li Yuan
Main category: cs.CV
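TL;DR: Proposes Edit-R1, a policy-optimization-based post-training framework for instruction-based image editing that combines DiffusionNFT with an MLLM reward model and achieves state-of-the-art results on multiple benchmarks.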
Details
Motivation: Models trained only with supervised fine-tuning tend to overfit to annotated patterns, generalize poorly, and struggle with editing tasks outside the training distribution.
Method: The framework uses DiffusionNFT, a likelihood-free policy optimization method compatible with higher-order samplers. A Multimodal Large Language Model (MLLM) serves as a unified, training-free reward model, and a low-variance group filtering mechanism reduces scoring noise.
Result: UniWorld-V2 achieves state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench. The framework also applies to other base models such as Qwen-Image-Edit and FLUX-Kontext and improves their performance.
Conclusion: By combining diffusion models with MLLM reward feedback, Edit-R1 enables efficient, general instruction-driven image editing with good extensibility and practicality.
Abstract: Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
[193] Contrail-to-Flight Attribution Using Ground Visible Cameras and Flight Surveillance Data
Ramon Dalmau,Gabriel Jarry,Philippe Very
Main category: cs.CV
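TL;DR: This paper proposes a contrail-to-flight attribution framework based on ground camera observations, combining imagery of high spatial and temporal resolution with flight and meteorological data to trace contrails back to their source aircraft.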
Details
Motivation: Because satellite observations are limited in spatial and temporal resolution, it is hard to attribute observed contrails to specific flights, whereas ground cameras capture clear images shortly after contrail formation. A more effective attribution method is therefore needed.
Method: Using the ground visible camera contrail sequences (GVCCS) dataset, a modular framework is built that combines multiple geometric representations, distance metrics, temporal smoothing, and probabilistic assignment strategies to match observed contrails with theoretical contrails.
Result: The framework links contrails observed from the ground to the flights that generated them and supports extension in future research.
Conclusion: The proposed modular framework establishes a strong baseline for contrail-to-flight attribution and helps improve the assessment of aviation's non-CO2 climate effects.
Abstract: Aviation's non-CO2 effects, particularly contrails, are a significant contributor to its climate impact. Persistent contrails can evolve into cirrus-like clouds that trap outgoing infrared radiation, with radiative forcing potentially comparable to or exceeding that of aviation's CO2 emissions. While physical models simulate contrail formation, evolution and dissipation, validating and calibrating these models requires linking observed contrails to the flights that generated them, a process known as contrail-to-flight attribution. Satellite-based attribution is challenging due to limited spatial and temporal resolution, as contrails often drift and deform before detection. In this paper, we evaluate an alternative approach using ground-based cameras, which capture contrails shortly after formation at high spatial and temporal resolution, when they remain thin, linear, and visually distinct. Leveraging the ground visible camera contrail sequences (GVCCS) dataset, we introduce a modular framework for attributing contrails observed using ground-based cameras to theoretical contrails derived from aircraft surveillance and meteorological data. The framework accommodates multiple geometric representations and distance metrics, incorporates temporal smoothing, and enables flexible probability-based assignment strategies. This work establishes a strong baseline and provides a modular framework for future research in linking contrails to their source flight.
[194] Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation
Akhila Kambhatla,Ahmed R Khaled
Main category: cs.CV
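TL;DR: This study examines transformer-based models for thermal weapon segmentation, comparing SegFormer, DeepLabV3+, SegNeXt, and Swin Transformer on a custom thermal dataset. The results show that transformer architectures perform well on this task, with good accuracy-speed trade-offs and generalization.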
Details
Motivation: Thermal weapon segmentation matters for security surveillance in low-light and occluded conditions, but existing CNN methods are limited in capturing long-range dependencies and fine detail. Vision Transformers perform well in RGB segmentation, yet their potential in the thermal domain remains underexplored.
Method: Four transformer-based architectures (SegFormer, DeepLabV3+, SegNeXt, Swin Transformer) are evaluated for binary weapon segmentation on a dataset of 9,711 thermal images extracted from real surveillance videos and automatically annotated with SAM2, using the MMSegmentation framework and standard data augmentation for training and evaluation.
Result: SegFormer-b5 achieves the highest mIoU (94.15%) and pixel accuracy (97.04%). SegFormer-b0 has the fastest inference (98.32 FPS) with 90.84% mIoU. SegNeXt-mscans balances speed (85.12 FPS) and accuracy (92.24% mIoU). DeepLabV3+ R101-D8 reaches 92.76% mIoU but is slower (29.86 FPS).
Conclusion: Transformer-based models clearly outperform traditional CNN methods on thermal weapon segmentation, with strong global context modeling and flexible deployment options suited to a range of real-time security applications.
Abstract: Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under low-light and visually obscured conditions where RGB-based systems fail. While convolutional neural networks (CNNs) dominate thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited. Vision Transformers (ViTs), with their global context modeling capabilities, have achieved state-of-the-art results in RGB segmentation tasks, yet their potential in thermal weapon segmentation remains underexplored. This work adapts and evaluates four transformer-based architectures (SegFormer, DeepLabV3+, SegNeXt, and Swin Transformer) for binary weapon segmentation on a custom thermal dataset comprising 9,711 images collected from real-world surveillance videos and automatically annotated using SAM2. We employ standard augmentation strategies within the MMSegmentation framework to ensure robust model training and fair architectural comparison. Experimental results demonstrate significant improvements in segmentation performance: SegFormer-b5 achieves the highest mIoU (94.15%) and Pixel Accuracy (97.04%), while SegFormer-b0 provides the fastest inference speed (98.32 FPS) with competitive mIoU (90.84%). SegNeXt-mscans offers balanced performance with 85.12 FPS and 92.24% mIoU, and DeepLabV3+ R101-D8 reaches 92.76% mIoU at 29.86 FPS.
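For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how mIoU and pixel accuracy are commonly computed for a binary (background vs. weapon) mask. It is an illustration only: the function names and the toy masks are hypothetical, and the paper's exact evaluation protocol (for example, whether counts are accumulated over the whole dataset before dividing) is not stated in the abstract.

```python
def confusion_counts(pred, target, num_classes):
    """Count, per class, the intersecting, predicted, and ground-truth pixels."""
    inter = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes
    for p, t in zip(pred, target):
        pred_area[p] += 1
        target_area[t] += 1
        if p == t:
            inter[p] += 1
    return inter, pred_area, target_area


def mean_iou(pred, target, num_classes=2):
    """Average the per-class IoU = intersection / union over classes that occur."""
    inter, pred_area, target_area = confusion_counts(pred, target, num_classes)
    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - inter[c]
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter[c] / union)
    return sum(ious) / len(ious)


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


# Hypothetical flattened masks: 0 = background, 1 = weapon.
pred = [0, 0, 1, 1, 1, 0, 0, 1]
target = [0, 0, 1, 1, 0, 0, 1, 1]
print(mean_iou(pred, target))        # 0.6
print(pixel_accuracy(pred, target))  # 0.75
```

In the toy example each class has 3 intersecting pixels out of a union of 5, so both per-class IoUs are 0.6, while 6 of the 8 pixels are labeled correctly. Pixel accuracy can therefore sit well above mIoU, which is consistent with the gap between the two figures reported for SegFormer-b5.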
The transformer architectures demonstrate robust generalization capabilities for weapon detection in low-light and occluded thermal environments, with flexible accuracy-speed trade-offs suitable for diverse real-time security applications.
[195] Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
Chenxu Li,Zhicai Wang,Yuan Sheng,Xingyu Zhu,Yanbin Hao,Xiang Wang
Main category: cs.CV
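TL;DR: This paper introduces Res-Bench, a new benchmark for evaluating how stable multimodal large language models are across input resolutions, comprising 14,400 samples and several robustness metrics.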
Details
Motivation: Current evaluation paradigms focus on semantic performance and overlook robustness across input resolutions.
Method: The Res-Bench benchmark covers 12 resolution levels and six core capability dimensions, and introduces new metrics, including Spearman's correlation and ACE/RCE, to measure performance stability.
Result: A large-scale evaluation of leading MLLMs analyzes model-level and task-level robustness, the effect of preprocessing strategies, and the use of fine-tuning to improve stability.
Conclusion: Res-Bench evaluates MLLM robustness under dynamic resolution and exposes the performance fluctuations of current models when resolution changes.
Abstract: Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
[196] Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis
Praveenbalaji Rajendran,Mojtaba Safari,Wenfeng He,Mingzhe Hu,Shansong Wang,Jun Zhou,Xiaofeng Yang
Main category: cs.CV
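TL;DR: This paper reviews recent progress on foundation models (FMs) in medical image analysis. It systematically categorizes vision-only and vision-language FMs, uses a meta-analysis to reveal trends in dataset use and application domains, and discusses challenges such as domain adaptation and efficient fine-tuning along with future research directions.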
Details
Motivation: Although FMs are advancing rapidly in medical imaging, the field remains fragmented and lacks a unified review that systematically maps the evolution of architectures, training paradigms, and clinical applications.
Method: Existing studies are categorized into vision-only and vision-language FMs, and a quantitative meta-analysis characterizes temporal trends, dataset usage, and the distribution of application domains.
Result: The review identifies the mainstream architectures and application trends of FMs in medical image analysis, summarizes key challenges such as domain adaptation, computational constraints, and interpretability, and outlines emerging solutions including federated learning, knowledge distillation, and advanced prompting.
Conclusion: Future research should focus on improving the robustness, interpretability, and clinical integration of FMs to accelerate their translation into real-world medical practice.
Abstract: Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability, along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting.
Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.
[197] One-step Diffusion Models with Bregman Density Ratio Matching
Yuanzhi Zhu,Eleftherios Tsonis,Lucas Degeorge,Vicky Kalogeiton
Main category: cs.CV
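TL;DR: This paper proposes Di-Bregman, a theoretically grounded framework that unifies diffusion distillation as Bregman divergence-based density-ratio matching, and shows better one-step generation than reverse-KL distillation on CIFAR-10 and text-to-image tasks.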
Details
Motivation: Diffusion models generate high-quality samples but sample slowly, and existing distillation methods lack a unified theoretical foundation, so a principled and efficient framework for one-step distillation is needed.
Method: The paper proposes Di-Bregman, which takes a convex-analytic view and formulates diffusion distillation as density-ratio matching under a Bregman divergence, connecting several existing objectives through a common lens.
Result: On CIFAR-10 and text-to-image generation, Di-Bregman achieves better one-step FID than reverse-KL distillation and maintains high visual fidelity compared to the teacher model.
Conclusion: Bregman density-ratio matching offers a practical and theoretically grounded route toward efficient one-step diffusion generation.
Abstract: Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.
[198] CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
Junhao Zhao,Zishuai Liu,Ruili Fang,Jin Lu,Linghan Zhang,Fei Dou
Main category: cs.CV
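TL;DR: Proposes CARE, an end-to-end framework that jointly optimizes Sequence-Image Contrastive Alignment (SICA) and classification to recognize activities of daily living from event-triggered sensor streams, substantially improving robustness and performance.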
Details
Motivation: Existing methods are limited at the representation level: sequence-based approaches lack spatial awareness, image-based approaches lose fine-grained temporal dynamics, and naive fusion fails to align the two representations.
Method: CARE combines time-aware, noise-resilient sequence encoding with spatially-informed, frequency-sensitive image representations, and trains them with a joint objective of Sequence-Image Contrastive Alignment (SICA) and cross-entropy to obtain aligned and discriminative embeddings.
Result: CARE achieves state-of-the-art performance on three CASAS datasets (89.8% on Milan, 88.9% on Cairo, 73.3% on Kyoto7) and is robust to sensor malfunctions and layout changes.
Conclusion: CARE combines the strengths of sequence and image representations and, through contrastive alignment, improves the accuracy and reliability of ADL recognition, making it suitable for real smart-home environments.
Abstract: The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fails to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
[199] Training-free Online Video Step Grounding
Luca Zanella,Massimiliano Mancini,Yiming Wang,Alessio Tonioni,Elisa Ricci
Main category: cs.CV
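TL;DR: This paper proposes a training-free, online method for video step grounding that uses Large Multimodal Models (LMMs) for zero-shot prediction, and introduces BaGLM, a framework based on Bayesian filtering that outperforms existing offline, training-based methods on three datasets.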
Details
Motivation: Standard video step grounding methods depend on labeled data and offline processing of the full video, which is costly and ill-suited to scenarios that require online decisions; the paper therefore explores training-free, online inference.
Method: LMMs predict the step for a restricted set of frames in a zero-shot manner; BaGLM then applies Bayesian filtering principles to inject knowledge of past frames, using a step dependency matrix extracted with large language models and an estimate of step progress.
Result: The online, training-free strategy already outperforms offline, training-based models, and BaGLM further improves performance on three datasets.
Conclusion: BaGLM achieves efficient, training-free online video step grounding by combining the zero-shot capabilities of LMMs with Bayesian inference, offering a more flexible solution for practical applications.
Abstract: Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.
[200] An empirical study of the effect of video encoders on Temporal Video Grounding
Ignacio M. De la Jara,Cristian Rodriguez-Opazo,Edison Marrese-Taylor,Felipe Bravo-Marquez
Main category: cs.CV
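TL;DR: This paper presents an empirical study of how different video features affect a classical architecture; using CNN-based, temporal-reasoning, and transformer-based video encoders on three benchmarks, it reveals their impact on model performance and their potential complementarity.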
Details
Motivation: Existing research concentrates on a small selection of video representations, which may lead to architectural overfitting in the long run.
Method: Features are extracted for three benchmarks (Charades-STA, ActivityNet-Captions, and YouCookII) using video encoders based on CNNs, temporal reasoning, and transformers, and are evaluated on a classical architecture.
Result: Simply changing the video encoder significantly changes model performance, and certain features produce clear patterns and errors.
Conclusion: Different video features have a significant effect on model performance and may be complementary; future work should explore a broader range of video representations.
Abstract: Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work on this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.
[201] Do Satellite Tasks Need Special Pretraining?
Ani Vanyan,Alvard Barseghyan,Hakob Tamazyan,Tigran Galstyan,Vahan Huroyan,Naira Hovakimyan,Hrant Khachatrian
Main category: cs.CV
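TL;DR: This paper examines whether specialized remote sensing foundation models outperform general-purpose vision foundation models, and finds that at the ViT-B scale the specialized pretrained models do not consistently beat general-purpose ones.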
Details
Motivation: The distinct characteristics, specific applications, and robustness requirements of remote sensing imagery have motivated specialized foundation models, but this paper questions their advantage over general-purpose vision models.
Method: The authors design a benchmark that measures how well models generalize to lower-resolution images, and train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several remote-sensing-specific modifications.
Result: At the ViT-B scale, none of these specialized pretrained models brings consistent improvements on the two downstream tasks.
Conclusion: At least at small scale, specialized remote sensing foundation models are not clearly better than general-purpose vision foundation models.
Abstract: Foundation models have advanced machine learning across various modalities, including images. Recently multiple teams trained foundation models specialized for remote sensing applications. This line of research is motivated by the distinct characteristics of remote sensing imagery, specific applications and types of robustness useful for satellite image analysis. In this work we systematically challenge the idea that specific foundation models are more useful than general-purpose vision foundation models, at least in the small scale. First, we design a simple benchmark that measures generalization of remote sensing models towards images with lower resolution for two downstream tasks. Second, we train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several modifications specific to remote sensing. We show that none of those pretrained models bring consistent improvements upon general-purpose baselines at the ViT-B scale.
[202] Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
Shraman Pramanick,Effrosyni Mavroudi,Yale Song,Rama Chellappa,Lorenzo Torresani,Triantafyllos Afouras
Main category: cs.CV
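TL;DR: This paper proposes ED-VTG, a method for fine-grained video temporal grounding with multimodal large language models that localizes natural language queries in videos through a two-stage process.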
Details
Motivation: Existing video temporal grounding methods struggle with fine-grained queries and are vulnerable to hallucinations and noise, calling for stronger multimodal understanding and context modeling.
Method: A two-stage approach first transforms the language query into an enriched sentence with additional details, then uses a lightweight decoder to predict accurate temporal boundaries conditioned on contextualized representations of the enriched query; a multiple-instance-learning objective dynamically selects the best version of the query to reduce noise and hallucinations.
Result: ED-VTG achieves state-of-the-art results on several temporal video grounding and paragraph grounding benchmarks, significantly outperforms earlier LLM-based methods, and shows a clear advantage in zero-shot scenarios.
Conclusion: By combining multimodal LLMs with context-aware query enrichment, ED-VTG delivers efficient and robust video temporal grounding, particularly in zero-shot settings.
Abstract: We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multimodal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes in predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.
[203] Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding
Yutong Zhong
Main category: cs.CV
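TL;DR: This paper proposes What-Where Representation Re-Forming (W2R2), a training framework that addresses the "2D semantic bias" of vision-language models in multimodal 3D grounding; through disentangled representation learning and targeted shortcut suppression, it significantly improves 3D grounding accuracy and robustness in complex environments.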
Details
Motivation: Existing vision-language models over-rely on 2D image features for coarse localization and largely ignore 3D geometric information, which leads to poor multimodal fusion and a severe "2D semantic bias".
Method: W2R2 treats 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, using disentangled representation learning for precise 3D grounding; its dual-objective loss combines an Alignment Loss for multimodal supervision with a Pseudo-Label Loss that penalizes 2D-dominant pseudo-outputs through a margin-based mechanism.
Result: Experiments on ScanRefer and ScanQA show significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.
Conclusion: W2R2 mitigates the 2D semantic bias of vision-language models and improves multimodal 3D grounding by reshaping the model's internal representation space without modifying the inference architecture.
Abstract: Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.
[204] Conditional Synthetic Live and Spoof Fingerprint Generation
Syed Konain Abbas,Sandip Purnapatra,M. G. Sarwar Murshed,Conor Miller-Lynch,Lambert Igene,Soumyabrata Dey,Stephanie Schuckers,Faraz Hussain
Main category: cs.CV
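TL;DR: This paper proposes a method that uses conditional generative adversarial networks (StyleGAN2-ADA, StyleGAN3, and CycleGAN) to generate high-resolution synthetic live and spoof fingerprint images, addressing the privacy, cost, and accessibility limits of real fingerprint datasets.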
Details
Motivation: Collecting real fingerprints at scale is slow, expensive, and privacy-sensitive, and diverse spoof samples for liveness detection research are scarce, so high-quality, privacy-safe synthetic fingerprint data is needed.
Method: Conditional StyleGAN2-ADA and StyleGAN3 generate high-resolution synthetic live fingerprints conditioned on finger identity, and CycleGANs translate them into realistic spoof fingerprints in various materials (e.g., EcoFlex, Play-Doh). Two synthetic datasets (DB2 and DB3) were built, each with 1,500 fingerprint images and corresponding spoofs in eight material types.
Result: The StyleGAN3 model reaches an FID as low as 5 and a TAR of 99.47% at FAR = 0.01%; StyleGAN2-ADA reaches a TAR of 98.67%. NFIQ2 and MINDTCT assessments indicate good image quality, and matching experiments find no significant identity leakage.
Conclusion: The proposed generation method efficiently produces high-quality, diverse synthetic fingerprint data that preserves privacy while remaining biometrically useful, making it suitable for training and evaluating fingerprint recognition and liveness detection systems.
Abstract: Large fingerprint datasets, while important for training and evaluation, are time-consuming and expensive to collect and require strict privacy measures. Researchers are exploring the use of synthetic fingerprint data to address these issues. This paper presents a novel approach for generating synthetic fingerprint images (both spoof and live), addressing concerns related to privacy, cost, and accessibility in biometric data collection. Our approach utilizes conditional StyleGAN2-ADA and StyleGAN3 architectures to produce high-resolution synthetic live fingerprints, conditioned on specific finger identities (thumb through little finger). Additionally, we employ CycleGANs to translate these into realistic spoof fingerprints, simulating a variety of presentation attack materials (e.g., EcoFlex, Play-Doh). These synthetic spoof fingerprints are crucial for developing robust spoof detection systems. Through these generative models, we created two synthetic datasets (DB2 and DB3), each containing 1,500 fingerprint images of all ten fingers with multiple impressions per finger, and including corresponding spoofs in eight material types. The results indicate robust performance: our StyleGAN3 model achieves a Fréchet Inception Distance (FID) as low as 5, and the generated fingerprints achieve a True Accept Rate (TAR) of 99.47% at a 0.01% False Accept Rate (FAR). The StyleGAN2-ADA model achieved a TAR of 98.67% at the same 0.01% FAR.
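For readers unfamiliar with the metric, a TAR at a fixed FAR can be computed from matcher scores roughly as sketched below. The function name `tar_at_far`, the score arrays, and the strict-threshold convention are assumptions made for illustration; the matcher and evaluation protocol used in the paper are not described here.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-4):
    """Return (TAR, threshold) where the threshold keeps the FAR at or below `far`.

    Hypothetical helper: higher scores are assumed to mean "more similar".
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    # Sort impostor scores from highest to lowest.
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]
    # Number of impostor comparisons that may be (wrongly) accepted.
    allowed = int(np.floor(far * impostor.size))
    # Accept only scores strictly above the (allowed + 1)-th highest impostor score,
    # so at most `allowed` impostor comparisons pass.
    threshold = impostor[allowed] if allowed < impostor.size else -np.inf
    tar = float(np.mean(genuine > threshold))
    return tar, float(threshold)

if __name__ == "__main__":
    # Toy scores, not data from the paper.
    tar, thr = tar_at_far([0.9, 0.8, 0.7, 0.2], [0.1, 0.3, 0.5, 0.6], far=0.25)
    print(tar, thr)
```

A FAR of 0.01% corresponds to `far=1e-4`, which only becomes meaningful with at least ten thousand impostor comparisons.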
We assess fingerprint quality using standard metrics (NFIQ2, MINDTCT), and matching experiments show no significant evidence of identity leakage, confirming the strong privacy-preserving properties of our synthetic datasets.
[205] Click, Predict, Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework
Mohammad R. Salmanpour,Sonya Falahati,Amir Hossein Pouria,Amin Mousavi,Somayeh Sadat Mehrnia,Morteza Alizadeh,Arman Gorji,Zeinab Farsangi,Alireza Safarian,Mehdi Maghsudi,Carlos Uribe,Arman Rahmim,Ren Yuan
Main category: cs.CV
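TL;DR: This study develops a clinician-in-the-loop deep learning pipeline to improve the reproducibility, prognostic accuracy, and clinical trust of lung cancer CT segmentation, and finds that VNet combined with semi-supervised learning performs best.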
Details
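Motivation: Lung cancer is the leading cause of cancer death and CT imaging is central to screening and treatment, but manual segmentation is slow and variable, and existing deep learning methods see limited clinical adoption, so a more reliable and trustworthy automated approach is needed.
Method: Guided by the Knowledge-to-Action framework, the study uses multi-center CT data from 999 patients across 12 public datasets to compare five deep learning models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D) on whole and click-point cropped images; reproducibility is assessed with 497 radiomic features, prognostic prediction is compared across 38 dimensionality reduction methods and 24 classifiers under supervised and semi-supervised learning, and six physicians qualitatively rate the segmentations across seven domains.
Result: VNet performs best (Dice = 0.83, IoU = 0.71), with high radiomic stability (mean correlation = 0.76, ICC = 0.65) and the highest predictive accuracy under semi-supervised learning (accuracy = 0.88, F1 = 0.83); semi-supervised learning outperforms supervised learning overall; radiologists prefer VNet's peritumoral representation and boundary quality, and prefer refining AI-generated initial masks over full replacement.
Conclusion: A clinician-in-the-loop deep learning workflow that combines VNet with semi-supervised learning yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, offering a feasible path toward physician-centered AI translation.
Abstract: Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models.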
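As a point of reference for the Dice and IoU figures above, here is a minimal sketch of how the two overlap scores are commonly computed for binary masks. The function name `dice_and_iou` and the convention of scoring two empty masks as 1.0 are assumptions for illustration, not details taken from the study.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of identical shape.

    Hypothetical helper; two empty masks are scored as a perfect match.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dice = 2.0 * intersection / total if total else 1.0
    iou = intersection / union if union else 1.0
    return float(dice), float(iou)

if __name__ == "__main__":
    # Toy 1-D "masks", not data from the study.
    print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

For a single pair of masks the two scores are linked by IoU = Dice / (2 - Dice), so a Dice of 0.83 corresponds to an IoU of about 0.71; averages over many cases need not satisfy this relation exactly.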
Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.
[206] Person Re-Identification via Generalized Class Prototypes
Md Ahmed Al Muzaddid,William J. Beksi
Main category: cs.CV
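TL;DR: This paper proposes a generalized method for selecting class representatives in person re-identification that goes beyond the usual centroid representation; by flexibly adjusting the number of representations per class, it balances accuracy and mean average precision and substantially improves re-identification on multiple embedding models.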
Details
Motivation: Selecting better class representatives is an overlooked but important direction in person re-identification; existing methods mostly rely on class centroids, which are suboptimal at the retrieval stage and limit performance.
Method: The paper proposes a generalized selection method that is not limited to class centroids, allows the number of representations per class to be adjusted to application requirements, and is applied on top of multiple re-identification embeddings.
Result: The method substantially improves re-identification performance on multiple baseline embeddings, raising both accuracy and mean average precision beyond contemporary results.
Conclusion: Flexible selection of class representatives effectively improves person re-identification; the method is general and practical and points to a new direction for optimization.
Abstract: Advanced feature extraction methods have significantly contributed to enhancing the task of person re-identification. In addition, modifications to objective functions have been developed to further improve performance. Nonetheless, selecting better class representatives is an underexplored area of research that can also lead to advancements in re-identification performance. Although past works have experimented with using the centroid of a gallery image class during training, only a few have investigated alternative representations during the retrieval stage. In this paper, we demonstrate that these prior techniques yield suboptimal results in terms of re-identification metrics. To address the re-identification problem, we propose a generalized selection method that involves choosing representations that are not limited to class centroids. Our approach strikes a balance between accuracy and mean average precision, leading to improvements beyond the state of the art. For example, the actual number of representations per class can be adjusted to meet specific application requirements. We apply our methodology on top of multiple re-identification embeddings, and in all cases it substantially improves upon contemporary results.
[207] Video Reasoning without Training
Deepak Sridhar,Kartikeya Bhardwaj,Jeya Pradha Jeyaraj,Nuno Vasconcelos,Ankita Nayak,Harris Teague
Main category: cs.CV
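TL;DR: This paper proposes V-Reason, a video reasoning method that requires no reinforcement learning or supervised fine-tuning; it uses an entropy signal to steer the thinking process of large multimodal models at inference time, improving micro-exploration and micro-exploitation behavior, sharply reducing output tokens, and approaching the accuracy of RL-trained models.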
Details
Motivation: Existing video reasoning methods rely on costly reinforcement learning and verbose chain-of-thought, incur heavy computational overhead, and offer limited mechanisms for controlling the model's thinking process, so an efficient way to improve reasoning without training is needed.
Method: Using the entropy of the model's output as a signal, the authors observe micro-exploration and micro-exploitation behavior in high-quality models, and design a small trainable controller that adapts the LMM's value cache at inference time through an entropy-based objective, dynamically adjusting model behavior.
Result: V-Reason significantly outperforms base instruction-tuned models on several video reasoning datasets, comes within 0.6% average accuracy of RL-trained models, and reduces output tokens by 58.6%.
Conclusion: Entropy-driven control of the reasoning process can improve the video reasoning performance and efficiency of large multimodal models without any training, offering a new paradigm for efficient reasoning.
Abstract: Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference.
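The entropy signal described above can be sketched as follows. This minimal example only computes the Shannon entropy of a next-token distribution from raw logits; the function name `token_entropy` is hypothetical, and the controller and value-cache optimization of V-Reason are not reproduced here.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis.

    Hypothetical helper: accepts a single logit vector or a batch of them.
    """
    logits = np.asarray(logits, dtype=float)
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

if __name__ == "__main__":
    # A flat distribution has high entropy; a peaked one has entropy near zero.
    print(token_entropy([[0.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0, 0.0]]))
```

Under this reading, a high value marks a step where the model is uncertain (exploring) and a low value marks a step where it has committed (exploiting).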
Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
[208] How Universal Are SAM2 Features?
Masoud Khairi Atani,Alon Harell,Hyomin Choi,Runyu Yang,Fabien Racape,Ivan V. Bajic
Main category: cs.CV
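TL;DR: This paper compares the feature versatility of the general-purpose vision model Hiera with the segmentation-specialized SAM2; SAM2 excels on spatially-related tasks but degrades on conceptually distant ones (such as pose estimation and image captioning), revealing the loss of semantic information caused by specialization and the representational bottleneck introduced by each level of adaptation.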
Details
Motivation: To examine the trade-off between general-purpose foundation vision models and specialized models, and to understand the cost of specialization for feature coding design.
Method: A lightweight, trainable neck probes the frozen features of the Hiera and SAM2 encoders, an information-theoretic approach quantifies the cost of specialization, and a cross-neck analysis studies the representational bottleneck at each level.
Result: SAM2 outperforms Hiera on spatially-related tasks such as depth estimation but underperforms on pose estimation and image captioning; each level of adaptation further tightens the representational bottleneck and reduces general semantic information.
Conclusion: Specialization improves performance on specific tasks at the expense of feature universality; efficiency and adaptability must be traded off in downstream applications, and the analysis provides a quantitative basis for efficient feature coding design.
Abstract: The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.
[209] ProDAT: Progressive Density-Aware Tail-Drop for Point Cloud Coding
Zhe Luo,Wenjing Jia,Stuart Perry
Main category: cs.CV
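TL;DR: This paper proposes ProDAT, a density-aware tail-drop mechanism that enables progressive point cloud coding with a single model and achieves better coding efficiency than existing learning-based methods on several datasets.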
Details
Motivation: Existing learning-based point cloud coding methods do not support progressive decoding, which makes it hard to meet the needs of low-latency and resource-constrained scenarios.
Method: ProDAT uses density information as a guidance signal to decode latent features and coordinates adaptively, enabling progressive decoding at multiple bitrates.
Result: ProDAT achieves BD-rate improvements of over 28.6% on SemanticKITTI and over 18.15% on ShapeNet, supporting progressive decoding with higher coding efficiency.
Conclusion: ProDAT resolves the lack of progressive decoding in learned point cloud coding while remaining efficient and practical, making it suitable for resource-constrained environments.
Abstract: Three-dimensional (3D) point clouds are becoming increasingly vital in applications such as autonomous driving, augmented reality, and immersive communication, demanding real-time processing and low latency. However, their large data volumes and bandwidth constraints hinder the deployment of high-quality services in resource-limited environments. Progressive coding, which allows for decoding at varying levels of detail, provides an alternative by allowing initial partial decoding with subsequent refinement. Although recent learning-based point cloud geometry coding methods have achieved notable success, their fixed latent representation does not support progressive decoding. To bridge this gap, we propose ProDAT, a novel density-aware tail-drop mechanism for progressive point cloud coding. By leveraging density information as a guidance signal, latent features and coordinates are decoded adaptively based on their significance, therefore achieving progressive decoding at multiple bitrates using one single model. Experimental results on benchmark datasets show that the proposed ProDAT not only enables progressive coding but also achieves superior coding efficiency compared to state-of-the-art learning-based coding techniques, with over 28.6% BD-rate improvement for PSNR-D2 on SemanticKITTI and over 18.15% for ShapeNet.
[210] Towards a Generalizable Fusion Architecture for Multimodal Object Detection
Jad Berjawi,Yoann Dupas,Christophe Cérin
Main category: cs.CV
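TL;DR: This paper proposes FMCAF, a fusion architecture for multimodal object detection that uses frequency-domain filtering and cross-modal attention to improve the fusion of RGB and infrared images; it generalizes well and clearly outperforms traditional fusion on the VEDAI and LLVIP datasets.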
Details
Motivation: To make multimodal object detection more robust in challenging conditions and to address the limited generalization of existing methods.
Method: Filtered Multi-Modal Cross Attention Fusion (FMCAF) combines a frequency-domain filtering block (Freq-Filter) that removes redundant spectral features with a cross-attention-based fusion module (MCAF) that strengthens feature sharing between modalities, without dataset-specific tuning.
Result: FMCAF improves over traditional fusion by 13.9% mAP@50 on VEDAI and by 1.1% mAP@50 on LLVIP, supporting its effectiveness and applicability across scenarios.
Conclusion: FMCAF is a multimodal fusion framework with strong generalization and could serve as a general building block for future robust detection systems.
Abstract: Multimodal object detection improves robustness in challenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi-Modal Cross Attention Fusion (FMCAF), a preprocessing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency-domain filtering block (Freq-Filter) to suppress redundant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset-specific tuning. On LLVIP (low-light pedestrian detection) and VEDAI (aerial vehicle detection), FMCAF outperforms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.
[211] GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation
Ruitong Gan,Junran Peng,Yang Liu,Chuanchen Luo,Qing Li,Zhaoxiang Zhang
Main category: cs.CV
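TL;DR: This paper proposes GSPlane, which introduces planar priors to improve the geometric accuracy and smoothness of Gaussian Splatting when reconstructing planar regions, while producing clean, well-structured mesh connectivity and supporting decoupling and flexible manipulation of objects on supporting planes.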
Details
Motivation: Existing Gaussian Splatting methods often reconstruct planar regions without sufficient smoothness and precision, falling short of the needs of downstream applications such as scene editing and physical simulation.
Method: Off-the-shelf segmentation and normal prediction models provide planar priors, which are used to build a structured planar Gaussian representation that guides training; a Dynamic Gaussian Re-classifier adaptively reclassifies planar Gaussians with persistently high gradients as non-planar; and the optimized planar priors are used to refine the mesh layout.
Result: Without sacrificing rendering quality, GSPlane significantly improves the geometric accuracy of the extracted meshes, improves their topology, and reduces the number of vertices and faces.
Conclusion: GSPlane strengthens Gaussian Splatting on planar regions, yielding more accurate and better-structured 3D scene representations for applications that need structured planes.
Abstract: Planes are fundamental primitives of 3D scenes, especially in man-made environments such as indoor spaces and urban streets. Representing these planes in a structured and parameterized format facilitates scene editing and physical simulations in downstream applications. Recently, Gaussian Splatting (GS) has demonstrated remarkable effectiveness in the Novel View Synthesis task, with extensions showing great potential in accurate surface reconstruction. However, even state-of-the-art GS representations often struggle to reconstruct planar regions with sufficient smoothness and precision. To address this issue, we propose GSPlane, which recovers accurate geometry and produces clean and well-structured mesh connectivity for plane regions in the reconstructed scene. By leveraging off-the-shelf segmentation and normal prediction models, GSPlane extracts robust planar priors to establish structured representations for planar Gaussian coordinates, which help guide the training process by enforcing geometric consistency. To further enhance training robustness, a Dynamic Gaussian Re-classifier is introduced to adaptively reclassify planar Gaussians with persistently high gradients as non-planar, ensuring more reliable optimization. Furthermore, we utilize the optimized planar priors to refine the mesh layouts, significantly improving topological structure while reducing the number of vertices and faces. We also explore applications of the structured planar representation, which enable decoupling and flexible manipulation of objects on supportive planes.
Extensive experiments demonstrate that, with no sacrifice in rendering quality, the introduction of planar priors significantly improves the geometric accuracy of the extracted meshes across various baselines.
[212] Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement
Xiaogang Xu,Jian Wang,Yunfan Lu,Ruihang Chu,Ruixing Wang,Jiafei Wu,Bei Yu,Liang Lin
Main category: cs.CV
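TL;DR: Proposes a new optimization strategy for strengthening conditioning in pre-trained diffusion models; by introducing a latent refinement pipeline and a dynamic interaction mechanism, it significantly improves content fidelity in low-light image restoration.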
Details
Motivation: Existing methods built on pre-trained diffusion models often sacrifice content fidelity for perceptual realism, and control is especially weak in low-light conditions where information is severely degraded.
Method: A latent refinement pipeline incorporating generative priors recovers spatial details lost during VAE encoding, and the conditional latent interacts bidirectionally and dynamically with the noisy latent.
Result: Experiments show significantly improved content fidelity on several low-level vision tasks while preserving visual realism and aesthetic quality.
Conclusion: The method is a plug-and-play module that integrates into existing diffusion networks, strengthens control over the generation process, and addresses fidelity loss in low-light image restoration.
Abstract: Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.
[213] Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras
Hodaka Kawachi,Tomoya Nakamura,Hiroaki Santo,SaiKiran Kumar Tedla,Trevor Dalton Canham,Yasushi Yagi,Michael S. Brown
Main category: cs.CV
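TL;DR: This paper proposes a method that uses LED environmental lighting to produce visually imperceptible watermarks for consumer cameras; by optimizing the spectral profile of the LED source, the watermark is nearly invisible to the human eye yet readily detectable by ordinary cameras.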
Details
Motivation: To support copyright protection and provenance verification of image content without disturbing human vision, a covert yet reliable way of embedding watermarks is needed.
Method: The method jointly considers the human eye's sensitivity to the visible spectrum, the spectral response of consumer camera sensors, and the ability of narrowband LEDs to produce broadband "white light" (such as D65 illumination); it uses spectral rather than intensity modulation to keep the watermark imperceptible and supports extraction at standard frame rates.
Result: 128 bits can be embedded in a 10-second video clip; the watermark is imperceptible to the human eye but can still be detected and extracted by ordinary consumer cameras.
Conclusion: The method embeds covert watermarks under common shooting conditions without affecting visual quality, and is suitable for applications such as privacy protection and content authenticity verification.
Abstract: This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source's spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system's sensitivity to visible spectra, modern consumer camera sensors' spectral sensitivity, and narrowband LEDs' ability to generate broadband spectra perceived as "white light" (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest (128 bits embedded within a 10-second video clip), this capacity is sufficient for essential metadata supporting privacy protection and content verification.
[214] GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
Xin Gao,Jiyao Liu,Guanghao Li,Yueming Lyu,Jianxiong Gao,Weichen Yu,Ningsheng Xu,Liang Wang,Caifeng Shan,Ziwei Liu,Chenyang Si
Main category: cs.CV
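TL;DR: Proposes GOOD, a framework that guides diffusion sampling trajectories with dual-level guidance to generate diverse and controllable OOD samples, significantly improving OOD detection performance.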
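One ingredient named in this entry's Method, the k-NN distance in a classifier's latent space, can be illustrated with a small sketch. This is not the authors' implementation: the function name `knn_distance_score`, the L2 normalisation, and the default `k=5` are assumptions made for illustration, and the diffusion guidance itself (which would need gradients of such a score) is not shown.

```python
import numpy as np

def knn_distance_score(features, bank, k=5):
    """Distance from each query feature to its k-th nearest neighbour in `bank`.

    features: (n_query, d) array; bank: (n_bank, d) array of in-distribution features.
    Larger scores mean the query sits in a sparser region of feature space.
    """
    # L2-normalise so distances depend on direction, not feature magnitude.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    # Pairwise Euclidean distances, shape (n_query, n_bank).
    dists = np.linalg.norm(f[:, None, :] - b[None, :, :], axis=2)
    # k-th smallest distance per query (k is 1-indexed).
    return np.sort(dists, axis=1)[:, k - 1]

# Toy data: the bank clusters around the direction (1, 0).
rng = np.random.default_rng(0)
bank = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((200, 2))
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = knn_distance_score(queries, bank, k=5)
print(scores)  # the second query, far from the cluster, gets the larger score
```

A score of this kind grows as a sample moves away from densely populated feature regions, which is the property the feature-level guidance is described as exploiting.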
Details
Motivation: Existing OOD sample generation methods based on text-to-image diffusion models suffer from semantic instability and insufficient diversity of distribution shift, which limits generalization to realistic scenarios.
Method: GOOD combines image-level guidance (based on the gradient of the log partition function) with feature-level guidance (based on k-nearest-neighbor distance in the classifier's latent space) to steer the diffusion process directly toward low-density and feature-sparse regions, and introduces a unified adaptive OOD score.
Result: Experiments show that training with GOOD-generated samples significantly improves OOD detection performance, with stronger robustness and diversity in both quantitative and qualitative analyses.
Conclusion: Through its dual-level guidance, GOOD achieves more controllable and more diverse OOD sample generation and strengthens detection of real OOD data.
Abstract: Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance, based on the gradient of the log partition function to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
[215] $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
Yingqi Fan,Anhao Zhao,Jinlan Fu,Junlong Tong,Hui Su,Yijie Pan,Wei Zhang,Xiaoyu Shen
Main category: cs.CV
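TL;DR: This paper proposes VisiPruner, a training-free pruning framework grounded in a systematic analysis of the three-stage cross-modal interaction process in multimodal large language models (MLLMs); it significantly reduces vision-related computation and offers guidance for training efficient MLLMs.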
Details
Motivation: Existing MLLMs incur high computational cost when processing multimodal information, and a fundamental understanding of how they fuse it is lacking, so a more efficient pruning method grounded in mechanistic understanding is needed.
Method: Systematic analysis reveals a three-stage cross-modal interaction process: shallow layers recognize task intent, middle layers abruptly fuse critical visual information, and deep layers discard visual information to refine language. Based on this, VisiPruner dynamically identifies and retains critical visual tokens, achieving training-free attention pruning.
Result: On LLaVA-v1.5 7B, VisiPruner reduces up to 99% of vision-related attention computation and 53.9% of FLOPs, outperforms existing pruning methods, and generalizes well across multiple MLLMs.
Conclusion: Understanding the layer-wise processing of MLLMs is key to improving efficiency; VisiPruner delivers effective compute reduction and provides actionable architectural directions for designing efficient multimodal models.
Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.
[216] KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation
WenBo Xu,Liu Liu,Li Zhang,Ran Zhang,Hao Wu,Dan Guo,Meng Wang
Main category: cs.CV
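TL;DR: Proposes KineDiff3D, a unified framework for category-level articulated object shape reconstruction and generation that combines a kinematic-aware VAE with conditional diffusion models to reconstruct objects accurately from single-view input and estimate their kinematic parameters.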
Details
Motivation: Articulated objects are challenging for 3D reconstruction and pose estimation because of their multi-part geometry and variable joint configurations; existing methods struggle with this structural diversity.
Method: A Kinematic-Aware VAE (KA-VAE) encodes complete geometry (SDF), joint angles, and part segmentation into a structured latent space. Two conditional diffusion models are used: one regresses global pose (SE(3)) and joint parameters, the other generates the latent code from partial observations. An iterative optimization module bidirectionally refines the reconstruction and the kinematic parameters by minimizing Chamfer distance.
Result: Experiments on synthetic, semi-synthetic, and real-world datasets show that the method reconstructs articulated objects and estimates their kinematic properties effectively and accurately.
Conclusion: By combining kinematic-aware representations with diffusion models, KineDiff3D performs strongly on single-view reconstruction and pose estimation of category-level articulated objects.
Abstract: Articulated objects, such as laptops and drawers, pose significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and estimating their pose from single-view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we introduce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.
[217] UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Yuhao Yang,Zhen Yang,Zi-Yi Dou,Anh Nguyen,Keen You,Omar Attia,Andrew Szot,Michael Feng,Ram Ramrakhya,Alexander Toshev,Chao Huang,Yinfei Yang,Zhe Gan
Main category: cs.CV
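TL;DR: This paper presents UltraCUA, a hybrid-action foundation model that combines GUI operations with high-level programmatic tool calls, significantly improving the performance and efficiency of computer-use agents.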
Details
Motivation: Existing multimodal computer-use agents depend on low-level GUI operations, which are prone to error accumulation and performance bottlenecks caused by inaccurate visual grounding and long execution chains, and they make little use of high-level programmatic interfaces.
Method: The approach has four parts: (1) automatically scaling programmatic tools from software documentation, open-source repositories, and code generation; (2) building a synthetic data engine with more than 17,000 verifiable tasks; (3) collecting large-scale, high-quality hybrid-action trajectories (both low-level GUI actions and high-level tool calls); (4) two-stage training (supervised fine-tuning plus online reinforcement learning) to enable strategic switching between low-level and high-level actions.
Result: On OSWorld, UltraCUA improves over base models by 22% on average while using 11% fewer steps; on the out-of-domain WindowsAgentArena it reaches a 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism reduces error propagation and improves execution efficiency.
Conclusion: By fusing low-level GUI operations with high-level programmatic tool calls, UltraCUA overcomes the limitations of conventional computer-use agents and shows stronger generalization and execution efficiency, pointing to a new direction for future intelligent agents.
Abstract: Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data.
The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
[218] GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image
Yinghui Wang,Xinyu Zhang,Peng Du
Main category: cs.CV
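TL;DR: Proposes GACO-CAD, a two-stage post-training framework that introduces depth and surface normal maps as geometric priors and adds a group length reward during reinforcement learning, improving both the geometric accuracy and the modeling conciseness of CAD models generated from a single image.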
Details
Motivation: Existing vision-language models have limited spatial reasoning when inferring 3D geometry from 2D images, so the generated CAD models are geometrically inaccurate and the modeling procedures are verbose.
Method: In the first stage, supervised fine-tuning uses a multi-channel input built from the RGB image together with depth and surface normal maps, providing dense geometric priors. In the second stage, reinforcement learning introduces a group length reward that encourages more compact parametric modeling sequences, with a dynamic weighting strategy to stabilize training.
Result: Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, with marked gains in code validity, geometric accuracy, and modeling conciseness.
Conclusion: GACO-CAD strengthens the spatial reasoning of multimodal large models for single-view CAD reconstruction, balancing high geometric fidelity with a concise parametric modeling procedure.
Abstract: Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.
[219] Glyph: Scaling Context Windows via Visual-Text Compression
Jiale Cheng,Yusen Liu,Xinyu Zhang,Yulin Fei,Wenyi Hong,Ruiliang Lyu,Weihan Wang,Zhe Su,Xiaotao Gu,Xiao Liu,Yushi Bai,Jie Tang,Hongning Wang,Minlie Huang
Main category: cs.CV
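TL;DR: This paper proposes Glyph, a framework that renders long text into images and processes them with a vision-language model, achieving 3-4x token compression; it substantially reduces compute and memory cost while keeping accuracy comparable to mainstream large language models, and speeds up both inference and training.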
Details
Motivation: Long-context modeling is increasingly important for tasks such as document understanding and code analysis, but extending the context window over token sequences incurs high compute and memory overhead that limits practical use, so a more efficient way to handle long contexts is needed.
Method: Glyph renders long text into images and processes them with a vision-language model (VLM); an LLM-driven genetic search automatically finds the optimal visual rendering configuration, balancing compression ratio against accuracy.
Result: Glyph achieves 3-4x token compression with performance close to leading LLMs such as Qwen3-8B on several long-context benchmarks. Prefilling and decoding are about 4x faster and SFT training about 2x faster. Under extreme compression, a 128K-context VLM can handle million-token-level tasks, and the rendered text data benefits real-world multimodal tasks.
Conclusion: By visualizing long text and processing it with a VLM, Glyph offers an efficient and practical paradigm for long-context modeling that significantly reduces resource consumption while remaining scalable.
Abstract: Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
[220] Investigating Adversarial Robustness against Preprocessing used in Blackbox Face Recognition
Roland Croft,Brian Du,Darcy Joseph,Sharath Kumar
Main category: cs.CV
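TL;DR: This study examines how different face preprocessing techniques affect the transferability of adversarial attacks on face recognition systems in a blackbox setting. The choice of face detection model strongly affects the attack success rate, whereas the interpolation method matters little. The authors also propose a preprocessing-invariant method based on input transformations that improves attack transferability by up to 27%.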
Details
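Motivation: Research on adversarial attacks against face recognition often overlooks preprocessing, especially in blackbox settings, yet preprocessing steps such as face detection and image resizing can significantly affect attack effectiveness. A systematic study of how preprocessing affects transferability, and of how to improve the generalization of adversarial examples, is therefore needed.
Method: The study evaluates the transferability of several off-the-shelf state-of-the-art adversarial attacks under different face detection models and interpolation methods, analyzes how preprocessing affects attack success in whitebox and blackbox settings, and proposes a preprocessing-invariant attack based on input transformations to improve transferability.
Result: The choice of face detection model can reduce the attack success rate by up to 78%, while the interpolation method has little effect. In the whitebox setting, preprocessing weakens attacks because of unintended interactions between the noise and the face detection model. The proposed input transformation method improves transferability by up to 27%.
Conclusion: Face preprocessing plays a key and often underestimated role in adversarial attacks. Designing for preprocessing invariance helps adversarial examples generalize and transfer across face recognition systems and should be a focus of future research.
Abstract: Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, exposing blind spots in these systems, as well as protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study seeks to investigate the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas the choice of interpolation method during downsampling has relatively minimal impact. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%.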
Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.
[221] Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
Feihong Yan,Peiru Wang,Yao Zhu,Kaiyu Pang,Qingyan Wei,Huiqi Li,Linfeng Zhang
Main category: cs.CV
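TL;DR: Proposes GtR, a training-free hierarchical sampling strategy that accelerates visual generation through two stages, structure generation and detail reconstruction, combined with Frequency-Weighted Token Selection (FTS) to allocate more computation to image details; efficiency improves significantly while generation quality is maintained.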
Details
Motivation: Masked autoregressive (MAR) models have the potential for parallel generation, but the complexity of modeling spatially correlated visual tokens in a single step keeps them from realizing their full acceleration potential.
Method: The Generation then Reconstruction (GtR) strategy splits generation into a structure generation stage and a detail reconstruction stage. Frequency-Weighted Token Selection (FTS) localizes image detail regions by high-frequency energy and allocates more computation to them.
Result: On ImageNet class-conditional and text-to-image generation, the method achieves a 3.72x speedup with FID 1.59 and IS 304.4, matching the original model's quality and outperforming existing acceleration methods.
Conclusion: GtR with FTS balances generation speed and quality, significantly improves the inference efficiency of MAR models, and applies across model scales and generation tasks.
Abstract: Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models owing to their ability to generate in parallel, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks.
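The idea of localizing detail regions by high-frequency energy can be sketched as follows. This is only an illustration under stated assumptions: it works on grayscale pixel patches rather than the model's tokens, and the function name `high_freq_energy`, the patch size, and the radial `cutoff` are hypothetical choices, not the FTS formulation from the paper.

```python
import numpy as np

def high_freq_energy(image, patch=16, cutoff=0.25):
    """Per-patch spectral energy above a radial frequency cutoff.

    image: (H, W) grayscale array with H and W divisible by `patch`.
    cutoff: radial frequency in cycles per pixel (axis-aligned Nyquist is 0.5).
    Returns an (H // patch, W // patch) array, one value per patch.
    """
    gh, gw = image.shape[0] // patch, image.shape[1] // patch
    # Radial frequency of every FFT bin inside a patch.
    fy = np.fft.fftfreq(patch)[:, None]
    fx = np.fft.fftfreq(patch)[None, :]
    high = np.sqrt(fy ** 2 + fx ** 2) >= cutoff
    energy = np.zeros((gh, gw))
    for i in range(gh):
        for j in range(gw):
            block = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            power = np.abs(np.fft.fft2(block)) ** 2
            energy[i, j] = power[high].sum()  # keep only high-frequency bins
    return energy

# Toy image: flat left half, checkerboard (maximal detail) right half.
image = np.zeros((32, 32))
image[:, 16:] = np.indices((32, 16)).sum(axis=0) % 2
energy = high_freq_energy(image)
# Patches ranked from most to least detailed (flattened grid indices).
ranked = np.argsort(-energy.ravel())
```

Patches at the front of `ranked` would be the ones given a larger computation budget in a scheme of this kind.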
Our code will be released at https://github.com/feihongyan1/GtR.
[222] Benchmarking Out-of-Distribution Detection for Plankton Recognition: A Systematic Evaluation of Advanced Methods in Marine Ecological Monitoring
Yingzi Han,Jiakai He,Chuanlong Xie,Jianping Li
Main category: cs.CV
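TL;DR: This paper presents the first large-scale, systematic evaluation of out-of-distribution detection methods for plankton recognition. It builds benchmarks covering multiple distribution-shift scenarios on the DYB-PlanktonNet dataset, systematically tests 22 OoD detection methods, and finds that ViM performs best in Far-OoD scenarios.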
Details
Motivation: Plankton have complex morphologies and high species diversity, and new species are continually discovered, so training and test data often differ in distribution, causing unpredictable errors in deployed models. Existing work lacks both an integration of recent OoD detection techniques and a unified large-scale evaluation benchmark.
Method: Based on the DYB-PlanktonNet dataset, the authors design a series of OoD benchmarks simulating different distribution-shift scenarios and systematically evaluate 22 OoD detection methods, focusing on Near-OoD and Far-OoD performance.
Result: ViM significantly outperforms other methods on the constructed benchmarks, with large gains on key metrics especially in Far-OoD scenarios, offering an effective solution for OoD detection in plankton recognition.
Conclusion: The study fills the gap in systematic OoD detection evaluation for plankton recognition and establishes the first large-scale unified benchmark, laying a foundation for algorithm selection and future research.
Abstract: Automated plankton recognition models face significant challenges during real-world deployment due to distribution shifts (Out-of-Distribution, OoD) between training and test data. This stems from plankton's complex morphologies, vast species diversity, and the continuous discovery of novel species, which leads to unpredictable errors during inference. Despite rapid advancements in OoD detection methods in recent years, the field of plankton recognition still lacks a systematic integration of the latest computer vision developments and a unified benchmark for large-scale evaluation. To address this, this paper meticulously designed a series of OoD benchmarks simulating various distribution shift scenarios based on the DYB-PlanktonNet dataset \cite{875n-f104-21}, and systematically evaluated twenty-two OoD detection methods. Extensive experimental results demonstrate that the ViM \cite{wang2022vim} method significantly outperforms other approaches in our constructed benchmarks, particularly excelling in Far-OoD scenarios with substantial improvements in key metrics. This comprehensive evaluation not only provides a reliable reference for algorithm selection in automated plankton recognition but also lays a solid foundation for future research in plankton OoD detection. To our knowledge, this study marks the first large-scale, systematic evaluation and analysis of Out-of-Distribution data detection methods in plankton recognition. Code is available at https://github.com/BlackJack0083/PlanktonOoD.
[223] Capturing Head Avatar with Hand Contacts from a Monocular Video
Haonan He,Yufeng Zheng,Jie Song
Main category: cs.CV
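TL;DR: This paper proposes a framework that jointly learns a 3D head avatar and the non-rigid deformations induced by hand-face interaction, addressing two key problems: pose tracking and deformation modeling.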
Details
Motivation: Most existing methods ignore natural hand-face interactions (such as resting the chin on a hand or lightly touching the cheek), yet these interactions convey cognitive states like pondering, so more complete avatar modeling is needed.
Method: A depth order loss and contact regularization capture the relative pose of hand and face accurately. Using a dataset of hand-face interactions, the method learns a PCA basis specific to hand-induced facial deformation, and a physics-inspired contact loss reduces interpenetration artifacts.
Result: Validated on RGB(D) videos captured with an iPhone and on a self-built synthetic dataset, the method recovers facial appearance and dynamically deforming geometry more accurately than state-of-the-art surface reconstruction methods.
Conclusion: The method achieves high-fidelity 3D head avatar reconstruction that includes hand-face interaction, improving realism and natural interaction in applications such as virtual reality and telepresence.
Abstract: Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions.
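The reduction from a full deformation field to a compact set of PCA parameters, described in the abstract above, can be sketched with plain NumPy. This is a generic PCA illustration, not the authors' pipeline: the helper names (`fit_pca_basis`, `encode`, `decode`), the flattened per-vertex offset layout, and the synthetic data are all assumptions.

```python
import numpy as np

def fit_pca_basis(deformations, n_components):
    """Fit a PCA basis to flattened deformation samples.

    deformations: (n_samples, n_vertices * 3) array of per-vertex offsets.
    Returns the mean deformation and an (n_components, n_vertices * 3) basis
    whose rows are orthonormal principal directions.
    """
    mean = deformations.mean(axis=0)
    centered = deformations - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def encode(deformation, mean, basis):
    """Project a deformation onto the basis, giving compact coefficients."""
    return basis @ (deformation - mean)

def decode(coeffs, mean, basis):
    """Rebuild a full deformation field from its coefficients."""
    return mean + basis.T @ coeffs

# Synthetic data: 50 deformations of a 10-vertex mesh driven by 3 latent factors.
rng = np.random.default_rng(0)
directions = rng.standard_normal((3, 30))
samples = rng.standard_normal((50, 3)) @ directions
mean, basis = fit_pca_basis(samples, n_components=3)
coeffs = encode(samples[0], mean, basis)  # 3 numbers instead of 30
recon = decode(coeffs, mean, basis)
```

With a basis like this, an optimizer only has to estimate the few entries of `coeffs` per frame, which is the kind of simplification the abstract attributes to the learned PCA prior.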
We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.
[224] HIDISC: A Hyperbolic Framework for Domain Generalization with Generalized Category Discovery
Vaibhav Rathore,Divyam Gupta,Biplab Banerjee
Main category: cs.CV
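TL;DR: Proposes HIDISC, a generalized category discovery framework based on hyperbolic representation learning that achieves domain-level and category-level generalization without episodic training and reaches SOTA on multiple benchmarks.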
Details
Motivation: Existing GCD methods rely on labeled and unlabeled data from the same domain and cope poorly with distribution shift in the open world; the existing DG-GCD method is computationally expensive and prone to error accumulation.
Method: The source domain is augmented with GPT-guided diffusion. A curvature-aware Tangent CutMix synthesizes pseudo-novel samples in tangent space. A unified loss combines penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion. A learnable curvature parameter adapts to data complexity.
Result: HIDISC achieves state-of-the-art performance on PACS, Office-Home, and DomainNet, clearly outperforming existing Euclidean and hyperbolic (DG)-GCD baselines.
Conclusion: Through hyperbolic-space modeling and geometry-aware data augmentation, HIDISC addresses the domain generalization and novel category discovery challenges of DG-GCD with both efficiency and strong generalization.
Abstract: Generalized Category Discovery (GCD) aims to classify test-time samples into either seen categories -- available during training -- or novel ones, without relying on label supervision. Most existing GCD methods assume simultaneous access during training to labeled and unlabeled data arising from the same domain, limiting applicability in open-world scenarios involving distribution shifts. Domain Generalization with GCD (DG-GCD) lifts this constraint by requiring models to generalize to unseen domains containing novel categories, without accessing target-domain data during training. The only prior DG-GCD method, DG2CD-Net, relies on episodic training with multiple synthetic domains and task vector aggregation, incurring high computational cost and error accumulation. We propose HIDISC, a hyperbolic representation learning framework that achieves domain and category-level generalization without episodic simulation. To expose the model to minimal but diverse domain variations, we augment the source domain using GPT-guided diffusion, avoiding overfitting while maintaining efficiency. To structure the representation space, we introduce Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space, preserving manifold consistency. A unified loss -- combining penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion -- facilitates compact, semantically structured embeddings. A learnable curvature parameter further adapts the geometry to dataset complexity.
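To make "interpolation in tangent space" concrete, here is a minimal sketch on the Poincaré ball. The abstract does not give the Tangent CutMix formulation, so everything here is an assumption: the choice of the Poincaré ball model, taking the tangent space at the origin, the plain linear mix with weight `lam`, and the function names. The logarithmic and exponential maps at the origin are the standard ones for curvature `-c`.

```python
import numpy as np

def log0(x, c=1.0):
    """Logarithmic map at the origin of the Poincare ball (curvature -c)."""
    sc = np.sqrt(c)
    n = np.linalg.norm(x, axis=-1, keepdims=True).clip(min=1e-12)
    # Clip keeps arctanh finite for points numerically on the boundary.
    return np.arctanh(np.clip(sc * n, 0.0, 1.0 - 1e-7)) * x / (sc * n)

def exp0(v, c=1.0):
    """Exponential map at the origin: tangent vector -> point in the ball."""
    sc = np.sqrt(c)
    n = np.linalg.norm(v, axis=-1, keepdims=True).clip(min=1e-12)
    return np.tanh(sc * n) * v / (sc * n)

def tangent_mix(x1, x2, lam, c=1.0):
    """Mix two hyperbolic embeddings linearly in the tangent space at the origin."""
    v = lam * log0(x1, c) + (1.0 - lam) * log0(x2, c)
    return exp0(v, c)  # the result always lies inside the ball

x1 = np.array([0.3, 0.4])
x2 = np.array([-0.5, 0.1])
mixed = tangent_mix(x1, x2, lam=0.5)
```

Because `tanh` is bounded by 1, the mixed point stays inside the ball of radius `1 / sqrt(c)` for any `lam`, which is one way to read the abstract's remark about preserving manifold consistency. With a learnable curvature, `c` would be a trained parameter rather than a constant.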
HIDISC achieves state-of-the-art results on PACS, Office-Home, and DomainNet, consistently outperforming the existing Euclidean and hyperbolic (DG)-GCD baselines.
[225] ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
Pu Zhang,Yuwei Li,Xingyuan Xian,Guoming Tang
Main category: cs.CV
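TL;DR: Proposes a zero-shot visual token pruning method that combines task relevance and information diversity, maintaining performance while pruning 90% of tokens and significantly reducing inference cost.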
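The two-step idea in this entry (first keep prompt-relevant tokens, then add diverse ones) can be sketched as below. This is a hypothetical illustration, not the paper's algorithm: cosine similarity to a single pooled prompt embedding as the relevance score, the greedy farthest-point rule for diversity, and the names `select_tokens` and `core_frac` are all assumptions.

```python
import numpy as np

def select_tokens(visual, prompt, keep, core_frac=0.5):
    """Pick `keep` visual tokens: relevant core first, then diverse extras.

    visual: (n_tokens, d) visual token embeddings; prompt: (d,) pooled prompt
    embedding. Assumes keep <= n_tokens. Returns sorted token indices.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    p = prompt / np.linalg.norm(prompt)
    relevance = v @ p  # cosine similarity of each token to the prompt
    n_core = max(1, int(keep * core_frac))
    chosen = np.argsort(-relevance)[:n_core].tolist()
    while len(chosen) < keep:
        # Highest similarity of every token to anything already chosen.
        sim = (v @ v[chosen].T).max(axis=1)
        sim[chosen] = np.inf  # never re-pick a chosen token
        chosen.append(int(np.argmin(sim)))  # add the least redundant token
    return sorted(chosen)

rng = np.random.default_rng(0)
visual = rng.standard_normal((100, 8))
prompt = rng.standard_normal(8)
kept = select_tokens(visual, prompt, keep=10)  # keeps 10% of the tokens
```

Keeping 10 of 100 tokens here mirrors the 90% pruning ratio mentioned in the TL;DR; how the budget is split between the relevant core and the diversity tokens is a free design choice in this sketch.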
Details
Motivation: Vision-language models suffer from visual token redundancy on large inputs, which drives up inference cost, and existing pruning methods ignore the guidance of the text prompt and fail to prioritize task relevance.
Method: A hierarchical, prompt-aware visual token pruning method first selects task-relevant visual tokens and then supplements them with diverse tokens to preserve contextual information.
Result: Experiments across multiple models and benchmarks show that the method matches or exceeds existing state-of-the-art methods with only minor accuracy loss, and significantly reduces GPU memory usage and inference latency.
Conclusion: The prompt-aware hierarchical pruning strategy balances task relevance and information diversity, yielding efficient and general visual token compression.
Abstract: As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.
[226] From Pixels to People: Satellite-Based Mapping and Quantification of Riverbank Erosion and Lost Villages in Bangladesh
M Saifuzzaman Rafat,Mohd Ruhul Ameen,Akif Islam,Abu Saleh Musa Miah,Jungpil Shin
Main category: cs.CV
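TL;DR: This study combines the Segment Anything Model (SAM) with manually annotated historical imagery to develop a high-precision riverbank erosion monitoring method for tracking land loss and vanished villages caused by river erosion in Bangladesh.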
Details
Motivation: Bangladesh's major rivers cause severe riverbank erosion every year, swallowing villages and farmland, but traditional monitoring is inefficient and struggles to track this slow yet continuous disaster accurately. An automated, high-precision monitoring tool is urgently needed to support policy making and disaster response.
Method: The team built a dataset of historical Google Earth imagery covering vulnerable regions of Bangladesh from 2003 to 2025 and manually annotated settlements that have disappeared. A color-channel analysis first gives a rough land-water segmentation, after which SAM's mask decoder is fine-tuned to recognize riverbank erosion features.
Result: The model achieves a mean Intersection over Union (IoU) of 86.30% and a Dice score of 92.60% on land-loss segmentation, significantly outperforming traditional methods and off-the-shelf deep learning models, and it identifies submerged settlements precisely with visual evidence.
Conclusion: The study provides the first annotated dataset of settlements in Bangladesh lost to river erosion, a dedicated AI model, and a quantitative analysis method, giving policymakers and disaster management agencies an effective tool to predict erosion trends and protect affected communities.
Abstract: The great rivers of Bangladesh, arteries of commerce and sustenance, are also agents of relentless destruction. Each year, they swallow whole villages and vast tracts of farmland, erasing communities from the map and displacing thousands of families. To track this slow-motion catastrophe has, until now, been a Herculean task for human analysts. Here we show how a powerful general-purpose vision model, the Segment Anything Model (SAM), can be adapted to this task with remarkable precision. To do this, we assembled a new dataset - a digital chronicle of loss compiled from historical Google Earth imagery of Bangladesh's most vulnerable regions, including Mokterer Char Union, Kedarpur Union, Balchipara village, and Chowhali Upazila, from 2003 to 2025. Crucially, this dataset is the first to include manually annotated data on the settlements that have vanished beneath the water. Our method first uses a simple color-channel analysis to provide a rough segmentation of land and water, and then fine-tunes SAM's mask decoder to recognize the subtle signatures of riverbank erosion. The resulting model demonstrates a keen eye for this destructive process, achieving a mean Intersection over Union of 86.30% and a Dice score of 92.60% - a performance that significantly surpasses traditional methods and off-the-shelf deep learning models.
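For readers less familiar with the two segmentation metrics quoted above, a minimal sketch of how IoU and Dice are computed for a pair of binary masks follows. The function name `iou_and_dice` and the convention of returning 1.0 when both masks are empty are assumptions; the paper's evaluation code is not shown in the abstract.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Intersection over Union and Dice score for two binary masks."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention (an assumption): two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return iou, dice

pred = np.array([[1, 1, 0, 0]])
target = np.array([[0, 1, 1, 0]])
iou, dice = iou_and_dice(pred, target)  # intersection 1, union 3, total 4
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU), so an IoU of 0.863 would correspond to a Dice of about 0.9265. The reported 92.60% is close to that; averages taken over many images need not satisfy the identity exactly.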
This work delivers three key contributions: the first annotated dataset of disappeared settlements in Bangladesh due to river erosion; a specialized AI model fine-tuned for this critical task; and a method for quantifying land loss with compelling visual evidence. Together, these tools provide a powerful new lens through which policymakers and disaster management agencies can monitor erosion, anticipate its trajectory, and ultimately protect the vulnerable communities in its path.
[227] Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis
Nirai Hayakawa,Kazumasa Shimari,Kazuma Yamasaki,Hirotatsu Hoshikawa,Rikuto Tsuchida,Kenichi Matsumoto
Main category: cs.CV
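TL;DR: This paper proposes a round outcome prediction model for VALORANT based on analysis of minimap information in match footage; incorporating tactical features such as character positions and in-game events into the TimeSformer video recognition model significantly improves prediction accuracy.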
Details
Motivation: Existing research on esports outcome prediction relies mostly on match logs and statistics, which capture complex strategy poorly. This paper therefore extracts detailed tactical features from the minimap in match footage to improve round outcome prediction in the FPS game VALORANT.
Method: Built on the TimeSformer video recognition model, the round outcome prediction model incorporates tactical features from the minimap, such as character positions and other in-game events, and is trained and evaluated on a dataset annotated with tactical events.
Result: The model trained on the dataset augmented with tactical event labels achieves about 81% prediction accuracy from the middle phases of a round onward, significantly outperforming a model that uses only the minimap information.
Conclusion: Modeling tactical features from match footage effectively improves round outcome prediction in VALORANT, suggesting that extracting fine-grained tactical information from video is a promising direction for esports prediction.
Abstract: Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.
[228] EndoCIL: A Class-Incremental Learning Framework for Endoscopic Image Classification
Bingrong Liu,Jun Shi,Yushan Zheng
Main category: cs.CV
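TL;DR: Proposes EndoCIL, a class-incremental learning framework that addresses catastrophic forgetting, domain discrepancy, and class imbalance in endoscopic image analysis, outperforming existing methods on multiple public datasets.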
Details
Motivation: Existing replay-based class-incremental learning methods struggle to mitigate catastrophic forgetting in endoscopic image analysis because of severe domain discrepancies and class imbalance, limiting continual learning in clinical applications.
Method: EndoCIL has three core components: Maximum Mean Discrepancy Based Replay (MDBR), which uses a distribution-aligned greedy strategy to select diverse and representative exemplars; Prior Regularized Class Balanced Loss (PRCBL), which alleviates inter-phase and intra-phase class imbalance by introducing prior class distributions and balance weights; and Calibration of Fully-Connected Gradients (CFG), which adjusts classifier gradients to reduce bias toward new classes.
Result: Experiments on four public endoscopic datasets show that EndoCIL generally outperforms state-of-the-art class-incremental learning methods across buffer sizes and evaluation metrics, balancing stability and plasticity.
Conclusion: EndoCIL provides an effective class-incremental learning framework for endoscopic image diagnosis with good potential for clinical scalability and deployment.
Abstract: Class-incremental learning (CIL) for endoscopic image analysis is crucial for real-world clinical applications, where diagnostic models should continuously adapt to evolving clinical data while retaining performance on previously learned ones. However, existing replay-based CIL methods fail to effectively mitigate catastrophic forgetting due to severe domain discrepancies and class imbalance inherent in endoscopic imaging. To tackle these challenges, we propose EndoCIL, a novel and unified CIL framework specifically tailored for endoscopic image diagnosis. EndoCIL incorporates three key components: Maximum Mean Discrepancy Based Replay (MDBR), employing a distribution-aligned greedy strategy to select diverse and representative exemplars; Prior Regularized Class Balanced Loss (PRCBL), designed to alleviate both inter-phase and intra-phase class imbalance by integrating prior class distributions and balance weights into the loss function; and Calibration of Fully-Connected Gradients (CFG), which adjusts the classifier gradients to mitigate bias toward new classes. Extensive experiments conducted on four public endoscopic datasets demonstrate that EndoCIL generally outperforms state-of-the-art CIL methods across varying buffer sizes and evaluation metrics. The proposed framework effectively balances stability and plasticity in lifelong endoscopic diagnosis, showing promising potential for clinical scalability and deployment.
[229] Optimizing DINOv2 with Registers for Face Anti-Spoofing
Mika Feng,Pierre Gallin-Martel,Koichi Ito,Takafumi Aoki
Main category: cs.CV
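TL;DR: This paper proposes a DINOv2-based liveness detection method that picks up minute differences between real faces and spoofed face photos, defending against face spoofing attacks.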
Details
Motivation: Existing face recognition systems are vulnerable to spoofing attacks using forged face photos, so such attacks must be detected accurately before recognition to ensure security.
Method: DINOv2 with registers extracts generalizable features and suppresses perturbations in the attention mechanism, focusing the model on essential, minute features that distinguish real from spoofed faces.
Result: Experiments on the dataset provided by the 6th Face Anti-Spoofing Workshop at ICCV 2025 and on the SiW dataset show that the method detects spoofing attacks effectively and outperforms existing methods.
Conclusion: The DINOv2-based method performs well on face spoofing detection, with strong robustness and generalization, and suits liveness detection in practical applications.
Abstract: Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by "The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025" and the SiW dataset.
[230] When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions
Zhuo Cao,Heming Du,Bingqing Zhang,Xin Yu,Xue Li,Sen Wang
Main category: cs.CV
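TL;DR: This paper presents QV-M$^2$, a new dataset for multi-moment retrieval (MMR), and a framework called FlashMMR that improves video temporal grounding through a multi-moment post-verification module.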
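To make the multi-moment setting concrete, the sketch below prunes low-confidence proposals and then matches the survivors against several ground-truth moments by temporal IoU. It is only an illustration under stated assumptions: the `(start, end, score)` proposal format, the `min_conf` and `iou_thresh` values, and the greedy one-to-one matching are hypothetical choices, not the FlashMMR post-verification module or the paper's G-mAP, mAP@3+tgt, or mR@3 metrics.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def prune_and_match(proposals, ground_truth, min_conf=0.5, iou_thresh=0.5):
    """Drop low-confidence proposals, then greedily match the rest to moments.

    proposals: list of (start, end, score); ground_truth: list of (start, end).
    Returns the kept proposals and the fraction of ground-truth moments found.
    Thresholds and the greedy matching are illustrative assumptions.
    """
    kept = sorted(
        (p for p in proposals if p[2] >= min_conf),
        key=lambda p: p[2],
        reverse=True,
    )
    matched = set()
    for start, end, _ in kept:
        for gi, moment in enumerate(ground_truth):
            if gi not in matched and temporal_iou((start, end), moment) >= iou_thresh:
                matched.add(gi)  # each ground-truth moment is matched at most once
                break
    recall = len(matched) / len(ground_truth) if ground_truth else 0.0
    return kept, recall
```

With three ground-truth moments and one proposal below the confidence threshold, at most two moments can be recovered, which is exactly the kind of miss a single-moment metric would never surface.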
Details
Motivation: Existing single-moment retrieval methods cannot handle the practical case where one query corresponds to multiple relevant moments, which leaves a gap between current tasks and real-world use.
Method: Introduces the QV-M$^2$ dataset together with evaluation metrics suited to MMR, and designs the FlashMMR framework, which combines constrained temporal adjustment with a multi-moment post-verification module to refine moment boundaries.
Result: On QV-M$^2$, FlashMMR improves over the prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3.
Conclusion: QV-M$^2$ and FlashMMR provide an effective benchmark and a strong baseline for more realistic video temporal grounding research.
Abstract: Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over the prior SOTA method of 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
[231] Fair and Interpretable Deepfake Detection in Videos
Akihito Yoshii,Ryosuke Sonoda,Ramya Srinivasan
Main category: cs.CV
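TL;DR: Proposes a fairness-aware deepfake detection framework that combines temporal feature learning with demographic-aware data augmentation to improve the fairness, interpretability, and generalization of detection.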
Details
Motivation: Existing deepfake detection methods are biased, lack transparency, and fail to capture temporal information, which leads to unfair decisions and unreliable results across demographic groups.
Method: Uses sequence-based clustering for temporal modeling, adds concept extraction for interpretability, and proposes a demographic-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve forgery artifacts.
Result: Experiments on FaceForensics++, DFD, Celeb-DF, and DFDC with SoTA architectures such as Xception and ResNet show that the method achieves the best tradeoff between fairness and accuracy.
Conclusion: The framework improves the fairness, interpretability, and generalization of deepfake detection and outperforms existing state-of-the-art methods.
Abstract: Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.
[232] FineVision: Open Data Is All You Need
Luis Wiedmann,Orr Zohar,Amir Mahla,Xiaohan Wang,Rui Li,Thibaud Frere,Leandro von Werra,Aritra Roy Gosthipaty,Andrés Marafioti
Main category: cs.CV
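TL;DR: FineVision is a carefully collected, curated, and unified dataset of 24 million samples for vision-language models. It merges more than 200 sources through a semi-automated, human-in-the-loop pipeline with rigorous de-duplication and decontamination, and models trained on it perform markedly better.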
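The de-duplication and decontamination steps can be pictured with a small sketch. Everything here is a simplifying assumption: the function names are hypothetical, and exact SHA-256 matching on raw bytes stands in for whatever similarity test the FineVision pipeline actually uses, which the abstract does not specify and which would likely need to catch near-duplicates as well.

```python
import hashlib


def fingerprint(sample_bytes):
    """Exact-content fingerprint; a stand-in for a real similarity measure."""
    return hashlib.sha256(sample_bytes).hexdigest()


def clean_corpus(sources, benchmark_samples):
    """Drop benchmark samples and duplicates within and across sources.

    sources: dict mapping source name -> list of raw samples (bytes).
    benchmark_samples: iterable of raw evaluation samples (bytes) to exclude.
    Returns a list of (source_name, sample) pairs in first-seen order.
    """
    banned = {fingerprint(b) for b in benchmark_samples}
    seen = set()
    kept = []
    for name, samples in sources.items():
        for sample in samples:
            fp = fingerprint(sample)
            if fp in banned or fp in seen:
                continue  # contaminated or already kept from an earlier source
            seen.add(fp)
            kept.append((name, sample))
    return kept
```

Sharing one `seen` set across all sources is what makes the de-duplication cross-source rather than per-source.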
Details
Motivation: Research on vision-language models is held back by fragmented, inconsistent, and contaminated public datasets, so a high-quality, large-scale, unified data resource is needed.
Method: A semi-automated, human-in-the-loop pipeline merges more than 200 data sources into 185 subsets, combining automated processing with human review, and applies cross-source de-duplication and decontamination against 66 public benchmarks. The corpus also covers agentic/GUI tasks, whose execution fidelity is validated.
Result: Models trained on FineVision consistently outperform models trained on existing open mixtures across a broad evaluation suite, which supports the value of data scale, data hygiene, and the human-in-the-loop process.
Conclusion: FineVision is the largest open vision-language dataset to date; its construction shows the importance of data quality and curation at scale and supports data-centric VLM research.
Abstract: The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples, the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
[233] Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models
Katie Luo,Jingwei Ji,Tong He,Runsheng Xu,Yichen Xie,Dragomir Anguelov,Mingxing Tan
Main category: cs.CV
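TL;DR: Proposes Plug-and-Forecast (PnF), a plug-and-play method that augments existing motion forecasting models with multimodal large language models (MLLMs), using natural language to understand complex scenes and improving prediction performance without fine-tuning.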
Details
Motivation: Existing autonomous driving systems generalize poorly across diverse real-world scenarios and are hard to extend at low cost, so a general solution that adapts quickly to complex behaviors is needed.
Method: Prompts are designed to extract structured scene understanding from MLLMs, and this information is distilled into learnable embeddings that augment existing behavior prediction models. The whole process requires no fine-tuning.
Result: The method is validated on the Waymo Open Motion and nuScenes datasets, where it yields consistent improvements on two state-of-the-art motion forecasting models.
Conclusion: By bringing in the zero-shot reasoning ability of MLLMs, PnF offers a practical and efficient way to enhance motion forecasting, with good generalization and application prospects.
Abstract: Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance while requiring no fine-tuning, which makes it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.
[234] SG-CLDFF: A Novel Framework for Automated White Blood Cell Classification and Segmentation
Mehdi Zekriyapanah Gashti,Mostafa Mohammadpour,Ghasem Farjamnia
Main category: cs.CV
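TL;DR: Proposes SG-CLDFF, a saliency-guided cross-layer deep feature fusion framework that improves the accuracy and interpretability of white blood cell segmentation and classification.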
Details
Motivation: Segmenting and classifying white blood cells in microscopic images is essential for diagnosing blood disorders, but staining variability, complex backgrounds, and class imbalance still challenge existing methods.
Method: The SG-CLDFF framework uses saliency priors to guide feature extraction, an EfficientSwin-style backbone to extract multi-scale features, and a ResNeXt-CC-inspired cross-layer fusion module to aggregate shallow and deep features. A multi-task structure jointly optimizes segmentation and classification, with class-aware losses and saliency-alignment regularization to mitigate class imbalance and suppress background activation.
Result: On the standard BCCD, LISC, and ALL-IDB datasets, SG-CLDFF outperforms mainstream CNN and transformer baselines in IoU, F1 score, and classification accuracy. Ablations show that both saliency preprocessing and cross-layer fusion improve performance.
Conclusion: SG-CLDFF offers an efficient and interpretable approach to white blood cell analysis and supports reliable clinical use of automated blood cell analysis.
Abstract: Accurate segmentation and classification of white blood cells (WBCs) in microscopic images are essential for diagnosis and monitoring of many hematological disorders, yet remain challenging due to staining variability, complex backgrounds, and class imbalance. In this paper, we introduce a novel Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF) that tightly integrates saliency-driven preprocessing with multi-scale deep feature aggregation to improve both robustness and interpretability for WBC analysis. SG-CLDFF first computes saliency priors to highlight candidate WBC regions and guide subsequent feature extraction. A lightweight hybrid backbone (EfficientSwin-style) produces multi-resolution representations, which are fused by a ResNeXt-CC-inspired cross-layer fusion module to preserve complementary information from shallow and deep layers. The network is trained in a multi-task setup with concurrent segmentation and cell-type classification heads, using class-aware weighted losses and saliency-alignment regularization to mitigate imbalance and suppress background activation. Interpretability is enforced through Grad-CAM visualizations and saliency consistency checks, allowing model decisions to be inspected at the regional level. We validate the framework on standard public benchmarks (BCCD, LISC, ALL-IDB), reporting consistent gains in IoU, F1, and classification accuracy compared to strong CNN and transformer baselines. An ablation study also demonstrates the individual contributions of saliency preprocessing and cross-layer fusion. SG-CLDFF offers a practical and explainable path toward more reliable automated WBC analysis in clinical workflows.
[235] Machine Vision-Based Surgical Lighting System: Design and Implementation
Amir Gharghabi,Mahdi Hakiminezhad,Maryam Shafaei,Shaghayegh Gharghabi
Main category: cs.CV
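TL;DR: Proposes an automated surgical lighting system based on YOLOv11 that detects a blue marker above the surgical site and steers an LED light source to it, improving illumination consistency and reducing surgeon fatigue.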
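One way to picture the step from a detected marker to a servo command is a pinhole-camera conversion from the bounding-box center to pan and tilt angles. This is a sketch under assumptions the entry does not state: the camera is taken to be aligned with the light's pan-tilt axes, the focal length in pixels (`focal_px`) is assumed known from calibration, and the function name is hypothetical. The actual controller in the paper may work differently.

```python
import math


def pan_tilt_from_bbox(bbox, image_size, focal_px):
    """Convert a detection box to pan/tilt offsets in degrees.

    bbox: (x1, y1, x2, y2) in pixels; image_size: (width, height).
    focal_px: focal length in pixels (assumed known from calibration).
    Positive pan means the marker is right of the image center;
    positive tilt means it is below the center (image y grows downward).
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size
    dx = (x1 + x2) / 2.0 - width / 2.0   # horizontal offset from center
    dy = (y1 + y2) / 2.0 - height / 2.0  # vertical offset from center
    pan = math.degrees(math.atan2(dx, focal_px))
    tilt = math.degrees(math.atan2(dy, focal_px))
    return pan, tilt
```

A marker detected exactly at the image center yields zero offsets, so the servos hold position; any other detection produces a signed correction for each axis.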
Details
Motivation: Traditional surgical lighting relies on manual adjustment, which causes surgeon fatigue, neck strain, and unstable illumination, affecting surgical precision and safety.
Method: YOLOv11 detects a blue spherical marker placed above the target area in simulated surgical scenes, and a servo-driven pan-tilt mount points a high-power LED light source at it automatically.
Result: The YOLO model reaches 96.7% mAP@50 on the validation set, and the system achieves accurate positioning and stable illumination.
Conclusion: This machine vision-driven lighting approach can reduce the physical burden on surgeons and improve intraoperative illumination consistency, which may help improve surgical outcomes.
Abstract: Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.
[236] Exploring Structural Degradation in Dense Representations for Self-supervised Learning
Siran Dai,Qianqian Xu,Peisong Wen,Yang Liu,Qingming Huang
Main category: cs.CV
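TL;DR: This paper identifies a counterintuitive phenomenon in self-supervised learning: longer training can degrade performance on dense prediction tasks, termed Self-supervised Dense Degradation (SDD). It proposes a Dense representation Structure Estimator (DSE) to assess and mitigate the problem.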
Details
Motivation: Long training in self-supervised learning is observed to hurt dense prediction performance, so representation quality must be measured during training to support model selection and optimization.
Method: Proposes the Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure, and builds a model selection strategy and a regularization method on top of it.
Result: Across 16 SSL methods and 4 benchmarks, DSE correlates closely with downstream performance, model selection improves mIoU by 3.0% on average, and DSE regularization mitigates SDD.
Conclusion: SDD is widespread in self-supervised learning. DSE provides an effective way to evaluate dense-task representations without annotations and clearly improves downstream dense-task performance.
Abstract: In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by 3.0% on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at https://github.com/EldercatSAM/SSL-Degradation.
[237] LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
ZhaoYang Han,Qihan Lin,Hao Liang,Bowen Chen,Zhou Liu,Wentao Zhang
Main category: cs.CV
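TL;DR: This paper introduces LongInsightBench, the first benchmark for evaluating how well models understand long videos, focusing on language, viewpoints, actions, and other contextual elements while integrating visual, audio, and text modalities. It contains about 1,000 information-dense long videos covering lectures, interviews, and vlogs, six challenging task scenarios, and a rigorous quality assurance pipeline. Experiments show that current multimodal models still struggle with precise temporal localization and long-range causal inference, and suffer from information loss and fusion bias.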
Details
Motivation: Existing video understanding benchmarks mostly cover short videos with low information density, which makes it hard to evaluate understanding of long, complex multimodal content, especially language, behavior, and contextual reasoning. A benchmark dedicated to long, information-dense videos with fused multimodal signals is needed.
Method: LongInsightBench has three core parts: 1) about 1,000 long videos selected from FineVideo that meet duration and information-density requirements; 2) six challenging tasks covering intra-event and inter-event reasoning; 3) a three-step, semi-automated pipeline that ensures the quality of questions and answer options. Several experiments on this benchmark evaluate mainstream multimodal models.
Result: Current multimodal models perform poorly on tasks that require precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal information loss and processing bias during multimodal fusion.
Conclusion: LongInsightBench provides an effective and challenging benchmark for long-video understanding, exposes the weaknesses of existing models in temporal localization and causal reasoning, and underlines the need for better multimodal fusion mechanisms.
Abstract: We introduce LongInsightBench, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Information-Dense Videos: We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on a duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. b) Diverse and Challenging Task Scenarios: We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.
[238] CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference
Sangyoon Bae,Jiook Cha
Main category: cs.CV
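TL;DR: This paper proposes CausalMamba, a scalable framework for causal inference on fMRI data. It splits the problem into two stages, BOLD signal deconvolution and causal graph inference with a conditional Mamba architecture, and substantially improves the accuracy and practicality of neural causal inference.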
Details
Motivation: Addresses two core problems in fMRI-based causal inference: inferring neural causality from hemodynamically distorted BOLD signals is ill-posed, and existing methods such as DCM are computationally intractable.
Method: The complex inverse problem is decomposed into two stages: BOLD deconvolution first recovers latent neural activity, then a novel Conditional Mamba architecture infers the causal graph.
Result: On simulated data, accuracy is 37% higher than DCM. On real task fMRI data, known neural pathways are recovered with 88% fidelity, whereas conventional methods fail to identify them in over 99% of subjects. Analysis of working memory data shows that the brain flexibly switches its primary causal hub network depending on the stimulus.
Conclusion: CausalMamba gives neuroscientists a practical tool for large-scale causal inference that captures both basic neural circuit patterns and the dynamic network reorganization underlying cognitive function.
Abstract: We introduce CausalMamba, a scalable framework that addresses fundamental limitations in fMRI-based causal inference: the ill-posed nature of inferring neural causality from hemodynamically distorted BOLD signals and the computational intractability of existing methods like Dynamic Causal Modeling (DCM). Our approach decomposes this complex inverse problem into two tractable stages: BOLD deconvolution to recover latent neural activity, followed by causal graph inference using a novel Conditional Mamba architecture. On simulated data, CausalMamba achieves 37% higher accuracy than DCM. Critically, when applied to real task fMRI data, our method recovers well-established neural pathways with 88% fidelity, whereas conventional approaches fail to identify these canonical circuits in over 99% of subjects. Furthermore, our network analysis of working memory data reveals that the brain strategically shifts its primary causal hub, recruiting executive or salience networks depending on the stimulus, a sophisticated reconfiguration that remains undetected by traditional methods. This work provides neuroscientists with a practical tool for large-scale causal inference that captures both fundamental circuit motifs and flexible network dynamics underlying cognitive function.
[239] A Single Set of Adversarial Clothes Breaks Multiple Defense Methods in the Physical World
Wei Zhang,Zhanhao Hu,Xiao Li,Xiaopei Zhu,Xiaolin Hu
Main category: cs.CV
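TL;DR: This paper examines how well existing defenses against adversarial patches hold up against large-coverage adversarial clothes. The defenses perform poorly in both the digital and the physical world, and a single set of adversarial clothes breaks multiple defense mechanisms.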
Details
Motivation: Experiments showed that simply enlarging the patch makes existing defenses fail, so the authors set out to evaluate the robustness of these defenses against adversarial clothes, which look more natural and cover a larger area.
Method: The authors design adversarial clothes and test them against multiple defense methods, including a single set of clothes evaluated against several defended models on Faster R-CNN, with experiments in both the digital and the physical world.
Result: All tested defenses degrade substantially under adversarial clothes. In the physical world, one set of clothes achieves a 96.06% attack success rate against the undefended detector and more than 64.84% against each of nine defended models.
Conclusion: Existing adversarial defenses share a common vulnerability to large-coverage, natural-looking adversarial clothes, and more robust defense mechanisms are needed.
Abstract: In recent years, adversarial attacks against deep learning-based object detectors in the physical world have attracted much attention. To defend against these attacks, researchers have proposed various defense methods against adversarial patches, a typical form of physically-realizable attack. However, our experiments showed that simply enlarging the patch size could make these defense methods fail. Motivated by this, we evaluated various defense methods against adversarial clothes which have large coverage over the human body. Adversarial clothes provide a good test case for adversarial defense against patch-based attacks because they not only have large sizes but also look more natural than a large patch on humans. Experiments show that all the defense methods had poor performance against adversarial clothes in both the digital world and the physical world. In addition, we crafted a single set of clothes that broke multiple defense methods on Faster R-CNN. The set achieved an Attack Success Rate (ASR) of 96.06% against the undefended detector and over 64.84% ASRs against nine defended models in the physical world, unveiling the common vulnerability of existing adversarial defense methods against adversarial clothes. Code is available at: https://github.com/weiz0823/adv-clothes-break-multiple-defenses.
[240] CharDiff: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration
Gyuhwan Park,Kihyun Na,Injung Kim
Main category: cs.CV
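TL;DR: This paper proposes CharDiff, a diffusion-based framework with character-level guidance for restoring and recognizing severely degraded license plate images. A region-wise masking mechanism provides precise character-level attention, which markedly improves restoration quality and recognition accuracy under realistic conditions.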
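The idea of restricting each character's guidance to its own region can be illustrated with a masked softmax over attention scores. This is a minimal sketch, not the CHARM module itself: the function name, the nested-list layout (one row per image position, one column per character), and the choice to give zero weight to positions outside every region are assumptions made for illustration.

```python
import math


def region_masked_attention(scores, region_mask):
    """Softmax over characters, restricted to characters whose region covers the position.

    scores[p][c]: raw similarity between image position p and character c.
    region_mask[p][c]: True if position p lies inside character c's region.
    Positions outside every region receive all-zero weights.
    """
    weights = []
    for row, mask in zip(scores, region_mask):
        allowed = [s for s, m in zip(row, mask) if m]
        if not allowed:
            weights.append([0.0] * len(row))
            continue
        top = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row, mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

In the first row of the usage below, the second character has the higher raw score but lies outside the position's region, so it receives no weight; that is the interference the masking is meant to prevent.

```python
scores = [[1.0, 5.0], [2.0, 2.0]]
mask = [[True, False], [True, True]]
print(region_masked_attention(scores, mask))
```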
Details
Motivation: License plate image restoration matters not only for license plate recognition systems but also for evidential value and visual clarity. Existing methods fall short on severely degraded images, so a more robust approach is needed.
Method: The CharDiff framework extracts character-level priors through external segmentation and OCR modules, and its CHARM module uses region-wise masking so that each character's guidance applies only to its own region.
Result: On the Roboflow-LP dataset, CharDiff achieves a 28% relative reduction in CER compared with the best baseline model.
Conclusion: Structured character-level guidance effectively strengthens the robustness of diffusion-based license plate restoration and recognition in practical use.
Abstract: The significance of license plate image restoration goes beyond the preprocessing stage of License Plate Recognition (LPR) systems, as it also serves various purposes, including increasing evidential value, enhancing the clarity of visual interface, and facilitating further utilization of license plate images. We propose a novel diffusion-based framework with character-level guidance, CharDiff, which effectively restores and recognizes severely degraded license plate images captured under realistic conditions. CharDiff leverages fine-grained character-level priors extracted through external segmentation and Optical Character Recognition (OCR) modules tailored for low-quality license plate images. For precise and focused guidance, CharDiff incorporates a novel Character-guided Attention through Region-wise Masking (CHARM) module, which ensures that each character's guidance is restricted to its own region, thereby avoiding interference with other regions. In experiments, CharDiff significantly outperformed the baseline restoration models in both restoration quality and recognition accuracy, achieving a 28% relative reduction in CER on the Roboflow-LP dataset, compared to the best-performing baseline model. These results indicate that the structured character-guided conditioning effectively enhances the robustness of diffusion-based license plate restoration and recognition in practical deployment scenarios.
[241] iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA
Zhaoran Zhao,Xinli Yue,Jianhui Sun,Yuhao Xie,Tao Shao,Liangchao Yao,Fan Xia,Yuetang Deng
Main category: cs.CV
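TL;DR: This paper proposes iDETEX, a unified multimodal large language model that handles three key image quality assessment tasks (quality grounding, perception, and description) and achieves state-of-the-art performance on the large-scale ViDA-UGC benchmark.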
Details
Motivation: To address the challenge of fine-grained, explainable image quality assessment (IQA) and move IQA from scalar prediction toward evaluation paradigms that better match human perception.
Method: Designs iDETEX, a unified multimodal large language model, with task-specific offline augmentation modules and a data mixing strategy, complemented by online enhancement strategies that make full use of multi-source supervision.
Result: On the ViDA-UGC benchmark, iDETEX reaches state-of-the-art performance on all subtasks and ranks first in the ICCV MIPI 2025 challenge.
Conclusion: iDETEX delivers accurate, interpretable image quality assessments effectively and robustly, showing strong capability on detailed IQA tasks.
Abstract: Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX, a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.
[242] Nearest-Class Mean and Logits Agreement for Wildlife Open-Set Recognition
Jiahao Huo,Mufhumudzi Muthivhi,Terence L. van Zyl,Fredrik Gustafsson
Main category: cs.CV
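TL;DR: Proposes a post-processing open-set recognition method based on the Nearest Class Mean (NCM) that measures agreement between features and predicted logits, achieving strong and stable performance on wildlife classification.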
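The agreement idea can be sketched in a few lines: build one distribution from distances to the class means, build another from the logits, and score how much they overlap. The specifics below are assumptions for illustration only: Euclidean distance, a softmax over negative distances, and histogram intersection as the agreement measure are stand-ins, since the entry does not state the exact distribution or comparison the authors use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ncm_distribution(feature, class_means):
    """Class probabilities from distances to class means (closer = more likely)."""
    return softmax([-math.dist(feature, mean) for mean in class_means])


def agreement_score(feature, logits, class_means):
    """Overlap in [0, 1] between the NCM-based and logit-based distributions.

    A low score means the feature space and the classification head disagree,
    which this sketch treats as evidence of an unknown class.
    """
    p_ncm = ncm_distribution(feature, class_means)
    p_head = softmax(logits)
    return sum(min(a, b) for a, b in zip(p_ncm, p_head))
```

Because the score needs only stored class means plus the features and logits of a frozen classifier, the model is never retrained, which is the property the entry emphasizes. A rejection threshold on the score would be chosen on held-out data.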
Details
Motivation: Existing open-set recognition methods usually require retraining the model and perform inconsistently across datasets, so a post-processing method that needs no retraining and generalizes well is desirable.
Method: Proposes a probability distribution based on an input's distance to its Nearest Class Mean (NCM) and compares it with the softmax probabilities to measure agreement between the NCM and the classification head, which enables open-set recognition.
Result: Achieves AUROC of 93.41 and 95.35 on the African and Swedish animal datasets respectively, ranking in the top three with stable performance across datasets.
Conclusion: The method is an effective post-processing strategy for open-set recognition that needs no retraining and performs well and consistently across datasets.
Abstract: Current state-of-the-art wildlife classification models are trained under the closed world setting. When exposed to unknown classes, they remain overconfident in their predictions. Open-set Recognition (OSR) aims to classify known classes while rejecting unknown samples. Several OSR methods have been proposed to model the closed-set distribution by observing the feature, logit, or softmax probability space. A significant drawback of many existing approaches is the requirement to retrain the pre-trained classification model with the OSR-specific strategy. This study contributes a post-processing OSR method that measures the agreement between the model's features and predicted logits. We propose a probability distribution based on an input's distance to its Nearest Class Mean (NCM). The NCM-based distribution is then compared with the softmax probabilities from the logit space to measure agreement between the NCM and the classification head. Our proposed strategy ranks within the top three on two evaluated datasets, showing consistent performance across the two datasets. In contrast, current state-of-the-art methods excel on a single dataset. We achieve an AUROC of 93.41 and 95.35 for African and Swedish animals. The code can be found at https://github.com/Applied-Representation-Learning-Lab/OSR.
[243] Exploring The Missing Semantics In Event Modality
Jingqian Wu,Shengpeng Xu,Yunbo Jia,Edmund Y. Lam
Main category: cs.CV
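TL;DR: This paper proposes Semantic-E2VID, an event-to-video reconstruction framework. A cross-modal feature alignment module and a semantic-aware feature fusion block bring semantic knowledge from a frame-based vision foundation model (SAM) into event camera data, and a new semantic-aware perceptual supervision strategy further improves the quality and semantic detail of reconstructed video, outperforming existing methods on multiple benchmarks.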
Details
Motivation: Event cameras capture only brightness changes and lack semantic information about static objects and backgrounds, so existing event-to-video reconstruction methods recover semantics poorly. External semantic knowledge is needed to improve reconstruction quality.
Method: The Semantic-E2VID framework includes a cross-modal feature alignment (CFA) module that transfers semantic knowledge from the frame-based vision foundation model SAM to the event encoder, and a semantic-aware feature fusion (SFF) block that integrates the semantic features. A semantic-aware E2V supervision mechanism uses SAM-generated categorical labels to guide training.
Result: Semantic-E2VID clearly outperforms existing state-of-the-art E2V methods on multiple benchmarks, improving the visual quality and semantic completeness of reconstructed frames.
Conclusion: Bringing in external semantic knowledge and designing effective feature fusion and supervision mechanisms substantially reduces the semantic gap in event-based video reconstruction and suggests a direction for future event-based vision tasks.
Abstract: Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic feature, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.
[244] M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
U. V. B. L Udugama,George Vosselman,Francesco Nex
Main category: cs.CV
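TL;DR: This paper proposes Multi-Mono-Hydra (M2H), a multi-task learning framework that performs semantic segmentation, depth estimation, edge detection, and surface normal estimation from a single monocular image. Built on a lightweight ViT-based DINOv2 backbone and a window-based cross-task attention module, it improves the consistency and performance of multi-task prediction while staying computationally efficient.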
Details
Motivation: Real-time spatial perception on edge devices needs efficient multi-task models that exploit complementary information across tasks while minimizing computational overhead. Conventional approaches use either independent single-task models or shared encoder-decoder structures and struggle to balance performance and efficiency.
Method: The Multi-Mono-Hydra (M2H) framework introduces a Window-Based Cross-Task Attention Module for structured feature exchange that preserves task-specific details, and uses a lightweight ViT-based DINOv2 backbone optimized for real-time deployment.
Result: M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and performs better on Cityscapes, while remaining computationally efficient on laptop hardware. Its practicality is also validated on real-world data.
Conclusion: M2H strikes a good balance between accuracy and efficiency in multi-task learning. It suits real-time monocular spatial perception systems, supports 3D scene graph construction in dynamic environments, and has good potential for practical deployment.
Abstract: Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
[245] Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Vaggelis Dorovatas,Soroush Seifi,Gunshi Gupta,Rahaf Aljundi
Main category: cs.CV
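TL;DR: Proposes a training-free method that improves the efficiency and effectiveness of video large language models on streaming video by selecting key visual tokens, recurrently reusing past information, and answering questions from lightweight captions.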
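The token selection step can be sketched as a top-k filter over per-token attention scores, with the survivors appended to a running memory. This is a simplified illustration with hypothetical names: the attention scores are passed in as plain numbers, whereas in the described method they come from the LLM as it processes each clip together with previously selected tokens, and that coupling is not modeled here. The default `keep_ratio` of 0.05 mirrors the roughly 5% of tokens the entry says are retained.

```python
def select_tokens(tokens, attention, keep_ratio=0.05):
    """Keep the most-attended tokens of one clip, preserving their original order.

    tokens: list of visual tokens for the clip (any objects).
    attention: one attention score per token (assumed to be supplied by the LLM).
    """
    k = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def update_memory(memory, tokens, attention, keep_ratio=0.05):
    """Append the selected tokens of a new clip to the running memory."""
    return memory + select_tokens(tokens, attention, keep_ratio)
```

Sorting the chosen indices before gathering keeps the selected tokens in their original temporal and spatial order, which matters when they are later fed back to the model as context.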
Details
Motivation: Standard video large language models consume heavy compute and respond slowly when processing long streaming videos, so an efficient and effective online processing method is needed.
Method: An attention mechanism selects the visual tokens that contribute to understanding, keeping about 5% of them; a recurrent mechanism integrates previously selected tokens to preserve temporal coherence; and lightweight caption-based question answering produces the answers.
Result: The method reaches state-of-the-art performance on streaming video benchmarks, maintaining high question-answering accuracy while substantially reducing computational cost.
Conclusion: The method works with existing Video-LLMs without training, balances efficiency and effectiveness well, and suits long streaming video understanding.
Abstract: Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
[246] Beyond Real Faces: Synthetic Datasets Can Achieve Reliable Recognition Performance without Privacy Compromise
Paweł Borsukiewicz,Fadi Boutros,Iyiola E. Olatunji,Charles Beumier,Wendkûuni C. Ouedraogo,Jacques Klein,Tegawendé F. Bissyandé
Main category: cs.CV
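TL;DR: This study systematically evaluates the viability of synthetic facial recognition datasets. Experiments on 25 synthetic datasets assess privacy protection, accuracy, and bias mitigation, and find that the best synthetic datasets (such as VariFace and VIGFace) already exceed traditional real datasets in recognition accuracy, showing that synthetic data is a scientifically viable and ethical alternative for facial recognition research.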
Details
Motivation: High-accuracy facial recognition depends on large-scale real face data collected without consent, which raises privacy violations and legal risks (for example under GDPR). An alternative that protects privacy while maintaining performance is needed. Synthetic face data is promising, but its effectiveness lacks systematic empirical support, so its ability to replace real data must be evaluated thoroughly.
Method: Combines a systematic literature review with empirical experiments to identify and analyze 25 synthetic facial recognition datasets from 2018 to 2025. Seven key evaluation dimensions are proposed: identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. Experiments use more than 10 million synthetic images, and performance is compared on five standard benchmarks.
Result: The best synthetic datasets, VariFace and VIGFace, reach recognition accuracies of 95.67% and 94.91%, exceeding CASIA-WebFace (94.70%). The publicly available Vec2Face and CemiFace come close (93.52% and 93.22%). Synthetic data preserves intra-class variability and identity separability, and although it still inherits limited bias, generation parameters allow that bias to be controlled actively.
Conclusion: Synthetic facial data now matches or exceeds real datasets in scientific performance while meeting privacy and ethical requirements, and should become the preferred alternative for facial recognition research.
Abstract: The deployment of facial recognition systems has created an ethical dilemma: achieving high accuracy requires massive datasets of real faces collected without consent, leading to dataset retractions and potential legal liabilities under regulations like GDPR. While synthetic facial data presents a promising privacy-preserving alternative, the field lacks comprehensive empirical evidence of its viability. This study addresses this critical gap through extensive evaluation of synthetic facial recognition datasets. We present a systematic literature review identifying 25 synthetic facial recognition datasets (2018-2025), combined with rigorous experimental validation. Our methodology examines seven key requirements for privacy-preserving synthetic data: identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. Through experiments involving over 10 million synthetic samples, extended by a comparison of results reported on five standard benchmarks, we provide the first comprehensive empirical assessment of synthetic data's capability to replace real datasets. Best-performing synthetic datasets (VariFace, VIGFace) achieve recognition accuracies of 95.67% and 94.91% respectively, surpassing established real datasets including CASIA-WebFace (94.70%). While those images remain private, publicly available alternatives Vec2Face (93.52%) and CemiFace (93.22%) come close behind. Our findings reveal that they ensure proper intra-class variability while maintaining identity separability. Demographic bias analysis shows that, even though synthetic data inherits limited biases, it offers unprecedented control for bias mitigation through generation parameters. These results establish synthetic facial data as a scientifically viable and ethically imperative alternative for facial recognition research.
[247] Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing
Yintao Zhou,Wei Huang,Zhengyu Li,Jing Huang,Meng Pang
Main category: cs.CV
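TL;DR: Proposes a Parkinson's disease (PD) severity diagnosis method based on multi-expression feature fusion and adaptive class balancing, which effectively improves diagnostic performance.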
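This entry's abstract describes the adaptive class balancing strategy only as dynamically adjusting each training sample's contribution by class distribution and classification difficulty. A minimal sketch of one way such a weight could be formed, assuming an inverse-frequency class term multiplied by a focal-style difficulty term; the function name, the `gamma` parameter, and the formula itself are illustrative assumptions, not the authors' formulation:

```python
from collections import Counter

def sample_weights(labels, true_class_probs, gamma=2.0):
    """Hypothetical per-sample weight: inverse class frequency x difficulty.

    labels: class index of each training sample.
    true_class_probs: the model's current probability for each sample's true class.
    gamma: assumed exponent controlling the emphasis on hard samples.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    weights = []
    for y, p in zip(labels, true_class_probs):
        class_term = n / (k * counts[y])      # rarer classes get a larger weight
        difficulty_term = (1.0 - p) ** gamma  # poorly classified samples get a larger weight
        weights.append(class_term * difficulty_term)
    return weights

# Toy data: three samples of class 0, one sample of class 1.
labels = [0, 0, 0, 1]
probs = [0.9, 0.9, 0.5, 0.5]
w = sample_weights(labels, probs)
```

In this toy case the single class-1 sample receives the largest weight, and within class 0 the hard sample (probability 0.5) outweighs the easy ones. Recomputing the weights each epoch from the current predictions is what would make such a scheme dynamic.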
Details
Motivation: Existing facial expression-based PD diagnosis methods mostly rely on a single expression type and ignore class imbalance, leading to high misdiagnosis rates, and most are limited to binary classification rather than severity grading.
Method: Fuses multi-expression features through an attention mechanism and applies an adaptive class balancing strategy that dynamically adjusts training sample weights to mitigate class imbalance.
Result: Experiments show that the method performs well on PD severity diagnosis and that both attention-based fusion and adaptive balancing improve model performance.
Conclusion: The proposed method makes effective use of multi-expression information and handles class imbalance, significantly improving the accuracy of PD severity diagnosis.
Abstract: Parkinson's disease (PD) severity diagnosis is crucial for detecting potential patients early and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients' "masked face" symptom and has gained growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on a single type of expression, which can lead to misdiagnosis, and ignore the class imbalance across different PD stages, which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.
[248] Closed-Loop Transfer for Weakly-supervised Affordance Grounding
Jiajin Tang,Zhengxuan Wei,Ge Zheng,Sibei Yang
Main category: cs.CV
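TL;DR: This paper proposes LoopTrans, a novel closed-loop framework that improves weakly-supervised affordance grounding through bidirectional knowledge transfer (from exocentric to egocentric views and back).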
Details
Motivation: Existing methods extract affordance knowledge only from exocentric images and transfer it one-way to egocentric images, which limits their use in complex interaction scenarios.
Method: Proposes the LoopTrans framework, which includes unified cross-modal localization and denoising knowledge distillation, so that knowledge in the two views is mutually enhanced and transferred.
Result: Experiments show that LoopTrans achieves consistent improvements on all metrics across image and video benchmarks and can even handle challenging scenes where the interaction region is fully occluded by the human body.
Conclusion: Through closed-loop knowledge transfer, LoopTrans significantly improves weakly-supervised affordance grounding and makes the model more applicable to complex real-world scenes.
Abstract: Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.
[249] Monitoring Horses in Stalls: From Object to Event Detection
Dmitrii Galimzianov,Viacheslav Vyshegorodtsev,Ivan Nezhivykh
Main category: cs.CV
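TL;DR: Proposes a vision-based horse behavior monitoring system that uses YOLOv11 and BoT-SORT to detect and track horses and people, and infers event states from trajectories and spatial relations.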
Details
Motivation: Traditional horse behavior monitoring relies on manual observation, which is time-consuming and labor-intensive and makes it hard to detect health and welfare issues continuously and effectively.
Method: Uses YOLOv11 for object detection and BoT-SORT for multi-object tracking, infers five event states from object trajectories and spatial relations within the stall, and uses CLIP and GroundingDINO to assist in building the annotated dataset.
Result: The system performs reliably on horse-related events and handles camera blind spots effectively, but is limited in detecting people because of insufficient data.
Conclusion: The system offers a feasible approach to real-time behavior monitoring in stable environments and helps improve horse welfare and the efficiency of stable management.
Abstract: Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera's blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.
[250] DeepDetect: Learning All-in-One Dense Keypoints
Shaharyar Ahmed Khan Tareen,Filza Khan Tareen
Main category: cs.CV
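TL;DR: This paper proposes DeepDetect, an intelligent all-in-one dense keypoint detector that builds ground-truth masks by fusing the outputs of multiple traditional detectors and trains a lightweight ESPNet model on them, outperforming existing methods in keypoint density, repeatability, and matching accuracy.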
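The mask-fusion step summarized above can be pictured as a union of detector outputs. The sketch below is a pure-Python illustration under that assumption: the entry does not state the fusion rule, and the function names, the (row, col) keypoint format, and the 0/1 edge-map format are hypothetical choices made here for clarity.

```python
def fuse_masks(shape, keypoint_sets, edge_maps):
    """Union of detector outputs into one binary mask (assumed fusion rule).

    shape: (height, width) of the image.
    keypoint_sets: one list of (row, col) keypoints per keypoint detector.
    edge_maps: one 2D list of 0/1 values per edge detector.
    """
    h, w = shape
    mask = [[0] * w for _ in range(h)]
    for keypoints in keypoint_sets:
        for r, c in keypoints:
            if 0 <= r < h and 0 <= c < w:  # ignore detections outside the image
                mask[r][c] = 1
    for edges in edge_maps:
        for r in range(h):
            for c in range(w):
                if edges[r][c]:
                    mask[r][c] = 1
    return mask

def mask_density(mask):
    """Fraction of pixels marked in the mask (illustrative helper only)."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total

# Two toy "keypoint detectors" and one toy "edge detector" on a 3x4 image.
kp_sets = [[(0, 0), (1, 2)], [(1, 2), (2, 3), (5, 5)]]
edges = [[[0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]]
mask = fuse_masks((3, 4), kp_sets, edges)
```

A mask of this kind would then serve as the training label for the segmentation-style network. `mask_density` is not claimed to be the paper's keypoint-density metric.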
Details
Motivation: Traditional and learning-based keypoint detectors are sensitive to illumination changes, have low keypoint density and repeatability, adapt poorly, and lack semantic understanding, so they struggle to prioritize visually important regions.
Method: Fuses the outputs of 7 keypoint detectors and 2 edge detectors into ground-truth masks and uses these masks as labels to train a lightweight, efficient ESPNet model for semantically aware dense keypoint detection.
Result: Experiments on the Oxford Affine Covariant Regions dataset show that DeepDetect outperforms other detectors in average keypoint density (0.5143), average repeatability (0.9582), and number of correct matches (59,003).
Conclusion: By combining the strengths of classical detectors with deep learning, DeepDetect achieves robust, dense, and semantically aware keypoint detection under complex and visually degraded conditions, with a significant performance gain.
Abstract: Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from-motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning-based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model, ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
[251] Leveraging AV1 motion vectors for Fast and Dense Feature Matching
Julien Zouein,Hossein Javidnia,François Pitié,Anil Kokaram
Main category: cs.CV
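TL;DR: Repurposes AV1 motion vectors to generate dense sub-pixel correspondences and short tracks, providing a computationally efficient, low-resource front end for vision tasks such as SfM.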
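A rough picture of how motion vectors become correspondences and how a track might be filtered by cosine consistency is sketched below. Everything here is an assumption made for illustration: the 1/8-pixel storage unit, the sign convention of the vector, the `min_cos` threshold, and the rule of comparing consecutive displacements are not specified in this entry.

```python
import math

MV_UNITS_PER_PIXEL = 8  # assumption: vectors are stored in 1/8-pixel units

def mv_to_match(block_center, mv):
    """Turn one block's motion vector into a sub-pixel correspondence.

    block_center: (x, y) in pixels; mv: (dx, dy) in stored units.
    The direction of the vector (current-to-reference) is an assumed convention.
    """
    x, y = block_center
    dx, dy = mv[0] / MV_UNITS_PER_PIXEL, mv[1] / MV_UNITS_PER_PIXEL
    return (x, y), (x + dx, y + dy)

def cosine(u, v):
    """Cosine of the angle between two 2D vectors; 0.0 if either is zero."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

def is_consistent(track, min_cos=0.9):
    """Keep a track only if consecutive displacements point in similar directions."""
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(track, track[1:])]
    return all(cosine(u, v) >= min_cos for u, v in zip(steps, steps[1:]))

match = mv_to_match((16, 16), (12, -4))
smooth = is_consistent([(0, 0), (1, 0), (2, 0.1), (3, 0.1)])
zigzag = is_consistent([(0, 0), (1, 0), (0, 0)])
```

Under these assumptions a smoothly moving track passes and a track that reverses direction is dropped. Note that this rule also rejects tracks with a zero displacement between frames, which a real front end would probably treat separately.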
Details
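Motivation: Explores the potential of compressed-domain information (such as AV1 motion vectors) for visual matching and 3D reconstruction, to reduce the high CPU cost of traditional feature extraction.
Method: Reuses the motion vectors from AV1 encoding to generate dense sub-pixel correspondences and filters short tracks by cosine consistency; processes video frames directly in the compressed domain, avoiding decoding overhead.
Result: On a 117-frame clip, all images are registered and 0.46-0.62 million points are reconstructed with a reprojection error of 0.51-0.53 pixels; compared with traditional methods such as SIFT, CPU usage is significantly lower while matches are denser.
Conclusion: Compressed-domain correspondences are a practical and efficient front end that scales well in full vision pipelines and suits large-scale use in resource-constrained settings.
Abstract: We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
[252] Rethinking Nighttime Image Deraining via Learnable Color Space Transformation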
Qiyuan Guan,Xiang Chen,Guiyue Jin,Jiyu Jin,Shumin Fan,Tianyu Song,Jinshan Pan
Main category: cs.CV
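TL;DR: This paper proposes a new approach to nighttime image deraining, comprising a high-quality benchmark dataset, HQ-NightRain, and an effective color space transformation network, CST-Net, which improves deraining through a learnable color space converter and implicit illumination guidance.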
Details
Motivation: Nighttime image deraining faces major challenges due to the inherent complexity of nighttime scenes and the lack of high-quality datasets that accurately represent the coupling between rain and illumination.
Method: Proposes a learnable color space converter (CSC) to better facilitate rain removal in the Y channel and introduces implicit illumination guidance to capture illumination information that guides nighttime deraining.
Result: Extensive experiments demonstrate the value of the proposed dataset and the effectiveness of the method.
Conclusion: The proposed CST-Net, together with the HQ-NightRain dataset, effectively improves the quality of nighttime image deraining.
Abstract: Compared to daytime image deraining, nighttime image deraining poses significant challenges due to inherent complexities of nighttime scenarios and the lack of high-quality datasets that accurately represent the coupling effect between rain and illumination. In this paper, we rethink the task of nighttime image deraining and contribute a new high-quality benchmark, HQ-NightRain, which offers higher harmony and realism compared to existing datasets. In addition, we develop an effective Color Space Transformation Network (CST-Net) for better removing complex rain from nighttime scenes. Specifically, we propose a learnable color space converter (CSC) to better facilitate rain removal in the Y channel, as nighttime rain is more pronounced in the Y channel compared to the RGB color space. To capture illumination information for guiding nighttime deraining, implicit illumination guidance is introduced, enabling the learned features to improve the model's robustness in complex scenarios. Extensive experiments show the value of our dataset and the effectiveness of our method. The source code and datasets are available at https://github.com/guanqiyuan/CST-Net.
[253] Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS
Feng Zhou,Wenkai Guo,Pu Cao,Zhicheng Zhang,Jianqin Yin
Main category: cs.CV
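TL;DR: This paper proposes an improved initialization method for sparse-view 3D Gaussian Splatting that significantly improves novel-view rendering quality through frequency-aware SfM, 3DGS self-initialization, and point-cloud regularization.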
Details
Motivation: Sparse-view 3D Gaussian Splatting tends to overfit the training views, causing artifacts such as blurring in novel views. Existing methods rely on training-time regularization or on the quality of the initial point cloud, but the authors find that initialization is the key factor that sets the performance ceiling.
Method: Building on the seed points provided by SfM, proposes three designs: (i) frequency-aware SfM, which improves coverage of low-texture regions through low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization, which uses photometric supervision to generate additional Gaussian centers that compensate for SfM-sparse regions; (iii) point-cloud regularization, which enforces multi-view consistency and uniform spatial distribution through geometric/visibility priors.
Result: Experiments on LLFF and Mip-NeRF360 show that the method consistently improves rendering performance in sparse-view settings and outperforms existing initialization strategies.
Conclusion: Initialization is the decisive factor for sparse-view 3DGS performance; the proposed initialization strategy effectively compensates for SfM's weaknesses under sparse views and provides a more reliable starting point for high-quality novel view synthesis.
Abstract: Sparse-view 3D Gaussian Splatting (3DGS) often overfits to the training views, leading to artifacts like blurring in novel view rendering. Prior work addresses it either by enhancing the initialization (i.e., the point cloud from Structure-from-Motion (SfM)) or by adding training-time constraints (regularization) to the 3DGS optimization. Yet our controlled ablations reveal that initialization is the decisive factor: it determines the attainable performance band in sparse-view 3DGS, while training-time constraints yield only modest within-band improvements at extra cost. Given initialization's primacy, we focus our design there. Although SfM performs poorly under sparse views due to its reliance on feature matching, it still provides reliable seed points. Thus, building on SfM, our effort aims to supplement the regions it fails to cover as comprehensively as possible. Specifically, we design: (i) frequency-aware SfM that improves low-texture coverage via low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization that lifts photometric supervision into additional points, compensating SfM-sparse regions with learned Gaussian centers; and (iii) point-cloud regularization that enforces multi-view consistency and uniform spatial coverage through simple geometric/visibility priors, yielding a clean and reliable point cloud. Our experiments on LLFF and Mip-NeRF360 demonstrate consistent gains in sparse-view settings, establishing our approach as a stronger initialization strategy. Code is available at https://github.com/zss171999645/ItG-GS.
[254] SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
Chenxu Dang,Haiyan Liu,Guangjun Bao,Pei An,Xinyue Tang,Jie Ma,Bingchuan Sun,Yan Wang
Main category: cs.CV
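TL;DR: This paper proposes SparseWorld, a new 4D occupancy world model based on sparse, dynamic queries. Its Range-Adaptive Perception and State-Conditioned Forecasting modules enable flexible, adaptive, and efficient environment modeling, reaching state-of-the-art performance on perception, forecasting, and planning tasks.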
Details
Motivation: Existing occupancy world models rely on static, fixed embeddings or grids, which limits perceptual flexibility, and grid-based "in-place classification" struggles to match the dynamic, continuous nature of real scenes.
Method: Proposes SparseWorld, comprising a Range-Adaptive Perception module (queries are modulated by the ego-vehicle state and enriched with temporal-spatial associations), a State-Conditioned Forecasting module (which replaces classification-based forecasting with a regression-guided formulation), and a Temporal-Aware Self-Scheduling training strategy.
Result: Achieves state-of-the-art performance on multiple tasks; visualizations and ablation studies confirm the model's advantages in flexibility, adaptability, and efficiency.
Conclusion: Through its dynamic sparse query mechanism, SparseWorld improves the flexibility and dynamic alignment of 4D occupancy world models, offering a better world-modeling approach for applications such as autonomous driving.
Abstract: Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their "in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios. In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, we specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at https://github.com/MSunDYY/SparseWorld.
[255] Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment
Muhammad Umer Ramzan,Ali Zia,Abdelwahed Khamis,Noman Ali,Usman Ali,Wei Xiang
Main category: cs.CV
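TL;DR: Proposes AutoSOD, an end-to-end unsupervised salient object detection method that generates high-quality pseudo-masks with an adapted Prototypical Optimal Transport network (POTNet) and substantially outperforms existing unsupervised and weakly supervised methods on multiple benchmarks.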
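This entry's abstract describes a split-fuse-transport design: pixels are split by entropy, each group is clustered into prototypes, and the two prototype sets are aligned by optimal transport. The sketch below covers only the split and the alignment, with the clustering steps omitted. It assumes Shannon entropy of per-pixel class probabilities, a fixed threshold, uniform prototype weights, and a plain Sinkhorn iteration as the solver; the entry does not say which OT solver or threshold the authors use, and all names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one pixel's class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_by_entropy(pixel_probs, threshold):
    """Indices of high-entropy and low-entropy pixels (assumed fixed threshold)."""
    high = [i for i, p in enumerate(pixel_probs) if entropy(p) > threshold]
    low = [i for i, p in enumerate(pixel_probs) if entropy(p) <= threshold]
    return high, low

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized OT plan between two uniformly weighted prototype sets."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m  # uniform marginals (assumption)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Three toy pixels with two-class probabilities, then two prototypes per set.
high, low = split_by_entropy([[0.5, 0.5], [0.99, 0.01], [0.6, 0.4]], threshold=0.3)
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

In the toy cost matrix each prototype is closest to its counterpart with the same index, so nearly all of the transport mass lands on the diagonal of `plan`. In the described pipeline, the high-entropy group would go to spectral clustering and the low-entropy group to k-means before this alignment step.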
Details
Motivation: Aims for salient object detection to approach the accuracy of fully supervised models without any pixel-level annotation, which is possible only when reliable pseudo-masks can be generated.
Method: Proposes POTNet, which uses an entropy-guided dual-clustering head: high-entropy pixels are grouped by spectral clustering and low-entropy pixels by k-means, and the two prototype sets are aligned by optimal transport; the resulting pseudo-masks supervise a MaskFormer-style encoder-decoder, forming the end-to-end AutoSOD framework.
Result: Experiments on five benchmarks show that AutoSOD improves F-measure by up to 26% over existing unsupervised methods and by up to 36% over weakly supervised methods, substantially narrowing the gap to fully supervised models.
Conclusion: By improving prototype learning and the optimal transport mechanism, AutoSOD generates high-quality pseudo-masks and provides an efficient, accurate end-to-end solution for unsupervised salient object detection.
Abstract: Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.
[256] Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
Yuanli Wu,Long Zhang,Yue Du,Bin Li
Main category: cs.CV
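TL;DR: Proposes a rubric-guided pseudo-label prompting framework for zero-shot video summarization that generates high-quality pseudo labels from a small number of ground-truth annotations and uses contextual prompting to improve the coherence and salience of summaries, performing strongly on the SumMe and TVSum datasets.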
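This entry's abstract states that the first and last segments are scored from their descriptions alone, while intermediate segments also receive brief summaries of adjacent scenes. A minimal sketch of that branching when assembling a prompt follows; the prompt wording, the 1-to-10 scale, and the function and argument names are hypothetical, and no LLM call is made.

```python
def build_prompt(descriptions, summaries, i, rubric):
    """Assemble a scoring prompt for scene i (hypothetical template).

    descriptions: one text description per scene.
    summaries: one brief summary per scene, used only as context for neighbours.
    rubric: the scoring rubric text.
    """
    lines = [f"Rubric: {rubric}"]
    is_boundary = i == 0 or i == len(descriptions) - 1
    if not is_boundary:
        lines.append(f"Previous scene summary: {summaries[i - 1]}")
    lines.append(f"Scene {i}: {descriptions[i]}")
    if not is_boundary:
        lines.append(f"Next scene summary: {summaries[i + 1]}")
    lines.append("Score this scene from 1 to 10.")
    return "\n".join(lines)

descs = ["Opening shot of a kitchen.", "A cook chops onions.", "The dish is served."]
sums = ["Kitchen intro.", "Chopping.", "Serving."]
first = build_prompt(descs, sums, 0, "Prefer scenes that advance the story.")
middle = build_prompt(descs, sums, 1, "Prefer scenes that advance the story.")
```

With this layout the boundary prompts contain only the rubric and the scene description, whereas the middle prompt also carries the neighbouring summaries so that narrative progression and redundancy can be judged.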
Details
Motivation: Existing supervised methods depend on large amounts of annotation and generalize poorly, unsupervised methods struggle to capture high-level semantics, and zero-shot methods are sensitive to prompt templates and require dataset-specific normalization, so a more stable, general, training-free video summarization method is needed.
Method: Designs a rubric-guided pseudo-labeling framework that converts a small set of ground-truth annotations into high-confidence pseudo labels and builds structured, dataset-adaptive scoring rubrics; at inference, the first and last segments are scored from their descriptions alone, while intermediate segments are evaluated for narrative coherence and redundancy using the context of adjacent scenes, enabling context-aware prompting without fine-tuning.
Result: Achieves F1 scores of 57.58 and 63.05 on SumMe and TVSum respectively, outperforming unsupervised and previous zero-shot methods and approaching supervised performance.
Conclusion: Rubric-guided pseudo labeling effectively improves the stability and interpretability of large language models in zero-shot video summarization and establishes a general, training-free paradigm for video summarization.
Abstract: With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital. Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues. More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are aggregated into structured, dataset-adaptive scoring rubrics guiding interpretable scene evaluation. During inference, first and last segments are scored based solely on their descriptions, whereas intermediate ones incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy. This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. On SumMe and TVSum, our method achieves F1 scores of 57.58 and 63.05, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance. The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.
[257] MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Yongshun Zhang,Zhongyi Fan,Yonghang Zhang,Zhangzikang Li,Weifeng Chen,Zhongwei Feng,Chaoyue Wang,Peng Hou,Anxiang Zeng
Main category: cs.CV
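TL;DR: This paper proposes an efficient training framework for large-scale video generation models that optimizes four aspects (data processing, model architecture, training strategy, and infrastructure), significantly improving training efficiency and generation performance. The resulting MUG-V 10B model outperforms existing open-source models on e-commerce video generation, and the complete Megatron-Core-based training code and inference pipeline are open-sourced for the first time.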
Details
Motivation: Training large-scale video generation models faces challenges such as cross-modal alignment, long-sequence modeling, and complex spatiotemporal dependencies, and is resource-intensive, so an efficient training framework is needed to improve scalability and practicality.
Method: Systematic optimization in four areas: (i) an efficient data processing pipeline; (ii) a model architecture adapted to video; (iii) a training strategy that includes curriculum-based pretraining and alignment-focused post-training; (iv) high-performance distributed training infrastructure based on Megatron-Core.
Result: MUG-V 10B matches the current state of the art overall and surpasses leading open-source baselines on e-commerce video generation; training efficiency improves significantly, with near-linear multi-node scaling; the complete stack, including model weights, training code, and inference pipelines, is open-sourced.
Conclusion: The work provides an efficient, scalable training framework for large-scale video generation, advances video generation technology, and supports research and applications in related fields through its comprehensive open-source release.
Abstract: In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available on our webpage (https://github.com/Shopee-MUG/MUG-V).
[258] MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation
Yovin Yahathugoda,Davide Prezzi,Piyalitt Ittichaiwong,Vicky Goh,Sebastien Ourselin,Michela Antonelli
Main category: cs.CV
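TL;DR: Proposes MambaX-Net, a semi-supervised dual-scan 3D segmentation network for longitudinal MRI analysis in prostate cancer active surveillance, which uses information from the previous time point to improve segmentation.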
Details
Motivation: Existing deep learning models are mostly trained on single-time-point, expert-annotated data and are ill-suited to longitudinal analysis in active surveillance, where there are multiple time points and expert labels are scarce.
Method: Designs MambaX-Net with a Mamba-enhanced Cross-Attention Module and a Shape Extractor Module, segmenting the current time point using the previous time point's MRI and segmentation mask, and adopts a pseudo-label-based semi-supervised self-training strategy.
Result: On a longitudinal active surveillance dataset, MambaX-Net significantly outperforms U-Net and Transformer-based models, showing superior prostate zone segmentation even with limited, noisy labels.
Conclusion: MambaX-Net effectively addresses cross-time-point segmentation and label scarcity in active surveillance and shows potential for clinical use.
Abstract: Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delineation. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.
[259] WP-CrackNet: A Collaborative Adversarial Learning Framework for End-to-End Weakly-Supervised Road Crack Detection
Nachuan Ma,Zhengfei Song,Qiang Hu,Xiaoyu Tang,Chengxi Zhang,Rui Fan,Lihua Xie
Main category: cs.CV
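TL;DR: This paper proposes WP-CrackNet, a weakly-supervised road crack detection method that achieves pixel-level detection using only image-level labels, significantly reducing reliance on costly pixel-level annotation.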
Details
Motivation: Reduce reliance on costly pixel-level annotation and make infrastructure maintenance in smart cities more scalable.
Method: WP-CrackNet comprises three modules, a classifier, a reconstructor, and a detector, trained jointly through adversarial learning and pseudo-label generation; introduces a path-aware attention module (PAAM) to fuse high-level and low-level features and a center-enhanced CAM consistency module (CECCM) to improve pseudo-label quality.
Result: Experiments on three self-built image-level datasets show that WP-CrackNet is comparable to fully supervised methods and significantly outperforms existing weakly-supervised methods.
Conclusion: WP-CrackNet provides an efficient, scalable weakly-supervised solution for road crack detection and helps advance automated inspection in intelligent transportation systems.
Abstract: Road crack detection is essential for intelligent infrastructure maintenance in smart cities. To reduce reliance on costly pixel-level annotations, we propose WP-CrackNet, an end-to-end weakly-supervised method that trains with only image-level labels for pixel-wise crack detection. WP-CrackNet integrates three components: a classifier generating class activation maps (CAMs), a reconstructor measuring feature inferability, and a detector producing pixel-wise road crack detection results. During training, the classifier and reconstructor alternate in adversarial learning to encourage crack CAMs to cover complete crack regions, while the detector learns from pseudo labels derived from post-processed crack CAMs. This mutual feedback among the three components improves learning stability and detection accuracy. To further boost detection performance, we design a path-aware attention module (PAAM) that fuses high-level semantics from the classifier with low-level structural cues from the reconstructor by modeling spatial and channel-wise dependencies. Additionally, a center-enhanced CAM consistency module (CECCM) is proposed to refine crack CAMs using center Gaussian weighting and consistency constraints, enabling better pseudo-label generation. We create three image-level datasets and extensive experiments show that WP-CrackNet achieves comparable results to supervised methods and outperforms existing weakly-supervised methods, significantly advancing scalable road inspection. The source code package and datasets are available at https://mias.group/WP-CrackNet/.
[260] PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
Kaichen Zhou,Yuhan Wang,Grace Chen,Xinhai Chang,Gaspard Beaudouin,Fangneng Zhan,Paul Pu Liang,Mengyu Wang
Main category: cs.CV
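TL;DR: Proposes PAGE-4D, a feed-forward model extending VGGT for 4D reconstruction of dynamic scenes, performing camera pose estimation, depth prediction, and point cloud reconstruction without post-processing.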
Details
Motivation: Existing 3D feed-forward models are trained on static datasets and struggle with real scenes containing complex dynamic elements such as moving humans or deformable objects.
Method: Extends VGGT with a dynamics-aware aggregator that predicts a dynamics-aware mask to disentangle static and dynamic information, suppressing dynamic regions for pose estimation and strengthening dynamic modeling for geometry reconstruction.
Result: In dynamic scenes, PAGE-4D outperforms the original VGGT on camera pose estimation, monocular and video depth estimation, and dense point cloud reconstruction.
Conclusion: PAGE-4D effectively resolves the conflict between pose estimation and geometry reconstruction in multi-task 4D reconstruction, enabling end-to-end dynamic scene reconstruction without post-processing.
Abstract: Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction -- all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask -- suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
[261] Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset
Chuhong Wang,Hua Li,Chongyi Li,Huazhong Liu,Xiongxin Tang,Sam Kwong
Main category: cs.CV
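TL;DR: This paper presents UCIS4K, the first underwater camouflaged instance segmentation dataset, and designs UCIS-SAM, a network based on the Segment Anything Model with channel balance optimization, frequency domain true integration, and multi-scale feature frequency aggregation modules, significantly improving segmentation of underwater camouflaged objects.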
Details
Motivation: Underwater environments suffer from color distortion, low contrast, and blur, and existing camouflaged segmentation methods are mainly based on terrestrial datasets, so they cannot handle accurate segmentation of underwater camouflaged objects effectively.
Method: Proposes UCIS-SAM with three key modules: the Channel Balance Optimization Module (CBOM) enhances underwater feature learning; the Frequency Domain True Integration Module (FDTIM) emphasizes intrinsic object features and suppresses camouflage interference; the Multi-scale Feature Frequency Aggregation Module (MFFAM) strengthens low-contrast edge information.
Result: Experiments on the UCIS4K dataset and public benchmarks show that UCIS-SAM outperforms current state-of-the-art methods and significantly improves underwater camouflaged instance segmentation.
Conclusion: The UCIS4K dataset and the UCIS-SAM model provide an effective solution for underwater camouflaged instance segmentation and advance underwater vision tasks.
Abstract: With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model's limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model's ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches.
[262] ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling
Shuyuan Zhang,Chenhan Jiang,Zuoou Li,Jiankang Deng
Main category: cs.CV
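TL;DR: This paper proposes ShapeCraft, a new multi-agent framework for generating structured, interactive 3D assets from natural language. It uses a Graph-based Procedural Shape (GPS) representation that decomposes complex language into a graph of sub-tasks, improving the ability of large models to understand and model spatial relationships and semantic details.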
Details
Motivation: Existing text-to-3D generation methods often produce unstructured meshes with poor interactivity, making them hard to fit into artistic workflows, so a method that generates structured, editable, semantically rich 3D assets is needed.
Method: Proposes the Graph-based Procedural Shape (GPS) representation combined with a multi-agent framework in which LLM agents hierarchically parse user input, initialize GPS, and iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets.
Result: Experiments show that ShapeCraft outperforms existing LLM-based generation methods in geometric accuracy and semantic richness, and supports animation and user-customized editing, indicating good potential for interactive applications.
Conclusion: Through procedural representation and multi-agent collaboration, ShapeCraft improves the structure and interactivity of text-to-3D generation and suits practical uses such as artistic creation.
Abstract: 3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation. At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.
[263] Integrating BIM and UAV-based photogrammetry for Automated 3D Structure Model Segmentation
Siqi Chen,Shanyue Guan
Main category: cs.CV
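TL;DR: Proposes a machine learning-based framework that uses UAV-scanned point clouds and BIM-generated synthetic data to automatically segment structural components in 3D infrastructure models, significantly improving accuracy and efficiency.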
Details
Motivation: Manually labeling structural components in 3D models is time-consuming and error-prone; automated methods are needed to improve the efficiency and accuracy of structural health monitoring.
Method: Combines real UAV-scanned point cloud data with synthetic data generated from Building Information Modeling (BIM) to build a machine learning framework for automated segmentation of 3D point clouds.
Result: Validation on a railroad track dataset shows that the method identifies and segments major components such as rails and crossties with high accuracy, and that smaller datasets supplemented with BIM data significantly shorten training time while maintaining good accuracy.
Conclusion: The automated framework improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.
Abstract: The advancement of UAV technology has enabled efficient, non-contact structural health monitoring. Combined with photogrammetry, UAVs can capture high-resolution scans and reconstruct detailed 3D models of infrastructure. However, a key challenge remains in segmenting specific structural components from these models, a process traditionally reliant on time-consuming and error-prone manual labeling. To address this issue, we propose a machine learning-based framework for automated segmentation of 3D point clouds. Our approach uses the complementary strengths of real-world UAV-scanned point clouds and synthetic data generated from Building Information Modeling (BIM) to overcome the limitations associated with manual labeling. Validation on a railroad track dataset demonstrated high accuracy in identifying and segmenting major components such as rails and crossties. Moreover, by using smaller-scale datasets supplemented with BIM data, the framework significantly reduced training time while maintaining reasonable segmentation accuracy. This automated approach improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.
[264] One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection
Jia Guo,Shuai Lu,Lei Fan,Zelin Li,Donglin Di,Yang Song,Weihang Zhang,Wenbing Zhu,Hong Yan,Fang Chen,Huiqi Li,Hongen Liao
Main category: cs.CV
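TL;DR: Dinomaly2 is a unified unsupervised anomaly detection framework whose minimalist design achieves superior performance across diverse settings such as multi-class, multi-modal, and few-shot detection, significantly narrowing the performance gap between multi-class and single-class models.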
Details
Motivation: Existing unsupervised anomaly detection methods underperform in the multi-class setting, and the field has fragmented into specialized methods for specific scenarios, leaving no general, unified solution.
Method: Proposes Dinomaly2, which follows a "less is more" philosophy and orchestrates five simple components within a standard reconstruction-based framework, achieving high performance with a minimalist approach that extends naturally to multiple tasks and modalities.
Result: Dinomaly2's superiority is validated on 12 benchmarks, reaching SOTA performance in multi-class, multi-view, RGB-3D, RGB-IR, and few-shot settings; for example, it achieves 99.9% and 99.3% image-level AUROC on MVTec-AD and VisA respectively, and surpasses previous full-shot models using only 8 normal samples.
Conclusion: Through minimalist design, Dinomaly2 achieves true universality, combining computational scalability with broad applicability, and serves as a unified solution for real-world full-spectrum anomaly detection.
Abstract: Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the "less is more" philosophy, we demonstrate that the orchestration of five simple elements achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2's full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations.
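For readers unfamiliar with the metric quoted above, image-level AUROC can be read as the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one, with ties counted as one half. The following is a minimal sketch of that definition in plain Python; the function name `image_auroc` and the toy scores are illustrative assumptions and are not taken from the Dinomaly2 code.

```python
def image_auroc(scores, labels):
    """Pairwise (Mann-Whitney) form of AUROC.

    scores: one anomaly score per image (higher means more anomalous)
    labels: 1 for anomalous images, 0 for normal images
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomalous image ranked above the normal one
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (made up): one normal image (0.8) outranks two anomalous ones.
scores = [0.1, 0.2, 0.8, 0.3, 0.9, 0.7]
labels = [0, 0, 0, 1, 1, 1]
print(image_auroc(scores, labels))
```

The quadratic loop is chosen for readability; a rank-based implementation gives the same value and scales better on large test sets.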
Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
[265] CaMiT: A Time-Aware Car Model Dataset for Classification and Generation
Frédéric LIN,Biruk Abere Ambaw,Adrian Popescu,Hejer Ammar,Romaric Audigier,Hervé Le Borgne
Main category: cs.CV
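TL;DR: This paper introduces CaMiT, a new dataset for studying fine-grained visual recognition and generation of car models as they evolve over time, and proposes a time-incremental learning setting to improve temporal robustness.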
Details
Motivation: AI systems need to adapt to changing visual environments, especially where object appearance evolves over time. Existing models degrade when generalizing across time, so dedicated datasets and learning methods are needed.
Method: Introduces the CaMiT dataset with 787K labeled samples (2007-2023) and 5.1M unlabeled samples (2005-2023); designs a time-incremental classification task; evaluates two strategies, time-incremental pretraining (updating the backbone) and time-incremental classifier learning (updating only the final layer); and explores time-aware image generation using temporal metadata.
Result: With static in-domain pretraining, models match large-scale generalist models while being more efficient; both time-incremental strategies improve robustness to temporal change; the time-aware generative model produces more realistic results.
Conclusion: CaMiT provides a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation, advancing continual learning research for dynamic environments.
Abstract: AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.
[266] Self-supervised Pre-training for Mapping of Archaeological Stone Wall in Historic Landscapes Using High-Resolution DEM Derivatives
Zexian Huang,Mashnoon Islam,Brian Armstrong,Kourosh Khoshelham,Martin Tomko
Main category: cs.CV
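TL;DR: This paper proposes DINO-CV, a self-supervised segmentation framework that uses high-resolution LiDAR-derived digital elevation models (DEMs) to automatically map low-lying dry-stone walls, addressing vegetation occlusion and scarce labeled data, with good results in the Budj Bim area.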
Details
Motivation: Dry-stone walls have significant heritage and environmental value, but many walls in Australia remain unidentified because they are hard to access and manual mapping is costly; an efficient, scalable automated mapping method is needed.
Method: Proposes the DINO-CV framework, which uses a knowledge-distillation-based self-supervised cross-view pre-training strategy to learn invariant visual and geometric features from multiple DEM derivatives, supports various vision backbones, and segments dry-stone walls on high-resolution DEMs.
Result: Achieves 68.6% mIoU on test areas and retains 63.8% mIoU when fine-tuned with only 10% of the labeled data, identifying one of Australia's densest collections of colonial-era dry-stone walls.
Conclusion: DINO-CV demonstrates the feasibility and potential of self-supervised learning on high-resolution DEMs for automated dry-stone wall mapping in vegetated heritage environments with scarce annotations.
Abstract: Dry-stone walls hold significant heritage and environmental value. Mapping these structures is essential for ecosystem preservation and wildfire management in Australia. Yet, many walls remain unidentified due to their inaccessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable solution, but two major challenges persist: (1) visual occlusion of low-lying walls by dense vegetation, and (2) limited labeled data for supervised training. We propose DINO-CV, a segmentation framework for automatic mapping of low-lying dry-stone walls using high-resolution Airborne LiDAR-derived digital elevation models (DEMs). DEMs overcome visual occlusion by capturing terrain structures hidden beneath vegetation, enabling analysis of structural rather than spectral cues. DINO-CV introduces a self-supervised cross-view pre-training strategy based on knowledge distillation to mitigate data scarcity. It learns invariant visual and geometric representations across multiple DEM derivatives, supporting various vision backbones including ResNet, Wide ResNet, and Vision Transformers. Applied to the UNESCO World Heritage cultural landscape of Budj Bim, Victoria, the method identifies one of Australia's densest collections of colonial dry-stone walls beyond Indigenous heritage contexts. DINO-CV achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data.
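As a reminder of what the mIoU figures above measure, the sketch below computes mean Intersection over Union under the common per-class definition. It is an illustration only: the abstract does not say how DINO-CV aggregates IoU (per tile or over the whole test area), and the function name `mean_iou` and the toy masks are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: flat sequences of integer class ids, one per pixel.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class id below num_classes occurs in the masks")
    return sum(ious) / len(ious)


# Toy masks (made up): class 1 = wall, class 0 = background.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(mean_iou(pred, target, num_classes=2))
```

Each class here has an intersection of 2 pixels and a union of 4, so both per-class IoU values are 0.5.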
These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for automated dry-stone wall mapping in vegetated and heritage-rich environments with scarce annotations.
[267] Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
Sébastien Thuau,Siba Haidar,Ayush Bajracharya,Rachid Chelouah
Main category: cs.CV
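TL;DR: This paper compares two energy-frugal federated learning approaches to violence detection: zero-shot use and fine-tuning of vision-language models, and personalized training of a compact 3D convolutional neural network. Both exceed 90% accuracy; CNN3D slightly outperforms LoRA-tuned VLMs in energy efficiency and performance, while VLMs are stronger at contextual reasoning. The study proposes a hybrid model that combines lightweight CNNs with selective VLM use, providing a reproducible baseline for resource-aware video surveillance AI.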
Details
Motivation: To achieve efficient, sustainable violence detection in resource-constrained, non-IID environments by exploring low-energy, high-performance federated learning schemes.
Method: Two strategies are used: (i) zero-shot use and federated LoRA fine-tuning of vision-language models such as LLaVA-7B; (ii) personalized training of a 65.8M-parameter 3D CNN, with accuracy, calibration, and energy consumption evaluated in a federated setting.
Result: Both approaches exceed 90% accuracy; CNN3D slightly outperforms the LoRA-tuned VLM in ROC AUC and log loss while using less energy; the VLM performs better at multimodal reasoning and contextual understanding; the study also quantifies energy use and carbon emissions during training and inference.
Conclusion: Proposes a hybrid framework that uses lightweight CNNs for routine classification and selectively activates VLMs for complex or descriptive scenarios, balancing efficiency with intelligent reasoning and offering a sustainable, reproducible solution for green AI-driven video surveillance.
Abstract: We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation (LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
[268] 4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
Ling Liu,Jun Tian,Li Yi
Main category: cs.CV
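TL;DR: This paper proposes 4DSegStreamer, a dual-thread framework for streaming 4D panoptic segmentation that processes consecutive frames in dynamic environments in real time, with high efficiency and robustness.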
Details
Motivation: Highly dynamic environments, such as dense crowd evacuation and autonomous driving in complex scenes, require fine-grained, real-time perception within a limited time budget; existing methods struggle to meet the demands of streaming processing at high frame rates.
Method: Proposes a Dual-Thread System with a predictive thread and an inference thread: the predictive thread uses historical motion and geometric information to extract features and forecast future dynamics; the inference thread ensures timely predictions by aligning with memory and compensating for ego-motion and dynamic object movement. The framework can be integrated into existing 3D/4D segmentation methods.
Result: The method is validated on the HOI4D, SemanticKITTI, and nuScenes datasets, showing superior robustness especially under high-FPS conditions and accurate prediction of dynamic objects in complex scenes.
Conclusion: 4DSegStreamer achieves efficient streaming 4D panoptic segmentation with good generality and real-time performance, significantly improving perception in highly dynamic scenes.
Abstract: 4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
[269] PICABench: How Far Are We from Physically Realistic Image Editing?
Yuandong Pu,Le Zhuo,Songhao Han,Jinbo Xing,Kaiwen Zhu,Shuo Cao,Bin Fu,Si Liu,Hongsheng Li,Yu Qiao,Wenlong Zhang,Xi Chen,Yihao Liu
Main category: cs.CV
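TL;DR: This paper proposes PICABench, a new benchmark for evaluating physical realism in image editing across eight dimensions spanning optics, mechanics, and state transitions, together with the PICAEval evaluation protocol and the PICA-100K training dataset, pushing image editing toward physically consistent realism.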
Details
Motivation: Existing image editing models and benchmarks focus mainly on instruction completion and overlook the physical effects that should accompany an edit (such as shadows and reflections); a systematic evaluation of physical realism is missing.
Method: Proposes the PICABench benchmark, covering eight physical sub-dimensions and common editing operations; designs the PICAEval protocol, which uses a vision-language model as judge combined with per-case, region-level human annotations; builds the PICA-100K training dataset by learning physics from videos.
Result: Evaluation of mainstream models shows that current image editing still falls well short on physical realism, leaving substantial room for improvement.
Conclusion: Physical realism is a key challenge for image editing; PICABench and PICA-100K provide a foundation and direction for physically consistent image editing.
Abstract: Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with much room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
[270] Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model
Xinwei Zhang,Hu Chen,Zhe Yuan,Sukun Tian,Peng Feng
Main category: cs.CV
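TL;DR: This paper proposes IC-MoE, a medical image segmentation foundation model based on an intelligent communication mixture-of-experts. By building several kinds of expert modules and a semantic-guided contrastive learning method, it strengthens high-level feature representation while preserving the structural integrity of the pretrained weights, significantly improving segmentation performance and generalization.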
Details
Motivation: When natural image segmentation foundation models are fine-tuned for medical imaging tasks, high-level features are insufficiently represented and the structural integrity of the pretrained weights is disrupted.
Method: 1) Builds basic experts, semantic experts, and adaptive experts, and applies a pixel probability adaptive voting strategy for expert selection and fusion; 2) proposes a semantic-guided contrastive learning method to alleviate weak supervision in contrastive learning.
Result: On three public medical image segmentation datasets, IC-MoE outperforms other SOTA models, showing stronger high-level feature representation and better structure preservation.
Conclusion: IC-MoE effectively strengthens the high-level feature representation of medical image segmentation foundation models while preserving pretrained structural integrity, and generalizes well.
Abstract: Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity.
We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.
[271] Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Min Cao,Xinyu Zhou,Ding Jiang,Bo Du,Mang Ye,Min Zhang
Main category: cs.CV
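TL;DR: This paper introduces a new multilingual text-to-image person retrieval (TIPR) task and builds a corresponding benchmark. The authors construct a multilingual dataset using large language models combined with domain knowledge, and propose the Bi-IRRA framework, which achieves fine-grained cross-modal, cross-lingual alignment through bidirectional implicit relation reasoning and a multi-dimensional global alignment module, reaching state-of-the-art performance on multiple multilingual TIPR datasets.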
Details
Motivation: Existing TIPR methods focus mainly on English and rely on explicit global or local alignment strategies, overlooking fine-grained cross-modal differences and limiting multilingual use; a method that handles both multilingual settings and modality heterogeneity is needed.
Method: Proposes the Bi-IRRA framework, comprising a bidirectional implicit relation reasoning module (which implicitly models local relations through bidirectional prediction of masked image and text) and a multi-dimensional global alignment module (which mitigates modality heterogeneity), trained and evaluated on a self-built multilingual TIPR benchmark.
Result: Achieves new state-of-the-art performance on all multilingual TIPR datasets, validating the method's effectiveness.
Conclusion: Bi-IRRA effectively addresses cross-modal and cross-lingual alignment in multilingual TIPR, advancing the task in multilingual scenarios with good practicality and extensibility.
Abstract: Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, while a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are presented in https://github.com/Flame-Chasers/Bi-IRRA.
[272] Towards 3D Objectness Learning in an Open World
Taichi Liu,Zhenyu Wang,Ruofeng Liu,Guang Wang,Desheng Zhang
Main category: cs.CV
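TL;DR: This paper proposes OP3Det, an open-world 3D detector that needs no text prompts. By fusing 2D semantic priors with 3D geometric priors and using a cross-modal mixture-of-experts mechanism, it achieves generalized detection of known and unknown objects in 3D scenes and significantly outperforms existing methods.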
Details
Motivation: Existing 3D object detectors trained in the closed-set setting generalize poorly to open-world scenes, while existing open-vocabulary models are limited by vocabulary expansion and semantic overlap and do not learn generalized 3D objectness effectively.
Method: Proposes OP3Det, a class-agnostic, prompt-free open-world 3D detector that exploits the strong generalization and zero-shot capabilities of 2D foundation models, combines 2D semantic priors with 3D geometric priors to generate class-agnostic proposals, and dynamically fuses uni-modal and multi-modal features through a cross-modal mixture of experts over point clouds and RGB images.
Result: Experiments show that OP3Det improves AR by up to 16.0% over existing open-world 3D detectors and by 13.5% over closed-world detectors, demonstrating excellent open-world 3D object discovery.
Conclusion: OP3Det achieves generalized 3D object detection in the open world without relying on hand-crafted text prompts, offering a new direction for 3D perception in open environments.
Abstract: Recent advancements in 3D object detection and novel category detection have made significant progress, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, we propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector to detect any objects within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.
[273] GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
Aleksandr Oganov,Ilya Bykov,Eva Neudachina,Mishan Aliev,Alexander Tolmachev,Alexander Sidorov,Aleksandr Zuev,Andrey Okhotin,Denis Rakitin,Aibek Alanov
Main category: cs.CV
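TL;DR: Proposes the Generalized Adversarial Solver (GAS), a new method that simplifies the parameterization of the ODE sampler and adds adversarial training, improving the quality and detail fidelity of few-step diffusion model generation.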
Details
Motivation: Diffusion models generate high-quality samples but sampling is computationally expensive; existing acceleration methods often rely on intricate training tricks and struggle to preserve details, so a simple, effective fast sampling method that retains fine detail is needed.
Method: Proposes a generalized solver structure with a simplified ODE sampler parameterization, and combines the original distillation loss with adversarial training to improve generated detail and reduce artifacts.
Result: Under similar resource constraints, GAS outperforms existing solver training methods in generation quality and detail fidelity.
Conclusion: The Generalized Adversarial Solver (GAS) is an efficient sampling method that needs no intricate training tricks and significantly improves few-step diffusion generation quality through a simple parameterization and adversarial training.
Abstract: While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at https://github.com/3145tttt/GAS.
[274] Elastic ViTs from Pretrained Models without Retraining
Walter Simoncini,Michael Dorkenwald,Tijmen Blankevoort,Cees G. M. Snoek,Yuki M. Asano
Main category: cs.CV
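TL;DR: SnapViT is a retraining-free, label-free, single-shot structured pruning method that enables elastic inference for Vision Transformers across a continuum of compute budgets.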
Details
Motivation: Existing vision foundation models come in fixed sizes and cannot easily adapt to the varied compute budgets of real deployments, leading to inefficiency.
Method: Combines gradient information with cross-network structure correlations, approximates the off-diagonal structure of the Hessian with an evolutionary algorithm, and designs a self-supervised importance scoring mechanism so that pruning works on unlabeled data.
Result: Outperforms existing methods on DINO, SigLIPv2, DeIT, and AugReg models, generating elastically adjustable models in under five minutes on a single A100 GPU.
Conclusion: SnapViT provides efficient, general, and fast Vision Transformer pruning that supports flexible deployment while maintaining strong performance.
Abstract: Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/
[275] Improving Cross-Patient Generalization in Parkinson's Disease Detection through Chunk-Based Analysis of Hand-Drawn Patterns
Mhd Adnan Albani,Riad Sonbol
Main category: cs.CV
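TL;DR: Proposes a new Parkinson's disease detection method based on chunked processing of hand-drawn images and ensemble learning, which performs strongly on both seen and unseen patient data.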
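The chunk-then-ensemble idea summarized above can be sketched in a few lines of plain Python. This is a hedged illustration only: the entry says each image is divided into 2x2 chunks and that "an ensemble method" merges the per-chunk decisions, but it does not name the ensemble rule or the per-chunk model. Majority voting and the `toy_chunk_classifier` placeholder below are assumptions made for the example.

```python
def split_2x2(image):
    """Split a 2D image (list of rows) into four chunks: TL, TR, BL, BR."""
    mid_h, mid_w = len(image) // 2, len(image[0]) // 2
    top, bottom = image[:mid_h], image[mid_h:]
    return [
        [row[:mid_w] for row in top],
        [row[mid_w:] for row in top],
        [row[:mid_w] for row in bottom],
        [row[mid_w:] for row in bottom],
    ]


def toy_chunk_classifier(chunk):
    """Placeholder per-chunk model (assumption): vote 1 if mean intensity > 0.5."""
    pixels = [v for row in chunk for v in row]
    return 1 if sum(pixels) / len(pixels) > 0.5 else 0


def majority_vote(votes):
    """Assumed ensemble rule: return 1 only if more than half the chunks vote 1."""
    return 1 if 2 * sum(votes) > len(votes) else 0


# Toy 4x4 "drawing" with made-up intensities.
image = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.9, 0.0, 0.1],
    [0.8, 0.6, 0.9, 0.7],
    [0.9, 0.7, 0.8, 0.6],
]
votes = [toy_chunk_classifier(c) for c in split_2x2(image)]
print(votes, majority_vote(votes))
```

In a real pipeline the placeholder would be replaced by the trained feature extractor and classifier, and a weighted or probability-averaging ensemble could be substituted for the vote.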
Details
Motivation: Existing Parkinson's disease detection methods suffer from insufficient datasets and poor robustness on unseen patient data.
Method: Uses a two-stage approach: images are first classified by drawing type, then each image is divided into 2x2 chunks for feature extraction, and an ensemble method merges the per-chunk decisions into the final classification.
Result: On the NewHandPD dataset, the method reaches 97.08% accuracy on seen patients and 94.91% on unseen patients, a drop of only 2.17 percentage points, better than prior work.
Conclusion: The proposed method improves both the accuracy of Parkinson's disease detection and its robustness on unseen patients, and has potential for clinical use.
Abstract: Parkinson's disease (PD) is a neurodegenerative disease affecting about 1% of people over the age of 60, causing motor impairments that impede hand coordination activities such as writing and drawing. Many approaches have tried to support early detection of Parkinson's disease based on hand-drawn images; however, we identified two major limitations in the related works: (1) the lack of sufficient datasets, and (2) limited robustness when dealing with unseen patient data. In this paper, we propose a new approach to detect Parkinson's disease that consists of two stages: the first stage classifies images by drawing type (circle, meander, spiral), and the second stage extracts the required features from the images and detects Parkinson's disease. We overcame the previous two limitations by applying a chunking strategy where we divide each image into 2x2 chunks. Each chunk is processed separately when extracting features and recognizing Parkinson's disease indicators. To make the final classification, an ensemble method is used to merge the decisions made from each chunk. Our evaluation shows that our proposed approach outperforms the top performing state-of-the-art approaches, in particular on unseen patients. On the NewHandPD dataset, our approach achieved 97.08% accuracy for seen patients and 94.91% for unseen patients, a gap of only 2.17 percentage points, compared to the 4.76-point drop observed in prior work.
[276] Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging
Suqiang Ma,Subhadeep Sengupta,Yao Lee,Beikang Gu,Xianyan Chen,Xianqiao Wang,Yang Liu,Mengjia Xu,Galit H. Frydman,He Li
Main category: cs.CV
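TL;DR: This study proposes an automated computational framework for analyzing images of circulating blood cell clusters (CCCs), combining a YOLOv11 model with overlays of multi-channel fluorescence stains to achieve high-accuracy cluster classification and cell type identification.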
Details
Motivation: Existing machine learning methods mainly target single-cell flow cytometry images; automated tools for analyzing cluster images, which are irregular in shape and highly heterogeneous, are lacking, limiting their use in conditions such as thrombosis and infection.
Method: Uses a two-step strategy: first, a fine-tuned YOLOv11 model classifies images into cluster and non-cluster groups; then cell types within each cluster are identified by overlaying the cluster contour with regions from multi-channel fluorescence stains.
Result: The framework achieves over 95% accuracy in both cluster classification and phenotype identification, outperforming traditional CNN and Vision Transformer models.
Conclusion: The proposed automated framework integrates bright-field and fluorescence data to analyze circulating blood cell clusters accurately, and could be extended to studies of immune and tumor cell clusters.
Abstract: Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells (WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once (YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs) and Vision Transformers (ViT). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data.
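The second step described above, overlaying a cluster contour with stained regions, can be illustrated with binary masks. The sketch below is an assumption-laden toy: the abstract does not give the overlap criterion, so the `min_fraction` threshold, the channel names, and the use of filled binary masks in place of contours are all hypothetical.

```python
def overlap_fraction(cluster_mask, stain_mask):
    """Fraction of cluster pixels that are also positive in one stain channel."""
    cluster_pixels = 0
    overlap = 0
    for c_row, s_row in zip(cluster_mask, stain_mask):
        for c, s in zip(c_row, s_row):
            if c:
                cluster_pixels += 1
                if s:
                    overlap += 1
    return overlap / cluster_pixels if cluster_pixels else 0.0


def cell_types_in_cluster(cluster_mask, stain_masks, min_fraction=0.1):
    """Names of channels whose stain covers at least min_fraction of the cluster.

    min_fraction is an assumed threshold; stain outside the cluster is ignored,
    which is one simple way to discount debris and stray staining.
    """
    return sorted(
        name
        for name, mask in stain_masks.items()
        if overlap_fraction(cluster_mask, mask) >= min_fraction
    )


# Toy 3x4 masks (made up): 1 marks a positive pixel.
cluster = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
stains = {
    "platelet": [[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]],
    "wbc": [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0]],
    "debris": [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]],
}
print(cell_types_in_cluster(cluster, stains))
```

Here the "debris" channel is positive only outside the cluster, so it contributes nothing to the cluster's phenotype.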
Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.
[277] Raindrop GS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions
Zhiqiang Teng,Beibei Lin,Tingting Chen,Zifeng Yuan,Xuanyi Li,Xuanyu Zhang,Shunli Zhang
Main category: cs.CV
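TL;DR: This paper proposes RaindropGS, a comprehensive benchmark for 3D Gaussian Splatting (3DGS) reconstruction under raindrop conditions that evaluates the full pipeline from contaminated images to clear reconstructions.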
Details
Motivation: Existing 3DGS methods degrade when raindrops cause lens occlusion and optical distortion, and the domain gap between synthetic and real raindrops makes it hard to assess real-world performance accurately.
Method: Builds a dataset of real raindrop data containing raindrop-focused, background-focused, and rain-free ground-truth images; designs a complete evaluation pipeline covering data preparation, data processing, and raindrop-aware 3DGS evaluation; and analyzes the impact of individual modules such as pose estimation, point cloud initialization, and rain removal.
Result: Reveals the performance limitations of current 3DGS methods on unconstrained raindrop images and clarifies how camera focus position and errors in pose and point cloud initialization affect reconstruction quality.
Conclusion: RaindropGS provides clear directions and a reliable benchmark for developing more robust 3DGS methods under raindrop conditions.
Abstract: 3D Gaussian Splatting (3DGS) under raindrop conditions suffers from severe occlusions and optical distortions caused by raindrop contamination on the camera lens, substantially degrading reconstruction quality. Existing benchmarks typically evaluate 3DGS using synthetic raindrop images with known camera poses (constrained images), assuming ideal conditions. However, in real-world scenarios, raindrops often interfere with accurate camera pose estimation and point cloud initialization. Moreover, a significant domain gap between synthetic and real raindrops further impairs generalization. To tackle these issues, we introduce RaindropGS, a comprehensive benchmark designed to evaluate the full 3DGS pipeline, from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions. Specifically, the whole benchmark pipeline consists of three parts: data preparation, data processing, and raindrop-aware 3DGS evaluation, including types of raindrop interference, camera pose estimation and point cloud initialization, single image rain removal comparison, and 3D Gaussian training comparison. First, we collect a real-world raindrop reconstruction dataset, in which each scene contains three aligned image sets: raindrop-focused, background-focused, and rain-free ground truth, enabling a comprehensive evaluation of reconstruction quality under different focus conditions.
Through comprehensive experiments and analyses, we reveal critical insights into the performance limitations of existing 3DGS methods on unconstrained raindrop images and the varying impact of different pipeline components: the impact of camera focus position on 3DGS reconstruction performance, and the interference caused by inaccurate pose and point cloud initialization on reconstruction. These insights establish clear directions for developing more robust 3DGS methods under raindrop conditions.
[278] MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
Yaning Pan,Zekun Wang,Qianqian Xie,Yongqian Wen,Yuanxing Zhang,Guohui Zhang,Haoxuan Hu,Zhiyu Pan,Yibing Huang,Zhidong Gan,Yonghong Lin,An Ping,Tianhao Peng,Jiaheng Liu
Main category: cs.CV
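TL;DR: This paper proposes MT-Video-Bench, a comprehensive benchmark for evaluating how well multimodal large language models understand video in multi-turn dialogues. It contains 987 carefully curated multi-turn dialogues, focuses on perceptivity and interactivity, and reveals the limitations of existing models on such tasks.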
Details
Motivation: Most existing evaluation benchmarks for multimodal large language models are limited to single-turn question answering and do not cover the complex multi-turn dialogues of real-world use, so a benchmark closer to practical applications is needed.
Method: Builds a new benchmark, MT-Video-Bench, with 987 multi-turn dialogues from multiple domains; evaluates models on six core competencies, especially perceptivity and interactivity; and extensively evaluates a range of state-of-the-art open-source and closed-source MLLMs.
Result: Experiments show significant performance discrepancies and limitations among existing MLLMs on multi-turn video dialogues, with particularly weak results on interactive tasks.
Conclusion: MT-Video-Bench effectively evaluates MLLMs in realistic multi-turn video dialogue scenarios and should help advance future multimodal dialogue systems; the benchmark will be released publicly to support research.
Abstract: The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.
[279] Signature Forgery Detection: Improving Cross-Dataset Generalization
Matheus Ramos Parracho
Main category: cs.CV
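TL;DR: This study investigates feature learning strategies for offline signature verification, aiming to improve model generalization across datasets. Using three public benchmarks (CEDAR, ICDAR, and GPDS Synthetic), it builds two experimental pipelines, one based on raw images and one on shell preprocessing. The raw-image model performs better overall, while the shell preprocessing approach shows potential for further improvement.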
Details
Motivation: Existing deep learning methods generalize poorly in cross-dataset signature verification and are strongly affected by differences in handwriting styles and acquisition protocols, so model robustness needs to improve. Method: Uses three public datasets (CEDAR, ICDAR, GPDS Synthetic) to build two experimental pipelines: one works directly on raw signature images, and the other applies shell preprocessing for feature enhancement. The two are compared in cross-dataset scenarios. Result: The raw-image model performs better on multiple benchmarks and shows stronger cross-dataset performance; the shell preprocessing model shows no clear advantage but indicates potential value for cross-domain signature verification. Conclusion: Deep learning models that use raw images directly have the advantage in cross-dataset signature verification, while shell preprocessing still has room to develop, and future optimization could further improve its robustness and generalization. Abstract: Automated signature verification is a critical biometric technique used in banking, identity authentication, and legal documentation. Despite the notable progress achieved by deep learning methods, most approaches in offline signature verification still struggle to generalize across datasets, as variations in handwriting styles and acquisition protocols often degrade performance. This study investigates feature learning strategies for signature forgery detection, focusing on improving cross-dataset generalization -- that is, model robustness when trained on one dataset and tested on another. Using three public benchmarks -- CEDAR, ICDAR, and GPDS Synthetic -- two experimental pipelines were developed: one based on raw signature images and another employing a preprocessing method referred to as shell preprocessing. Several behavioral patterns were identified and analyzed; however, no definitive superiority between the two approaches was established. The results show that the raw-image model achieved higher performance across benchmarks, while the shell-based model demonstrated promising potential for future refinement toward robust, cross-domain signature verification.[280] Can Image-To-Video Models Simulate Pedestrian Dynamics?
Aaron Appelle,Jerome P. Lynch
Main category: cs.CV
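TL;DR: Investigates whether image-to-video models based on the diffusion transformer (DiT) can generate realistic pedestrian movement patterns in crowded public scenes.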
Details
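Motivation: To explore how well current high-performing I2V models capture pedestrian motion in real-world scenes, especially under complex crowd dynamics. Method: Conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks and evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics. Result: Confirms the potential of these models to generate realistic pedestrian motion and provides an in-depth analysis of the world-modeling capabilities of existing I2V models. Conclusion: With appropriate conditioning, DiT-based I2V models can generate videos consistent with real pedestrian dynamics, showing promise as tools for simulating pedestrian behavior. Abstract: Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.[281] Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition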
Timur Ismagilov,Shakaiba Majeed,Michael Milford,Tan Viet Tuyen Nguyen,Sarvapali D. Ramchurn,Shoaib Ehsan
Main category: cs.CV
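TL;DR: Proposes a training-free, descriptor-agnostic method for multi-reference visual place recognition that models multiple reference descriptors via matrix factorisation, enabling projection-based residual matching and substantially improving recall across several kinds of data.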
Details
Motivation: Existing methods for handling appearance and viewpoint change are limited by computational cost or rely on heuristics, which makes it hard to exploit multi-reference data effectively to improve localisation. Method: Uses matrix decomposition to factor multiple reference descriptors into basis representations and performs descriptor-level fusion, giving training-free, projection-based residual matching that does not depend on a particular descriptor type. Result: Recall@1 improves by about 18% on multi-appearance data and by about 5% on unstructured data, outperforming single-reference and multi-reference baselines, with strong generalisation and a lightweight footprint. Conclusion: Without adding training overhead, the proposed method improves the robustness and accuracy of multi-reference VPR and is suited to visual localisation in complex environments. Abstract: We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incur extensive computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics with limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.[282] Towards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion
Md. Enamul Atiq,Shaikh Anowarul Fattah
Main category: cs.CV
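TL;DR: Proposes a skin cancer classification framework based on a dual-encoder attention mechanism that combines lesion segmentation with clinical metadata to improve accuracy and interpretability.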
Details
Motivation: Early detection of skin cancer is critical, but existing deep learning models struggle with classification because of high intra-class variability and low inter-class differences, and their lack of interpretability limits clinical use. Method: Segments lesions with a Deep-UNet equipped with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP). The classification stage uses two DenseNet201 encoders, one processing the original image and the other the segmentation result, and fuses their features via multi-head cross-attention. A transformer-based module integrates metadata such as patient age, sex, and lesion site. Result: Achieves state-of-the-art segmentation performance on the HAM10000, ISIC 2018, and ISIC 2019 datasets and significantly improves classification accuracy and average AUC. Grad-CAM visualizations confirm that the model attends to the lesion area and relies less on background noise. Conclusion: Combining precise lesion segmentation with clinical metadata, and fusing features through attention, effectively improves the accuracy and interpretability of automated skin cancer diagnosis models and strengthens clinical trust. Abstract: Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as "black boxes," limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders, one on the original image and another on the segmented lesion, whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model's reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model's predictions are based on the lesion area, unlike models that rely on spurious background features.
These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.[283] SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Samir Khaki,Junxian Guo,Jiaming Tang,Shang Yang,Yukang Chen,Konstantinos N. Plataniotis,Yao Lu,Song Han,Zhijian Liu
Main category: cs.CV
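TL;DR: SparseVILA proposes a new paradigm for efficient vision language model inference that decouples visual sparsity across the prefilling and decoding stages, delivering substantial speedups while maintaining or even improving performance.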
Details
Motivation: In existing vision language models, the growing number of visual tokens drives up inference latency and limits scalability, so efficient inference methods are needed. Method: Proposes SparseVILA, which distributes sparsity across the prefilling and decoding stages: it prunes redundant visual tokens during prefill and retrieves only query-relevant tokens during decoding, built on an AWQ-optimized inference pipeline. Result: On long-context video tasks, prefilling is up to 4.0 times faster, decoding is 2.5 times faster, and the end-to-end speedup is 2.6 times, while accuracy improves on document-understanding and reasoning tasks. Conclusion: By decoupling query-agnostic pruning from query-aware retrieval, SparseVILA provides a training-free, architecture-agnostic framework for efficient multimodal inference that significantly accelerates large models without sacrificing capability. Abstract: Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.[284] ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
Zixin Yin,Ling-Hao Chen,Lionel Ni,Xili Dai
Main category: cs.CV
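TL;DR: This paper proposes ConsistEdit, a novel attention control method for the MM-DiT architecture that achieves stronger consistency and editing capability in image and video editing.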