cs.CV
[1] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models
Sridhar S,Nithin A,Shakeel Rifath,Vasantha Raj
Main category: cs.CV
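TL;DR: This paper proposes an automated method for generating 60-second cinematic videos based on Stable Diffusion, GPT-2, and a hybrid audio pipeline, combined with linear frame interpolation and post-processing, to achieve high-quality video synthesis.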
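The linear frame interpolation mentioned for this entry can be illustrated with a minimal sketch. It assumes frames are plain nested lists of pixel intensities; the function name `lerp_frames` and the toy 2x2 frames are hypothetical and are not taken from the authors' code.

```python
def lerp_frames(frame_a, frame_b, num_intermediate):
    """Return `num_intermediate` frames blended linearly between two key frames.

    Frames are nested lists (rows of pixel intensities). This is an
    illustrative stand-in, not the paper's implementation.
    """
    frames = []
    for k in range(1, num_intermediate + 1):
        # Blend weight moves evenly from frame_a (t near 0) to frame_b (t near 1).
        t = k / (num_intermediate + 1)
        frames.append([
            [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ])
    return frames


# Toy key frames: an all-black and an all-white 2x2 image.
key_a = [[0.0, 0.0], [0.0, 0.0]]
key_b = [[1.0, 1.0], [1.0, 1.0]]
mid = lerp_frames(key_a, key_b, 3)
```

With three intermediate frames the blend weights are 0.25, 0.5, and 0.75, so two generated key frames become five displayed frames.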
Details
Motivation: Advances in generative AI have transformed multimedia creation; this work aims to automatically generate high-quality cinematic videos from text input to meet creative, educational, and industrial needs.
Method: A five-scene framework combines Stable Diffusion for high-fidelity images, GPT-2 for narrative structure, and a hybrid audio pipeline (gTTS and YouTube music) for sound, with linear frame interpolation and post-processing to refine video quality.
Result: Experiments show that the method performs well in visual quality, narrative coherence, and efficiency, supporting resolutions up to 1024x768 and frame rates of 15-30 FPS.
Conclusion: The method advances text-to-video synthesis and suits a range of application scenarios.
Abstract: Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
[2] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Chenjian Gao,Lihe Ding,Xin Cai,Zhanpeng Huang,Zibin Wang,Tianfan Xue
Main category: cs.CV
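TL;DR: Proposes a mask-based LoRA tuning method for flexible video editing, addressing existing methods' reliance on large-scale pretraining and their lack of control over subsequent frames.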
Details
Motivation: Current video editing methods rely on large-scale pretraining and lack flexibility, and first-frame-guided editing offers limited control over subsequent frames.
Method: A mask-driven LoRA tuning strategy combines the input video with reference images and dynamically modulates what the model attends to, enabling region-specific learning.
Result: Experiments show that the method outperforms existing techniques in video editing performance.
Conclusion: The method offers an efficient, flexible video editing solution without changing the model architecture.
Abstract: Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edit propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.
[3] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding
Bin Guo,John H. L. Hansen
Main category: cs.CV
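TL;DR: DeepTraverse is a new vision architecture inspired by algorithmic search strategies that builds features through systematic exploration and adaptive calibration modules, markedly improving classification accuracy and feature discrimination.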
Details
Motivation: Feature construction in conventional vision backbones lacks explicit pathways for adaptive, iterative refinement; the work asks whether principles from algorithmic search can yield a more structured, logical feature-processing flow.
Method: DeepTraverse uses recursive exploration modules, which deepen feature analysis with parameter sharing, and adaptive calibration modules, which dynamically adjust feature salience.
Result: On several image classification benchmarks, DeepTraverse performs strongly, with classification accuracy and feature discrimination exceeding those of conventional models.
Conclusion: Introducing algorithmic priors is an effective strategy for building efficient, high-performing, structured vision backbones.
Abstract: Conventional vision backbones, despite their success, often construct features through a largely uniform cascade of operations, offering limited explicit pathways for adaptive, iterative refinement. This raises a compelling question: can principles from classical search algorithms instill a more algorithmic, structured, and logical processing flow within these networks, leading to representations built through more interpretable, perhaps reasoning-like decision processes? We introduce DeepTraverse, a novel vision architecture directly inspired by algorithmic search strategies, enabling it to learn features through a process of systematic elucidation and adaptive refinement distinct from conventional approaches. DeepTraverse operationalizes this via two key synergistic components: recursive exploration modules that methodically deepen feature analysis along promising representational paths with parameter sharing for efficiency, and adaptive calibration modules that dynamically adjust feature salience based on evolving global context. The resulting algorithmic interplay allows DeepTraverse to intelligently construct and refine feature patterns. Comprehensive evaluations across a diverse suite of image classification benchmarks show that DeepTraverse achieves highly competitive classification accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts. Our work demonstrates that integrating such algorithmic priors provides a principled and effective strategy for building more efficient, performant, and structured vision backbones.
[4] Test-Time Adaptation for Generalizable Task Progress Estimation
Christos Ziakas,Alessandra Russo
Main category: cs.CV
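TL;DR: Proposes a test-time adaptation method that optimizes a self-supervised objective so that a progress estimation model can adapt online to the visual and temporal context of test trajectories.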
Details
Motivation: To address the generalization of progress estimation models across diverse tasks, environments, and embodiments, going beyond existing in-context learning approaches.
Method: A gradient-based meta-learning strategy trains the model on expert visual trajectories and natural language task descriptions, so that test-time adaptation relies on semantic content rather than temporal order.
Result: The method performs well on diverse out-of-distribution tasks, environments, and embodiments, outperforming existing approaches based on autoregressive vision-language models.
Conclusion: Test-time adaptation driven by semantic content substantially improves the generalization of progress estimation.
Abstract: We propose a test-time adaptation method that enables a progress estimation model to adapt online to the visual and temporal context of test trajectories by optimizing a learned self-supervised objective. To this end, we introduce a gradient-based meta-learning strategy to train the model on expert visual trajectories and their natural language task descriptions, such that test-time adaptation improves progress estimation relying on semantic content over temporal order. Our test-time adaptation method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art in-context learning approach using autoregressive vision-language models.
[5] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Yantai Yang,Yuhao Wang,Zichen Wen,Luo Zhongwei,Chang Zou,Zhipeng Zhang,Chuan Wen,Linfeng Zhang
Main category: cs.CV
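TL;DR: EfficientVLA is a training-free inference acceleration framework that jointly exploits multiple kinds of redundancy to substantially improve the inference speed and computational efficiency of Vision-Language-Action (VLA) models.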
Details
Motivation: Existing VLA models (such as diffusion-based architectures) carry heavy redundancy in their compute and memory demands, which limits practical deployment. Existing acceleration methods usually target isolated inefficiencies and do not address bottlenecks across the whole VLA pipeline.
Method: EfficientVLA uses three strategies: (1) pruning the language module based on inter-layer redundancy analysis; (2) task-aware visual token selection to streamline visual processing; (3) caching and reusing intermediate features to cut computational redundancy in the diffusion action head.
Result: Applied to the CogACT model, EfficientVLA yields a 1.93x inference speedup and reduces computation to 28.9%, with only a 0.6% drop in performance.
Conclusion: By systematically eliminating redundancy, EfficientVLA substantially improves the efficiency and practicality of VLA models and offers a viable route to real-world deployment.
Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to a standard VLA model, CogACT, yielding a 1.93X inference speedup and reducing FLOPs to 28.9%, with only a 0.6% success rate drop in the SIMPLER benchmark.
[6] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild
Klim Kireev,Ana-Maria Creţu,Raphael Meier,Sarah Adel Bargal,Elissa Redmiles,Carmela Troncoso
Main category: cs.CV
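TL;DR: The paper introduces ICCWD, a dataset for detecting content depicting minors in a multi-modal setting, filling an existing gap, and validates its usefulness.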
Details
Motivation: Existing tools lack datasets and benchmarks for detecting content depicting minors in a multi-modal setting, a gap that needs filling.
Method: The authors release ICCWD, a dataset of 10,000 image-caption pairs, and test the performance of three detectors.
Result: The best detection method reaches a 75.3% true positive rate, indicating that minor detection is a challenging task.
Conclusion: The ICCWD dataset is expected to help in designing better minor detection methods.
Abstract: Platforms and the law regulate digital content depicting minors (defined as individuals under 18 years of age) differently from other types of content. Given the sheer amount of content that needs to be assessed, machine learning-based automation tools are commonly used to detect content depicting minors. To our knowledge, no dataset or benchmark currently exists for detecting these identification methods in a multi-modal environment. To fill this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an image-caption dataset aimed at benchmarking tools that detect depictions of minors. Our dataset is richer than previous child image datasets, containing images of children in a variety of contexts, including fictional depictions and partially visible bodies. ICCWD contains 10,000 image-caption pairs manually labeled to indicate the presence or absence of a child in the image. To demonstrate the possible utility of our dataset, we use it to benchmark three different detectors, including a commercial age estimation system applied to images. Our results suggest that child detection is a challenging task, with the best method achieving a 75.3% true positive rate. We hope the release of our dataset will aid in the design of better minor detection methods in a wide range of scenarios.
[7] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers
Natanael Lucena,Fábio S. da Silva,Ricardo Rios
Main category: cs.CV
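TL;DR: Compares CNNs and ViTs on classifying images of psoriasis and similar diseases; ViTs perform better, in particular the DaViT-B model, which reaches an F1-score of 96.4%.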
Details
Motivation: To explore the performance differences between CNNs and ViTs in medical image classification, specifically for psoriasis detection.
Method: CNN and ViT models pre-trained on ImageNet are adapted to a specific dataset for a multi-class classification task.
Result: ViTs outperform CNNs, in particular the DaViT-B model, which reaches an F1-score of 96.4%.
Conclusion: ViTs show considerable potential for medical image classification, and DaViT-B is an efficient architecture for automated psoriasis detection.
Abstract: This paper presents a comparison of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying images containing lesions of psoriasis and diseases similar to it. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an F1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.
[8] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang,Zhengyuan Yang,Chao Feng,Yongyuan Liang,Yuhang Zhou,Xiaoyu Liu,Ziyi Zang,Ming Li,Chung-Ching Lin,Kevin Lin,Linjie Li,Furong Huang,Lijuan Wang
Main category: cs.CV
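TL;DR: The ViCrit task trains vision-language models to localize artificially injected visual hallucinations, improving visual perception and yielding strong results on multiple benchmarks.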
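This entry's abstract describes a binary, exact-match reward for pinpointing the corrupted caption span. A minimal sketch of such a reward follows; the function name and the case and whitespace normalization are assumptions made for illustration, since the abstract does not specify how spans are compared.

```python
def exact_match_reward(predicted_span: str, corrupted_span: str) -> float:
    """Return 1.0 if the predicted span equals the injected error span, else 0.0.

    Lowercasing and whitespace collapsing are assumptions of this sketch,
    not details given in the abstract.
    """
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return 1.0 if normalize(predicted_span) == normalize(corrupted_span) else 0.0


# Hypothetical example: the injected error changed "one" to "two".
hit = exact_match_reward("Two  red apples", "two red apples")
miss = exact_match_reward("three red apples", "two red apples")
```

Because the reward depends only on a string comparison against the known injected span, it needs no learned judge, which is what makes the task easy to verify.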
Details
Motivation: To address the lack of visual perception tasks for vision-language models that are both challenging and unambiguously verifiable.
Method: Introduces the ViCrit task, which injects a subtle visual description error and trains the model to localize it, providing a clear binary reward.
Result: Models improve markedly on a range of vision-language benchmarks, and the gains transfer to abstract image reasoning and visual math.
Conclusion: ViCrit is an effective, general approach that substantially improves the visual perception of vision-language models.
Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model to pinpoint the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
[9] RoCA: Robust Cross-Domain End-to-End Autonomous Driving
Rajeev Yasarla,Shizhong Han,Hsin-Pai Cheng,Litian Liu,Shweta Mahajan,Apratim Bhattacharyya,Yunxiao Shi,Risheek Garrepalli,Hong Cai,Fatih Porikli
Main category: cs.CV
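TL;DR: RoCA is a framework for cross-domain end-to-end autonomous driving that models a joint probabilistic distribution and learns basis tokens with a Gaussian process, improving model generalization and adaptability.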
Details
Motivation: To address the challenge of cross-domain deployment while avoiding the costly retraining that large language models require.
Method: A Gaussian process learns basis tokens and their trajectories and probabilistically infers future trajectories; used together with a base model, it improves generalization.
Result: The method performs well across various cross-domain scenarios, significantly outperforming direct finetuning.
Conclusion: RoCA shows strong cross-domain generalization and adaptation without extra inference computation.
Abstract: End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.
[10] SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score
Mohammad Jalali,Haoyu Lei,Amin Gohari,Farzan Farnia
Main category: cs.CV
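TL;DR: The paper proposes SPARKE, a method that uses conditional entropy for prompt-aware diversity guidance, addressing insufficient sample diversity in prompt-guided diffusion models while substantially reducing computational complexity.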
Details
Motivation: Prompt-guided diffusion models struggle to ensure diverse samples, especially when prompts span a broad semantic range and diversity must be assessed across similar prompts.
Method: SPARKE uses conditional entropy for diversity guidance, dynamically conditioning the diversity measurement on similar prompts, and uses conditional latent RKE score guidance to reduce computational complexity from $O(n^3)$ to $O(n)$.
Result: Experiments show that SPARKE markedly improves the prompt-aware diversity of generated data across several text-to-image diffusion models without significantly increasing computational cost.
Conclusion: SPARKE effectively addresses prompt-aware diversity and, through its reduced computational complexity, is practical for large-scale generation settings.
Abstract: Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs.
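To make the complexity claim concrete, the sketch below computes an order-2 Rényi kernel entropy (RKE) score in its commonly cited unconditional form, $n^2 / \lVert K \rVert_F^2$ for a normalized kernel matrix $K$. This is an assumption of the sketch, not SPARKE's conditional, prompt-aware version. The Gaussian kernel, the bandwidth, and all function names are likewise illustrative. The second helper shows one plausible reading of the linear cost: adding a new sample only requires kernel values against the $n$ existing samples.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Normalized Gaussian kernel, so k(x, x) = 1 (an illustrative choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def squared_kernel_sum(samples, bandwidth=1.0):
    """Squared Frobenius norm of the kernel matrix: sum of k(x_i, x_j)^2."""
    return sum(gaussian_kernel(x, y, bandwidth) ** 2 for x in samples for y in samples)


def rke_score(samples, bandwidth=1.0):
    """Order-2 RKE score n^2 / ||K||_F^2; roughly an effective number of modes."""
    n = len(samples)
    return n * n / squared_kernel_sum(samples, bandwidth)


def updated_squared_sum(sq_sum, samples, new_sample, bandwidth=1.0):
    """Update ||K||_F^2 after appending one sample using only n kernel evaluations."""
    cross = sum(gaussian_kernel(x, new_sample, bandwidth) ** 2 for x in samples)
    return sq_sum + 2 * cross + gaussian_kernel(new_sample, new_sample, bandwidth) ** 2


identical = rke_score([(0.0, 0.0), (0.0, 0.0)])      # one mode
separated = rke_score([(0.0, 0.0), (100.0, 100.0)])  # two well-separated modes
```

No eigendecomposition of the kernel matrix is needed here, in contrast to general matrix-based entropy scores, which is consistent with the abstract's contrast between $O(n^3)$ and $O(n)$.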
We release our code on the project page: https://mjalali.github.io/SPARKE
[11] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context
Yael Frischholz,Devis Tuia,Michael Lehning
Main category: cs.CV
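TL;DR: The paper proposes an attention-based emulator that learns to infer clear-sky surface reflectance from satellite image sequences, addressing the failure of traditional methods in mountainous regions with changing snow cover.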
Details
Motivation: Traditional methods estimate background reflectance from monthly statistics and fail in mountainous regions where snow cover changes dynamically, so a method that automatically learns surface reflectance dynamics is needed.
Method: An attention-based emulator built on a temporo-spatial vision transformer requires no hand-crafted features (such as albedo maps or cloud masks); its inputs are multi-spectral satellite imagery, static topographic features, and solar geometry, and its training target is the SSR estimate from the HelioMont algorithm.
Result: Given a sufficiently long temporal context, the model performs on par with albedo-informed methods, stands out particularly in mountainous regions, and generalizes better in complex terrain.
Conclusion: The method learns surface reflectance dynamics internally and suits complex terrain and regions with dynamic snow cover; code and data are open source.
Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont's SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided a sufficiently long temporal context, the model matches the performance of albedo-informed models, highlighting the model's ability to internally learn and exploit latent surface reflectance dynamics.
Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at https://github.com/frischwood/HeMu-dev.git
[12] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
Bill Psomas,Dionysis Christopoulos,Eirini Baltzi,Ioannis Kakogeorgiou,Tilemachos Aravanis,Nikos Komodakis,Konstantinos Karantzalos,Yannis Avrithis,Giorgos Tolias
Main category: cs.CV
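TL;DR: The paper proposes efficient probing (EP), which uses a multi-query cross-attention mechanism to improve the accuracy and efficiency of attentive probing.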
Details
Motivation: Linear probing (LP) cannot adequately assess the potential of masked image modeling (MIM) models, and existing attentive probing methods suffer from redundant parameters and poor computational efficiency, so a more efficient probing method is needed.
Method: Introduces efficient probing (EP), a multi-query cross-attention mechanism that reduces redundant projections and trainable parameters and improves computational efficiency.
Result: EP outperforms LP and existing attentive probing methods on seven benchmarks, generalizes well, produces interpretable attention maps, and performs strongly in low-shot and layer-wise settings.
Conclusion: EP is a simple yet efficient attentive probing method that markedly improves the accuracy-efficiency trade-off.
Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
[13] Improving Personalized Search with Regularized Low-Rank Parameter Updates
Fiona Ryan,Josef Sivic,Fabian Caba Heilbron,Judy Hoffman,James M. Rehg,Bryan Russell
Main category: cs.CV
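TL;DR: This paper proposes a regularized low-rank adaptation method for efficiently adapting a vision-language dual encoder model to personalized vision-language retrieval, substantially improving recognition of new concepts.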
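As a rough illustration of low-rank parameter updates and of combining several personal concepts by parameter addition, the sketch below forms a rank-1 update `B @ A` per concept and adds the updates to a frozen weight matrix. The matrix sizes, values, and helper names are hypothetical, and the regularization the paper applies during training is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def personalize(base_weight, concept_factors):
    """Return base_weight plus the sum of low-rank updates B @ A, one per concept."""
    weight = base_weight
    for b_factor, a_factor in concept_factors:
        weight = add(weight, matmul(b_factor, a_factor))
    return weight


# Frozen 2x2 weight and two rank-1 concept updates (toy values).
base = [[1.0, 0.0], [0.0, 1.0]]
concept_one = ([[1.0], [0.0]], [[0.0, 2.0]])  # B is 2x1, A is 1x2
concept_two = ([[0.0], [1.0]], [[3.0, 0.0]])
combined = personalize(base, [concept_one, concept_two])
```

Each concept stores only its two small factors, so adding or removing a concept amounts to adding or subtracting one low-rank product from the shared base weight.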
Details
Motivation: Personalized vision-language retrieval requires learning a new concept (e.g. "my dog Fido") from a few examples and integrating personal and general knowledge to recognize the concept in different contexts.
Method: Regularized low-rank adaptation of a small set of parameters in the language encoder's final layer replaces textual inversion; the authors also explore strategies for combining the parameters of multiple personal concepts (such as parameter addition).
Result: On two personalized image retrieval benchmarks (DeepFashion2 and ConCon-Chi), the method improves retrieval accuracy over prior art by 4%-22%.
Conclusion: Regularized low-rank adaptation markedly improves recognition of personal concepts while preserving general knowledge, offering an efficient solution for personalized vision-language retrieval.
Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries (DeepFashion2 and ConCon-Chi), outperforming the prior art by 4%-22% on personal retrievals.
[14] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators
Parsa Rahimi,Sebastien Marcel
Main category: cs.CV
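TL;DR: ScoreMix is a diffusion-based data augmentation strategy that mixes the scores of different classes to generate challenging samples, substantially improving discriminator performance, especially when labeled data is limited.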
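The convex score mixing described for this entry can be sketched with a toy stand-in. In the example below, each class-conditional score is the analytic score of a 1-D Gaussian rather than the output of a trained diffusion model; the class means, the mixing weight `lam`, and the function names are all assumptions made for illustration.

```python
def gaussian_score(x, mean, std=1.0):
    """Score (gradient of the log-density) of a 1-D Gaussian at x."""
    return -(x - mean) / (std ** 2)


def mixed_score(x, mean_a, mean_b, lam):
    """Convex combination of two class-conditional scores, with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    return lam * gaussian_score(x, mean_a) + (1.0 - lam) * gaussian_score(x, mean_b)


# Toy classes centred at -1 and +1; an equal mix pulls samples toward 0.
balanced = mixed_score(0.0, mean_a=-1.0, mean_b=1.0, lam=0.5)
only_a = mixed_score(0.0, mean_a=-1.0, mean_b=1.0, lam=1.0)
```

In an actual sampler the mixed score would replace the single-class score at each denoising step, steering samples toward a region between the two classes.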
Details
Motivation: To address weak discriminator performance when labeled data is limited by using the score-composition properties of diffusion models to generate high-quality synthetic samples.
Method: During diffusion sampling, synthetic samples are generated by convexly combining the scores of different classes, and the effect of class-selection strategies on performance is studied.
Result: ScoreMix substantially improves discriminator performance on all tested benchmarks, and mixing classes that are far apart in the discriminator's embedding space works better.
Conclusion: ScoreMix improves performance without extensive parameter searches, offering a practical and efficient approach to training discriminative models.
Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator's embedding space, rather than close in the generator's condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator's learned condition space and the discriminator's embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: https://parsa-ra.github.io/scoremix
[15] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops
Hamid Kamangir,Mona Hajiesmaeeli,Mason Earles
Main category: cs.CV
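TL;DR: The study presents a comprehensive California crop yield benchmark dataset and, drawing on multi-source data, develops a multi-modal deep learning model for county-level crop yield forecasting with strong predictive performance (R² = 0.76).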
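For readers unfamiliar with the reported metric, the coefficient of determination can be computed as one minus the ratio of residual to total sum of squares. The sketch below uses made-up yield values; it does not reproduce the paper's evaluation pipeline, and the function name is hypothetical.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy yields: a perfect forecast and a forecast that always predicts the mean.
observed = [1.0, 2.0, 3.0]
perfect = r2_score(observed, [1.0, 2.0, 3.0])
mean_only = r2_score(observed, [2.0, 2.0, 2.0])
```

A score of 1 means the forecast explains all variance in the observed yields, while 0 means it does no better than always predicting the mean, so 0.76 sits between the two.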
Details
Motivation: Although the USDA provides extensive historical yield data, accurate and timely crop yield forecasting remains challenging because of the complex interplay of environmental, climatic, and soil factors.
Method: The work integrates Landsat satellite imagery, climate records, evapotranspiration data, and high-resolution soil properties, and develops a multi-modal deep learning model that combines stratified feature extraction with a time-series encoder.
Result: The model achieves an R² score of 0.76 on an unseen test dataset, showing strong predictive performance.
Conclusion: The benchmark and modeling framework provide a foundation for agricultural forecasting, climate adaptation, and precision farming; the dataset and code are publicly available.
Abstract: California is a global leader in agricultural production, contributing 12.5% of the United States' total output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a time-series encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall R² score of 0.76 across all crops on an unseen test dataset, highlighting strong predictive performance across California's diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.
[16] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos
Rajeev Yasarla,Shizhong Han,Hong Cai,Fatih Porikli
Main category: cs.CV
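TL;DR: DySS is a 3D object detection method based on state-space learning and dynamic queries; by dynamically updating queries and training with auxiliary tasks, it achieves efficient, high-performing detection.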
Details
Motivation: Existing sparse query-based methods become computationally expensive as more video frames are used, so a more efficient detection method is needed.
Method: A state-space model (SSM) processes temporal features, auxiliary tasks of future prediction and masked reconstruction are introduced, and queries are updated dynamically (merge, remove, split).
Result: DySS reaches 65.31 NDS and 57.4 mAP on the nuScenes test split, and 56.2 NDS and 46.2 mAP on the val split, with a real-time inference speed of 33 FPS.
Conclusion: DySS outperforms existing methods in both accuracy and efficiency and suits autonomous driving perception tasks.
Abstract: Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
[17] HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Eunkyu Park,Minyeong Kim,Gunhee Kim
Main category: cs.CV
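TL;DR: HalLoc is a dataset for efficient, probabilistic hallucination detection, containing 150K token-level annotated samples across visual question answering, instruction-following, and image captioning tasks.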
Details
Motivation: To address the high computational cost and latency of existing hallucination detection methods and their inability to handle unclear boundaries between hallucinated and truthful content.
Method: Proposes the HalLoc dataset and develops a low-overhead baseline model that detects hallucinations concurrently during generation.
Result: The HalLoc dataset and baseline model provide an efficient solution for hallucination detection and can be integrated seamlessly into existing vision-language models.
Conclusion: HalLoc opens new avenues for improving the reliability of vision-language models; its dataset and code are publicly available.
Abstract: Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.
[18] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation
Hamzeh Asgharnezhad,Pegah Tabarisaadi,Abbas Khosravi,Roohallah Alizadehsani,U. Rajendra Acharya
Main category: cs.CV
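TL;DR: This study evaluates deep learning-based skin lesion classification that combines transfer learning with uncertainty quantification, finding that CLIP-based vision transformers perform best and that ensemble methods strike a balance between accuracy and uncertainty handling.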
Details
Motivation: Accurate and reliable skin cancer diagnosis is critical for early treatment, but existing deep learning models are limited by data scarcity and a lack of uncertainty awareness.
Method: The study has two phases: 1) comparing several pre-trained feature extractors paired with traditional classifiers; 2) adding uncertainty quantification methods (such as MCD and ensembles) to assess model reliability.
Result: CLIP-based vision transformers (for example LAION CLIP ViT-H/14 with SVM) perform best; ensemble methods balance accuracy and uncertainty handling.
Conclusion: Integrating uncertainty quantification into medical diagnosis improves model performance and trustworthiness, which matters for clinical use.
Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, but their performance can be limited by data scarcity and a lack of uncertainty awareness. In this study, we present a comprehensive evaluation of DL-based skin lesion classification using transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the first phase, we benchmarked several pre-trained feature extractors (including Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50 (ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual Geometry Group network (VGG16), and EfficientNet-V2-Large) combined with a range of traditional classifiers such as Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and logistic regression. Our results show that CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, deliver the highest classification performance. In the second phase, we incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) to assess not only prediction accuracy but also the reliability of model outputs. We evaluated these models using uncertainty-aware metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen), uncertainty specificity (USpe), and uncertainty precision (UPre). The results demonstrate that ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions.
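The Monte Carlo Dropout step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the linear classifier, the random features, the dropout rate, and the function name `mc_dropout_predict` are all assumptions made for illustration. The idea shown is the general one: keep dropout active at inference, run several stochastic forward passes, average the class probabilities, and use the entropy of the average as an uncertainty score.

```python
import numpy as np

def mc_dropout_predict(x, W, b, n_passes=50, drop_rate=0.5, seed=0):
    """Average softmax outputs over stochastic forward passes (dropout kept on).

    Returns the mean class probabilities and the predictive entropy per sample.
    The single linear layer stands in for a real classifier head (assumption).
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        # Inverted dropout applied to the input features on every pass.
        mask = rng.random(x.shape) >= drop_rate
        x_drop = x * mask / (1.0 - drop_rate)
        logits = x_drop @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        probs.append(p / p.sum(axis=1, keepdims=True))
    mean_p = np.mean(probs, axis=0)
    # Predictive entropy: higher values mean the passes disagree more.
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12), axis=1)
    return mean_p, entropy

# Hypothetical data: 4 samples with 8 features, 7 classes as in HAM10000.
rng = np.random.default_rng(1)
features = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 7))
b = np.zeros(7)
mean_p, entropy = mc_dropout_predict(features, W, b)
```

A prediction could then be flagged as uncertain when its entropy exceeds a chosen threshold; the threshold is a design choice the abstract does not specify.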
This study highlights the importance of integrating UQ into DL-based medical diagnosis to enhance both performance and trustworthiness in real-world clinical applications.
[19] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework
Sadia Kamal,Tim Oates,Joy Wan
Main category: cs.CV
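TL;DR: Proposes a weakly supervised multimodal framework that generates structured SOAP notes from limited inputs (such as lesion images and sparse clinical text), reducing reliance on manual annotation while matching the performance of mainstream models.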
Details
Motivation: Skin cancer is the most common cancer worldwide; manually writing SOAP notes is time-consuming and contributes to clinician burnout, so an automated solution is needed.
Method: A weakly supervised multimodal framework combines image and text inputs to reduce annotation requirements.
Result: Performance is comparable to models such as GPT-4o, and MedConceptEval and CCS are introduced to evaluate clinical quality.
Conclusion: The approach is scalable, eases clinician burden, and reduces dependence on large annotated datasets.
Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.
[20] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video
Fei Zhao,Da Pan,Zelu Qi,Ping Shi
Main category: cs.CV
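TL;DR: To address the lack of research on audio-visual quality assessment (AVQA) for user-generated omnidirectional video (ODV) in the Metaverse, the paper builds a UGC omnidirectional A/V dataset and proposes a baseline model.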
Details
Motivation: With the rise of the Metaverse, omnidirectional video (ODV) is shifting from professionally generated content (PGC) to user-generated content (UGC), yet research on its audio-visual quality assessment (AVQA) remains limited.
Method: Builds a UGC omnidirectional A/V dataset of 300 videos covering 10 scene types and obtains MOS ratings through a subjective experiment, then proposes a baseline model with video feature extraction, audio feature extraction, and audio-visual fusion modules.
Result: Experiments show that the proposed baseline model performs best on the dataset.
Conclusion: The study fills a gap in UGC-ODV AVQA and provides an effective dataset and baseline model for future research.
Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset, which consists of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.
[21] Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions
Deliang Wang,Chao Yang,Gaowei Chen
Main category: cs.CV
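TL;DR: This study explores the potential of vision-language models (VLMs) to analyze students' academic emotions with zero-shot prompting, finding that Qwen2.5-VL-7B-Instruct outperforms Llama-3.2-11B-Vision-Instruct, particularly in recognizing confusion.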
Details
Motivation: Traditional supervised learning methods generalize poorly and require repeated data collection and training, whereas VLMs offer an alternative that needs no fine-tuning.
Method: Two VLMs (Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct) analyze 5,000 images of students' facial expressions using zero-shot prompting.
Result: Qwen2.5-VL-7B-Instruct performs better, particularly in recognizing confusion; both models are good at recognizing happiness but fail to detect distracted behavior.
Conclusion: VLMs show potential for academic emotion analysis; Qwen2.5-VL-7B-Instruct in particular stands out at recognizing confusion and is suited to practical use.
Abstract: Students' academic emotions significantly influence their social behavior and learning performance. Traditional approaches to automatically and accurately analyze these emotions have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students' academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel in identifying students' happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students' confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.
[22] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting
Lintao Xiang,Hongpei Zheng,Yating Huang,Qijun Yang,Hujun Yin
Main category: cs.CV
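TL;DR: Proposes a point-wise feature-aware Gaussian splatting framework that addresses the overfitting of 3D Gaussian splatting under sparse-view input and achieves real-time, high-quality rendering.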
Details
Motivation: Existing 3D Gaussian splatting methods require many calibrated views and tend to overfit with sparse input, degrading rendering quality.
Method: A stereo foundation model estimates camera poses and a dense point cloud; multiscale 2D feature sampling and a self-attention mechanism strengthen interaction between points; MLPs then decode the Gaussian parameters.
Result: Significantly outperforms NeRF-based methods on diverse benchmarks and is competitive with state-of-the-art 3DGS methods under few-shot settings.
Conclusion: The method effectively improves rendering quality under sparse views and offers a new direction for practical use of 3D Gaussian splatting.
Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training views, leading to noticeable degradation in rendering quality. To address this limitation, we propose a Point-wise Feature-Aware Gaussian Splatting framework that enables real-time, high-quality rendering from sparse training views. Specifically, we first employ the latest stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization. We then encode the colour attributes of each 3D Gaussian by sampling and aggregating multiscale 2D appearance features from sparse inputs. To enhance point-wise appearance representation, we design a point interaction network based on a self-attention mechanism, allowing each Gaussian point to interact with its nearest neighbors. These enriched features are subsequently decoded into Gaussian parameters through two lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive experiments on diverse benchmarks demonstrate that our method significantly outperforms NeRF-based approaches and achieves competitive performance under few-shot settings compared to the state-of-the-art 3DGS methods.
[23] GeoCAD: Local Geometry-Controllable CAD Generation
Zhanwei Zhang,Kaiyuan Liu,Junjie Liu,Wenxiao Wang,Binbin Lin,Liang Xie,Chen Shen,Deng Cai
Main category: cs.CV
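TL;DR: GeoCAD is a user-friendly, local geometry-controllable CAD generation method that uses a complementary captioning strategy and LLM prediction to make high-quality, valid local modifications.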
Details
Motivation: Existing methods cannot both follow textual instructions and make local modifications; GeoCAD aims to solve this.
Method: A complementary captioning strategy (vertex-based and VLLM-based) generates geometric instructions; during training, a local part is randomly masked and predicted with an LLM.
Result: Experiments show GeoCAD performs well in generation quality, validity, and text-to-CAD consistency.
Conclusion: GeoCAD offers an efficient solution for local geometry-controllable CAD generation; the code will be released.
Abstract: Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption approximately 221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at https://github.com/Zhanwei-Z/GeoCAD.
[24] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models
Jun Yin,Jing Zhong,Peilin Li,Pengyu Zeng,Miao Zhang,Ran Luo,Shuai Lu
Main category: cs.CV
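TL;DR: The paper proposes a multimodal research framework based on vision-language models for automated, scalable analysis of differences in urban streetscape style, and validates it experimentally.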
Details
Motivation: Urban culture and architectural style differ markedly between cities because of geographical, historical, and socio-political factors. Conventional research relies on expert interpretation and historical documentation, which are hard to standardize.
Method: Builds the UrbanDiffBench dataset and develops the UrbanSense framework, which uses vision-language models to quantitatively generate and compare urban style representations.
Result: Over 80% of generated descriptions pass the t-test (p < 0.05), and high Phi scores from subjective evaluation (0.912 for cities, 0.833 for periods) validate the method.
Conclusion: The method can quantify and interpret the evolution of urban style, giving future design a scientific basis.
Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method's ability to capture subtle stylistic differences.
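For readers unfamiliar with the Phi score cited here, the sketch below computes the standard phi coefficient for a 2x2 contingency table. It assumes the abstract's "Phi scores" refer to this statistic, which the abstract does not state, and the counts are hypothetical; they are not data from the paper.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]].

    phi = (a*d - b*c) / sqrt((a+b) * (c+d) * (a+c) * (b+d))
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom

# Hypothetical counts: rows are the true city, columns the city raters picked
# from a generated description.
phi = phi_coefficient(48, 2, 3, 47)
```

Values near 1 indicate that rater choices track the true label closely, and values near 0 indicate no association.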
These results highlight the method's potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.
[25] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration
Mina C. Moghadam,Alan Q. Wang,Omer Taub,Martin R. Prince,Mert R. Sabuncu
Main category: cs.CV
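TL;DR: The paper proposes RealKeyMorph (RKM), a resolution-agnostic medical image registration method that avoids the artifacts conventional methods introduce through resampling.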
Details
Motivation: Existing machine learning registration methods resample images to a fixed resolution, which can introduce artifacts and limits practical use.
Method: RKM extends the KeyMorph framework by outputting keypoints in real-world coordinates, using the scanner's affine matrix to make registration resolution-agnostic.
Result: Experiments show RKM's advantages in registering orthogonal 2D stacks of abdominal MRIs and 3D brain datasets of varying resolution.
Conclusion: By avoiding resampling, RKM offers a better solution for medical image registration.
Abstract: Many real-world settings require registration of a pair of medical images that differ in spatial resolution, which may arise from differences in image acquisition parameters like pixel spacing, slice thickness, and field-of-view. However, all previous machine learning-based registration techniques resample images onto a fixed resolution. This is suboptimal because resampling can introduce artifacts due to interpolation. To address this, we present RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is an extension of KeyMorph, a registration framework which works by training a network to learn corresponding keypoints for a given pair of images, after which a closed-form keypoint matching step is used to derive the transformation that aligns them. To avoid resampling and enable operating on the raw data, RKM outputs keypoints in real-world coordinates of the scanner. To do this, we leverage the affine matrix produced by the scanner (e.g., MRI machine) that encodes the mapping from voxel coordinates to real world coordinates. By transforming keypoints into real-world space and integrating this into the training process, RKM effectively enables the extracted keypoints to be resolution-agnostic. In our experiments, we demonstrate the advantages of RKM on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as 3D volumes with varying resolutions in brain datasets.
[26] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation
Runqi Ouyang,Haoyun Li,Zhenyuan Zhang,Xiaofeng Wang,Zheng Zhu,Guan Huang,Xingang Wang
Main category: cs.CV
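TL;DR: Motion-R1 is a motion-language modeling framework with a Chain-of-Thought mechanism that decomposes complex textual instructions into logical action paths, improving semantic understanding and execution in motion generation.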
Details
Motivation: Existing text-to-motion methods have advanced semantic alignment and motion synthesis but fail to capture deep linguistic structure and logical reasoning, so generated motions lack controllability, consistency, and diversity.
Method: Proposes the Motion-R1 framework, which integrates a Chain-of-Thought mechanism to decompose textual instructions into logical action paths, and uses Group Relative Policy Optimization to jointly optimize reasoning chains and motion synthesis.
Result: On multiple benchmark datasets, Motion-R1 matches or outperforms existing methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence.
Conclusion: Structured reasoning markedly improves the quality and diversity of generated motion; the code and model will be released.
Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
[27] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device
Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
Main category: cs.CV
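TL;DR: FaceLiVT is a lightweight yet high-performing face recognition model that combines a CNN-Transformer architecture with a novel lightweight Multi-Head Linear Attention (MHLA) mechanism, reducing computational complexity while maintaining high accuracy.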
Details
Motivation: Meets the need for real-time face recognition on resource-constrained platforms, improving performance through a lightweight, efficient design.
Method: A hybrid CNN-Transformer architecture combined with the MHLA mechanism and a reparameterized token mixer improves computational efficiency.
Result: Outperforms existing lightweight models on multiple benchmarks with much faster inference (8.6 times faster than EdgeFace and 21.2 times faster than a pure ViT model).
Conclusion: FaceLiVT provides an efficient, practical solution for real-time face recognition on resource-constrained platforms.
Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6 times faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2 times faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.
[28] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion
Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui,Yuhan Lyu
Main category: cs.CV
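TL;DR: Proposes FSATFusion, an infrared and visible image fusion network that improves global context capture through a frequency-spatial attention Transformer (FSAT) module and an improved Transformer module (ITM); experiments show it outperforms existing methods.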
Details
Motivation: Existing deep learning methods for infrared and visible image fusion lose information because convolution cannot fully capture global context, which limits fusion performance.
Method: Designs a frequency-spatial attention Transformer (FSAT) module and an improved Transformer module (ITM) to strengthen feature extraction.
Result: Experiments show FSATFusion outperforms other state-of-the-art methods in fusion quality and efficiency, generalizes well, and benefits downstream vision tasks.
Conclusion: FSATFusion's module design markedly improves infrared and visible image fusion and has broad application potential.
Abstract: Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications, to verify the excellent generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.
[29] Revisiting Transformers with Insights from Image Filtering
Laziz U. Abdullaev,Maksim Tkachenko,Tan M. Nguyen
Main category: cs.CV
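TL;DR: Explains self-attention and its components (such as positional encoding and residual connections) through an image processing framework and proposes two modifications that improve both interpretability and model performance.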
Details
Motivation: Self-attention is successful but lacks theoretical explanation, and existing frameworks do not give a deep understanding of the role of its components.
Method: Develops a unifying image processing framework that explains self-attention and its components, and introduces two modifications.
Result: The modifications improve interpretability and also raise accuracy and robustness on language and vision tasks.
Conclusion: The image processing framework gives self-attention a theoretical basis, and the modifications show practical potential.
Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks as well as better long sequence understanding.
[30] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial
Jerry Yan,Chinmay Talegaonkar,Nicholas Antipa,Eric Terrill,Sophia Merrifield
Main category: cs.CV
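TL;DR: The paper proposes PoseIDON, a computer vision method that combines deep foundation model features with multiview photogrammetry to estimate the six-degrees-of-freedom pose and burial depth of seafloor objects; validation shows a mean error of about 10 cm.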
Details
Motivation: Addresses the difficulty of accurately estimating the burial depth of seafloor objects from remote imagery, in order to assess ecological risk, pollutant transport, and the viability of recovery strategies for hazardous materials.
Method: The PoseIDON pipeline combines deep foundation model features with multiview photogrammetry and infers burial depth through CAD model alignment and local plane fitting.
Result: In validation on 54 objects, the mean burial depth error is about 10 cm, and the method reveals spatial burial patterns related to sediment transport processes.
Conclusion: The method provides a scalable approach to non-invasive mapping of seafloor burial and supports environmental assessment at contaminated sites.
Abstract: The burial state of anthropogenic objects on the seafloor provides insight into localized sedimentation dynamics and is also critical for assessing ecological risks, potential pollutant transport, and the viability of recovery or mitigation strategies for hazardous materials such as munitions. Accurate burial depth estimation from remote imagery remains difficult due to partial occlusion, poor visibility, and object degradation. This work introduces a computer vision pipeline, called PoseIDON, which combines deep foundation model features with multiview photogrammetry to estimate six degrees of freedom object pose and the orientation of the surrounding seafloor from ROV video. Burial depth is inferred by aligning CAD models of the objects with observed imagery and fitting a local planar approximation of the seafloor. The method is validated using footage of 54 objects, including barrels and munitions, recorded at a historic ocean dumpsite in the San Pedro Basin. The model achieves a mean burial depth error of approximately 10 centimeters and resolves spatial burial patterns that reflect underlying sediment transport processes. This approach enables scalable, non-invasive mapping of seafloor burial and supports environmental assessment at contaminated sites.
[31] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba
Shicheng Yin,Kaixuan Yin,Yang Liu,Weixing Chen,Liang Lin
Main category: cs.CV
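TL;DR: The paper proposes the Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes, addressing the over-encoding of background regions and the loss of local detail caused by fixed-size patches.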
Details
Motivation: Existing non-convolutional models (such as ViT and Vim) rely on fixed-size patches, which over-encodes background regions and loses critical local detail.
Method: DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich regions.
Result: DART improves accuracy by 2.1% on DeiT (ImageNet-1K) and achieves a 45% FLOPs reduction relative to uniformly increasing token density.
Conclusion: DART consistently improves performance across models with minimal or even reduced computational overhead.
Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
[32] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion
Yuanyi Song,Pumeng Lyu,Ben Fei,Fenghua Ling,Wanli Ouyang,Lei Bai
Main category: cs.CV
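TL;DR: The paper proposes ReconMOST, a framework that uses a data-driven diffusion model for multi-layer sea temperature reconstruction, addressing the sparse data and high computational cost of conventional methods and maintaining high accuracy with more than 92.5% of data missing.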
Details
Motivation: Accurate ocean temperature reconstruction is essential for studying global climate dynamics and marine meteorology, but conventional methods are limited by sparse data and computational complexity, and machine learning methods are mostly confined to the sea surface or local regions.
Method: An unconditional diffusion model is pre-trained to learn the physically consistent distribution of historical numerical simulation data; during generation, high-accuracy observations guide the reverse diffusion process to produce accurate reconstructions.
Result: In experiments on CMIP6 and EN4 data, the MSE is 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, supporting the method's effectiveness and robustness.
Conclusion: ReconMOST extends machine learning for ocean temperature reconstruction with high accuracy, high resolution, and strong generalization, offering a new solution for global multi-layer reconstruction.
Abstract: Accurate reconstruction of ocean temperature is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the growing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based SST reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data.
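The abstract reports separate MSE values for guidance, reconstruction, and total. A plausible reading, which is an assumption here and not something the abstract states, is that guidance error is measured at observed grid points, reconstruction error at unobserved points, and total error over the whole field. Under that reading the total is the coverage-weighted average of the other two, and the reported figures are consistent with it: with 7.5% of points observed, 0.075 × 0.049 + 0.925 × 0.680 ≈ 0.633. The sketch below uses synthetic data and a hypothetical `split_mse` helper to show the computation.

```python
import numpy as np

def split_mse(pred, truth, observed_mask):
    """Return (guidance, reconstruction, total) mean squared errors.

    guidance: error at observed points (where guidance data exist)
    reconstruction: error at unobserved points
    total: error over the full field
    """
    sq_err = (pred - truth) ** 2
    guidance = sq_err[observed_mask].mean()
    reconstruction = sq_err[~observed_mask].mean()
    total = sq_err.mean()
    return guidance, reconstruction, total

# Synthetic field with 7.5% of points observed (92.5% missing).
rng = np.random.default_rng(0)
truth = rng.normal(size=(40, 50))
observed_mask = rng.random(truth.shape) < 0.075
# Toy prediction: smaller error where observations guide it (assumption).
noise = rng.normal(size=truth.shape)
pred = truth + np.where(observed_mask, 0.2, 0.8) * noise

g, r, t = split_mse(pred, truth, observed_mask)
frac = observed_mask.mean()
weighted = frac * g + (1 - frac) * r  # equals the total MSE
```

Because observed points are a small share of the field, the total stays close to the reconstruction error, which matches the pattern in the reported numbers.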
The mean squared error (MSE) is 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at https://github.com/norsheep/ReconMOST.
[33] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu,Jiuhai Chen,Zhaojiang Lin,Xichen Pan,Lifu Huang,Tianyi Zhou,Madian Khabsa,Qifan Wang,Di Jin,Michihiro Yasunaga,Lili Yu,Xi Victoria Lin,Shaoliang Nie
Main category: cs.CV
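TL;DR: Pisces is a new multimodal foundation model that performs well on both image understanding and generation through a decoupled visual encoding architecture and tailored training techniques.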
Details
Motivation: Although multimodal foundation models have advanced in image understanding and generation, they still trail specialized models. Pisces aims to address the challenges posed by differences in visual features and training processes.
Method: A decoupled visual encoding architecture and tailored training techniques, combined with data curation, pretraining, and finetuning.
Result: Performs well on more than 20 image understanding benchmarks and on the GenEval generation benchmark.
Conclusion: Pisces demonstrates the synergy between image understanding and generation, advancing unified multimodal models.
Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
[34] It's Not the Target, It's the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations
Guoyi Zhang,Guangsheng Xu,Siyang Chen,Han Wang,Xiaohu Zhang
Main category: cs.CV
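TL;DR: Proposes LRRNet, a new end-to-end infrared small target detection framework that exploits the low-rank property of infrared image backgrounds and directly models structure-aware low-rank background representations through a compression-reconstruction-subtraction paradigm.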
Details
Motivation: Infrared small target detection in complex backgrounds is challenged by low signal-to-clutter ratios, diverse target morphologies, and a lack of visual cues; existing deep learning methods perform unstably because of intrinsic target variability.
Method: A compression-reconstruction-subtraction (CRS) paradigm models low-rank background representations directly in the image domain, without patch-based processing or explicit matrix decomposition.
Result: On multiple public datasets, LRRNet outperforms 38 state-of-the-art methods in detection accuracy, robustness, and computational efficiency, averaging 82.34 FPS.
Conclusion: To the authors' knowledge, LRRNet is the first method to model low-rank background structure directly through end-to-end deep learning, and it is robust in noisy conditions.
Abstract: Infrared small target detection (IRSTD) remains a long-standing challenge in complex backgrounds due to low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. While recent deep learning approaches aim to learn discriminative representations, the intrinsic variability and weak priors of small targets often lead to unstable performance. In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet, which leverages the low-rank property of infrared image backgrounds. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression-reconstruction-subtraction (CRS) paradigm to directly model structure-aware low-rank background representations in the image domain, without relying on patch-based processing or explicit matrix decomposition. To the best of our knowledge, this is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency. Remarkably, it achieves real-time performance with an average speed of 82.34 FPS. Evaluations on the challenging NoisySIRST dataset further confirm the model's resilience to sensor noise. The source code will be made publicly available upon acceptance.
[35] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment
Shuo wang,Jihao Zhang
Main category: cs.CV
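TL;DR: MF2Summ is a video summarization model based on multimodal content understanding; it combines visual and auditory information in a five-stage pipeline and reports higher F1-scores than the DSNet baseline.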
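The abstract for this entry names Non-Maximum Suppression (NMS) as part of key shot selection. A minimal sketch of greedy NMS over one-dimensional temporal segments follows; the `(start, end)` segment format, the 0.5 threshold, and the function names are assumptions for illustration, not details taken from MF2Summ.

```python
def temporal_iou(a, b):
    """IoU of two 1-D segments given as (start, end) with start < end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: visit segments by descending score, drop heavy overlaps."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return [segments[i] for i in kept]

# Hypothetical predicted segments (in frames) with importance scores.
segments = [(0, 10), (2, 12), (20, 30)]
scores = [0.9, 0.8, 0.7]
selected = temporal_nms(segments, scores)
print(selected)
```

Here the second segment overlaps the first with an IoU of 8/12, so it is suppressed, while the third segment does not overlap anything and is kept.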
Details
Motivation: Online video content is growing rapidly, and traditional single-modality methods struggle to capture the full semantics of a video, so multimodal approaches are needed to improve summarization.
Method: MF2Summ uses a five-stage pipeline: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. It combines visual (GoogLeNet) and auditory (SoundNet) features and uses Transformers to model inter-modal dependencies.
Result: On the SumMe and TVSum datasets, MF2Summ improves F1-scores over DSNet by 1.9% and 0.6% respectively and compares favorably with other state-of-the-art methods.
Conclusion: Multimodal fusion improves video summarization performance, supporting the effectiveness of cross-modal methods.
Abstract: The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
[36] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
Guowei Zhong,Ruohong Huan,Mingzhen Wu,Ronghua Liang,Peng Chen
Main category: cs.CV
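TL;DR: Proposes CIDer, a robust multimodal emotion recognition framework that addresses missing modalities and out-of-distribution data through self-distillation and causal inference modules.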
Details
Motivation: Existing methods handle missing modalities and out-of-distribution data poorly: they rely on specific models or introduce excessive parameters, which limits their practicality.
Method: CIDer comprises a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module, which improve robustness and generalization through self-distillation and a causal graph, respectively.
Result: Experiments show that CIDer performs robustly in both RMFM and OOD scenarios, with fewer parameters and faster training.
Conclusion: CIDer offers an efficient and robust solution for multimodal emotion recognition; the code is publicly available.
Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
[37] Rethinking Generative Human Video Coding with Implicit Motion Transformation
Bolin Chen,Ru-Ling Liao,Jie Chen,Yan Ye
Main category: cs.CV
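TL;DR: Implicit Motion Transformation (IMT) improves generative human body video compression, addressing the distortion that explicit-motion methods suffer under complex motion patterns.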
Details
Motivation: Explicit-motion methods cause reconstruction distortion and inaccurate motion in human body video compression because body motion patterns are complex and diverse, so a more effective approach is needed.
Method: Proposes IMT, which characterizes complex human body signals as compact visual features and transforms them into implicit motion guidance for high-quality reconstruction.
Result: Experiments show that IMT enables Generative Human Video Coding (GHVC) to achieve high-efficiency compression and high-fidelity synthesis.
Conclusion: IMT offers an effective solution for compressing video with complex motion patterns, with advantages in both efficiency and fidelity.
Abstract: Beyond traditional hybrid-based video codec, generative video codec could achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns, i.e., when using explicit motion guidance for Generative Human Video Coding (GHVC), the reconstruction results could suffer severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation, namely IMT. In particular, we propose to characterize complex human body signal into compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency compression and high-fidelity synthesis.
[38] Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Intermediate Feature Distance
Chun Liu,Bingqian Zhu,Tao Xu,Zheng Zheng,Zheng Li,Wei Yang,Zhigang Han,Jiayao Wang
Main category: cs.CV
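TL;DR: Proposes a new method to enhance the transferability of adversarial examples against hyperspectral image classification models, using random block division and a feature distance loss to improve attack effectiveness.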
Details
Motivation: Because hyperspectral images (HSIs) are high-dimensional and rich in spectral information, research on HSI adversarial examples is limited and struggles to fully exploit image structure and feature information; existing methods fall short on transferability.
Method: Randomly divides the image into blocks in both spatial and spectral dimensions and applies transformations block by block to increase input diversity; designs a feature distancing loss on intermediate layers as the primary loss, with the output-layer prediction as an auxiliary loss, to guide perturbation generation.
Result: The generated adversarial examples transfer effectively to black-box models on two public HSI datasets and retain their attack performance under defense strategies.
Conclusion: The method improves the transferability and robustness of adversarial examples and offers a new perspective on the security of HSI classification.
Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose security challenges to hyperspectral image (HSI) classification technologies based on DNNs. In the domain of natural images, numerous transfer-based adversarial attack methods have been studied. However, HSIs differ from natural images due to their high-dimensional and rich spectral information. Current research on HSI adversarial examples remains limited and faces challenges in fully utilizing the structural and feature information of images. To address these issues, this paper proposes a novel method to enhance the transferability of the adversarial examples for HSI classification models. First, while keeping the image structure unchanged, the proposed method randomly divides the image into blocks in both spatial and spectral dimensions. Then, various transformations are applied on a block by block basis to increase input diversity and mitigate overfitting. Second, a feature distancing loss targeting intermediate layers is designed, which measures the distance between the amplified features of the original examples and the features of the adversarial examples as the primary loss, while the output layer prediction serves as the auxiliary loss. This guides the perturbation to disrupt the features of the true class in adversarial examples, effectively enhancing transferability. Extensive experiments demonstrate that the adversarial examples generated by the proposed method achieve effective transferability to black-box models on two public HSI datasets. Furthermore, the method maintains robust attack performance even under defense strategies.
[39] Starting Positions Matter: A Study on Better Weight Initialization for Neural Network Quantization
Stone Yun,Alexander Wong
Main category: cs.CV
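TL;DR: Studies how weight initialization affects the quantization robustness of deep neural networks (DNNs) and proposes GHN-QAT, a Graph Hypernetwork (GHN) based method that significantly improves the accuracy of quantized models.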
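Why the distribution of weights might matter for quantization can be seen in a toy setting. The sketch below is a generic illustration, not the paper's experiment: it assumes symmetric uniform quantization with a single per-tensor scale, and the helper names and weight values are hypothetical.

```python
def fake_quantize(weights, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Evenly spread weights in [-0.5, 0.5] (assumed toy data).
spread = [i / 100 - 0.5 for i in range(101)]
# Same weights plus one large outlier, which stretches the quantization scale.
with_outlier = spread + [5.0]

err = {bits: mse(spread, fake_quantize(spread, bits)) for bits in (8, 4, 2)}
err_outlier_4bit = mse(with_outlier, fake_quantize(with_outlier, 4))
print(err, err_outlier_4bit)
```

The error grows as the bit width shrinks, and a single outlier inflates the 4-bit error because the step size is set by the largest magnitude. Weight sets that avoid such outliers lose less under the same quantizer, which is one intuition for why starting conditions could influence quantization robustness.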
Details
Motivation: Quantization is an important tool for reducing the inference cost of machine learning models, yet little work has examined how weight initialization affects quantization robustness.
Method: Analyzes how different weight initialization methods affect the quantization robustness of CNNs, and proposes using GHNs to predict the parameters of quantized DNNs (GHN-QAT).
Result: GHN-QAT yields significant accuracy improvements at 4-bit quantization and better-than-random accuracy at 2 bits.
Conclusion: GHN-QAT offers a new approach to quantized DNN model design; future work could use GHN-QAT-initialized parameters for quantization-aware training.
Abstract: Deep neural network (DNN) quantization for fast, efficient inference has been an important tool in limiting the cost of machine learning (ML) model inference. Quantization-specific model development techniques such as regularization, quantization-aware training, and quantization-robustness penalties have served to greatly boost the accuracy and robustness of modern DNNs. However, very little exploration has been done on improving the initial conditions of DNN training for quantization. Just as random weight initialization has been shown to significantly impact test accuracy of floating point models, it would make sense that different weight initialization methods impact quantization robustness of trained models. We present an extensive study examining the effects of different weight initializations on a variety of CNN building blocks commonly used in efficient CNNs. This analysis reveals that even with varying CNN architectures, the choice of random weight initializer can significantly affect final quantization robustness. Next, we explore a new method for quantization-robust CNN initialization: using Graph Hypernetworks (GHN) to predict parameters of quantized DNNs. Besides showing that GHN-predicted parameters are quantization-robust after regular float32 pretraining (of the GHN), we find that finetuning GHNs to predict parameters for quantized graphs (which we call GHN-QAT) can further improve quantized accuracy of CNNs. Notably, GHN-QAT shows significant accuracy improvements for even 4-bit quantization and better-than-random accuracy for 2-bits. To the best of our knowledge, this is the first in-depth study on quantization-aware DNN weight initialization. GHN-QAT offers a novel approach to quantized DNN model design. Future investigations, such as using GHN-QAT-initialized parameters for quantization-aware training, can further streamline the DNN quantization process.
[40] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models
Yu Huang,Zelin Peng,Yichen Zhao,Piao Yang,Xiaokang Yang,Wei Shen
Main category: cs.CV
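TL;DR: Introduces medical image reasoning segmentation, a new task, and MedSeg-R, a framework that uses the reasoning ability of multimodal large language models (MLLMs) to produce precise segmentation masks from complex clinical questions.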
Details
Motivation: Existing medical image segmentation models rely on explicit instructions and lack active reasoning ability, which limits their use in automatic diagnosis.
Method: Proposes the MedSeg-R framework, consisting of a global context understanding module and a pixel-level grounding module, together with the MedSeg-QA dataset.
Result: Experiments show that MedSeg-R performs strongly on several benchmarks, with high segmentation accuracy and support for interpretable textual analysis.
Conclusion: MedSeg-R combines reasoning with segmentation, offering a new approach to medical image segmentation with potential for clinical use.
Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also capable of producing corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R's superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.
[41] LLMs Are Not Yet Ready for Deepfake Image Detection
Shahroz Tariq,David Nguyen,M. A. P. Chamikara,Tingmin Wu,Alsharif Abuadbba,Kristen Moore
Main category: cs.CV
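TL;DR: Evaluates four vision-language models (VLMs) in a zero-shot setting on three types of deepfakes and finds that, although they generate plausible explanations, they are not yet reliable detectors; they can still serve as assistive tools.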
Details
Motivation: Deepfakes are growing more sophisticated and threaten media integrity and public trust, while VLMs, which look promising across many domains, are being explored for deepfake detection.
Method: A structured zero-shot evaluation of four VLMs (ChatGPT, Claude, Gemini, Grok) on a benchmark of authentic and manipulated images, analyzing classification accuracy and reasoning depth.
Result: VLMs can produce coherent explanations and detect surface-level anomalies, but they are not dependable enough and are easily misled by stylistic elements; their strengths lie in interpretability and contextual analysis.
Conclusion: VLMs cannot yet be used as standalone deepfake detectors but show promise within hybrid or human-in-the-loop frameworks.
Abstract: The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model's classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.
[42] Sheet Music Benchmark: Standardized Optical Music Recognition Evaluation
Juan C. Martinez-Sevilla,Joan Cerveto-Serrano,Noelia Luna,Greg Chapman,Craig Sapp,David Rizo,Jorge Calvo-Zaragoza
Main category: cs.CV
TL;DR: Introduces the Sheet Music Benchmark (SMB) dataset and the OMR-NED evaluation metric to improve Optical Music Recognition (OMR) research.
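The abstract for this entry says OMR-NED builds on the Symbol Error Rate (SER). The sketch below shows only a generic SER, computed as a Levenshtein distance over symbol sequences divided by the reference length; it is not the OMR-NED definition, and the token strings are hypothetical stand-ins rather than real Humdrum **kern symbols.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference symbol
                            curr[j - 1] + 1,      # insert a hypothesis symbol
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def symbol_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical token sequences: the recognizer dropped one note.
reference = ["clef-G2", "note-C4", "note-D4", "note-E4"]
predicted = ["clef-G2", "note-C4", "note-E4"]
ser = symbol_error_rate(reference, predicted)
print(ser)
```

A fine-grained metric in the spirit described by the abstract would additionally break such errors down by element type (note heads, beams, pitches, accidentals) instead of reporting a single rate.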
Details
Motivation: Addresses the lack of standardized datasets and evaluation metrics in OMR research.
Method: Creates the SMB dataset and the OMR-NED metric and runs baseline experiments.
Result: SMB and OMR-NED give OMR research a standardized benchmark and a fine-grained error analysis.
Conclusion: The work fills a gap in OMR evaluation and supports comparing and selecting better methods.
Abstract: In this work, we introduce the Sheet Music Benchmark (SMB), a dataset of six hundred and eighty-five pages specifically designed to benchmark Optical Music Recognition (OMR) research. SMB encompasses a diverse array of musical textures, including monophony, pianoform, quartet, and others, all encoded in Common Western Modern Notation using the Humdrum **kern format. Alongside SMB, we introduce the OMR Normalized Edit Distance (OMR-NED), a new metric tailored explicitly for evaluating OMR performance. OMR-NED builds upon the widely-used Symbol Error Rate (SER), offering a fine-grained and detailed error analysis that covers individual musical elements such as note heads, beams, pitches, accidentals, and other critical notation features. The resulting numeric score provided by OMR-NED facilitates clear comparisons, enabling researchers and end-users alike to identify optimal OMR approaches. Our work thus addresses a long-standing gap in OMR evaluation, and we support our contributions with baseline experiments using standardized SMB dataset splits for training and assessing state-of-the-art methods.
[43] Class-Incremental Learning for Honey Botanical Origin Classification with Hyperspectral Images: A Study with Continual Backpropagation
Guyang Zhang,Waleed Abdulla
Main category: cs.CV
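TL;DR: Studies how to distinguish the botanical origin of honey and proposes combining class-incremental learning (CIL) with a continual backpropagation (CB) algorithm to improve performance.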
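The abstract for this entry describes CB as reinitializing a proportion of less-used hidden neurons. A minimal sketch of that selective reset follows. The utility scores, the uniform reinitialization range, and the choice to zero the outgoing weights are assumptions made for illustration, not details confirmed by the abstract.

```python
import random

def reinit_low_utility(w_in, w_out, utility, fraction, rng):
    """Reinitialize the `fraction` of hidden units with the lowest utility.

    w_in[h]  : incoming weights of hidden unit h
    w_out[h] : outgoing weights of hidden unit h
    Returns the sorted indices of the units that were reset.
    """
    n_hidden = len(utility)
    n_reset = int(fraction * n_hidden)
    lowest = sorted(range(n_hidden), key=lambda h: utility[h])[:n_reset]
    for h in lowest:
        # Fresh random incoming weights restore the unit's ability to learn.
        w_in[h] = [rng.uniform(-0.1, 0.1) for _ in w_in[h]]
        # Zeroed outgoing weights (assumed) keep the reset from disturbing outputs.
        w_out[h] = [0.0 for _ in w_out[h]]
        utility[h] = 0.0
    return sorted(lowest)

rng = random.Random(0)
# Toy layer (assumed): 4 hidden units, 3 inputs, 2 outputs.
w_in = [[1.0, 1.0, 1.0] for _ in range(4)]
w_out = [[1.0, 1.0] for _ in range(4)]
utility = [0.9, 0.05, 0.6, 0.01]

reset_units = reinit_low_utility(w_in, w_out, utility, fraction=0.5, rng=rng)
print(reset_units)
```

In a class-incremental setting, a step like this would run periodically during training so that rarely used units remain available for classes that arrive later.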
Details
Motivation: The botanical origin of honey affects its flavor and health value, but it is impractical to collect every variety at once to train a model, so class-incremental learning is needed.
Method: Compares several CIL algorithms and proposes combining them with the CB algorithm, which reinitializes less-used hidden neurons to improve performance.
Result: Experiments show that CB improves the performance of most CIL methods by 1-7%.
Conclusion: The proposed CB combination mitigates the loss of plasticity in class-incremental learning and improves the accuracy of honey botanical origin classification.
Abstract: Honey is an important commodity in the global market. Honey types of different botanical origins provide diversified flavors and health benefits, thus having different market values. Developing accurate and effective botanical origin-distinguishing techniques is crucial to protect consumers' interests. However, it is impractical to collect all the varieties of honey products at once to train a model for botanical origin differentiation. Therefore, researchers developed class-incremental learning (CIL) techniques to address this challenge. This study examined and compared multiple CIL algorithms on a real-world honey hyperspectral imaging dataset. A novel technique is also proposed to improve the performance of class-incremental learning algorithms by combining with a continual backpropagation (CB) algorithm. The CB method addresses the issue of loss-of-plasticity by reinitializing a proportion of less-used hidden neurons to inject variability into neural networks. Experiments showed that CB improved the performance of most CIL methods by 1-7%.
[44] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation
Shuyang Li,Shuang Wang,Zhuangzhuang Sun,Jing Xiao
Main category: cs.CV
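TL;DR: The PSLG-SAM framework tackles the dense annotation and complex scene problems of the RRSIS task with a two-stage approach (coarse localization, then fine segmentation), substantially reducing the annotation burden and improving performance.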
Details
Motivation: Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads and face challenges from dense annotation requirements and complex scene interpretation.
Method: Proposes the PSLG-SAM framework, which splits the task into coarse localization (a visual grounding network) and fine segmentation (an enhanced SAM); the second stage can be train-free.
Result: Strong performance on the RRSIS-D and RRSIS-M datasets, surpassing existing methods.
Conclusion: PSLG-SAM addresses the RRSIS problems effectively, reducing annotation requirements and improving segmentation accuracy.
Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, which has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges like dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In coarse localization stage, a visual grounding network roughly locates the text-described object. In fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be train-free, significantly reducing the annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows for focusing on specific region segmentation, avoiding interference from complex scenes. We further contribute a high-quality, multi-category manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art models. Our code will be made publicly available.
[45] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft
Jin Huang,Mingqiang Wei,Zikuan Li,Hangyu Qu,Wei Zhao,Xinyu Bai
Main category: cs.CV
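TL;DR: J-DDL combines 2D images with 3D point clouds and an optimized YOLO architecture to detect and localize surface damage on fighter aircraft efficiently.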
Details
Motivation: Manual inspection of fighter aircraft surfaces falls short in efficiency, consistency, and scalability.
Method: Integrates 2D images and 3D point clouds captured by laser scanners and cameras, and uses an optimized YOLO architecture (with Fasternet blocks, EMA modules, and the Inner-CIOU loss) for damage detection and 3D localization.
Result: Experiments validate J-DDL's effectiveness, and the authors release what they describe as the first publicly available aircraft damage dataset.
Conclusion: J-DDL advances automated aircraft inspection technology.
Abstract: Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.
[46] CogStream: Context-guided Streaming Video Question Answering
Zicheng Zhao,Kangyu Wang,Shijie Li,Rui Qian,Weiyao Lin,Huabin Liu
Main category: cs.CV
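TL;DR: Introduces CogStream, a new task focused on selecting contextual information for streaming video reasoning, and a baseline model, CogReasoner, that handles the task efficiently through visual stream compression and historical dialogue retrieval.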
Details
Motivation: Existing Vid-LLMs feed in all historical context when processing streaming video, which is computationally heavy and may introduce irrelevant information that harms reasoning.
Method: Defines the CogStream task and builds a dataset with a semi-automatic annotation pipeline; the baseline model CogReasoner combines visual stream compression with historical dialogue retrieval.
Result: Experiments show that the method handles context selection in streaming video reasoning effectively.
Conclusion: The CogStream task and the CogReasoner model offer a new approach to streaming video reasoning; the code is to be released.
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.
[47] ALBERT: Advanced Localization and Bidirectional Encoder Representations from Transformers for Automotive Damage Evaluation
Teerapong Panboonyuen
Main category: cs.CV
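TL;DR: ALBERT is an instance segmentation model designed for car damage and part segmentation; using bidirectional encoder representations and advanced localization mechanisms, it distinguishes real from fake damage and segments car parts.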
Details
Motivation: Building an intelligent automotive inspection and assessment system requires accurate identification and classification of car damage and parts.
Method: ALBERT builds on bidirectional encoder representations with advanced localization mechanisms and is trained on a large-scale annotated automotive dataset covering 26 damage types, 7 fake damage variants, and 61 car parts.
Result: The model performs strongly in both segmentation accuracy and damage classification.
Conclusion: ALBERT provides an effective solution for intelligent automotive inspection and assessment applications.
Abstract: This paper introduces ALBERT, an instance segmentation model specifically designed for comprehensive car damage and part segmentation. Leveraging the power of Bidirectional Encoder Representations, ALBERT incorporates advanced localization mechanisms to accurately identify and differentiate between real and fake damages, as well as segment individual car parts. The model is trained on a large-scale, richly annotated automotive dataset that categorizes damage into 26 types, identifies 7 fake damage variants, and segments 61 distinct car parts. Our approach demonstrates strong performance in both segmentation accuracy and damage classification, paving the way for intelligent automotive inspection and assessment applications.
[48] SLICK: Selective Localization and Instance Calibration for Knowledge-Enhanced Car Damage Segmentation in Automotive Insurance
Teerapong Panboonyuen
Main category: cs.CV
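TL;DR: SLICK is a new framework for precise and robust car damage segmentation that uses structural priors and domain knowledge to address real-world inspection challenges.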
Details
Motivation: Car damage segmentation needs to stay precise and robust under occlusion, deformation, and complex scenes.
Method: Five components: Selective Part Segmentation, Localization-Aware Attention blocks, an Instance-Sensitive Refinement head, Cross-Channel Calibration, and a Knowledge Fusion Module.
Result: Superior segmentation performance and practical applicability on large-scale datasets.
Conclusion: SLICK is effective and robust for insurance and automotive inspection workflows.
Abstract: We present SLICK, a novel framework for precise and robust car damage segmentation that leverages structural priors and domain knowledge to tackle real-world automotive inspection challenges. SLICK introduces five key components: (1) Selective Part Segmentation using a high-resolution semantic backbone guided by structural priors to achieve surgical accuracy in segmenting vehicle parts even under occlusion, deformation, or paint loss; (2) Localization-Aware Attention blocks that dynamically focus on damaged regions, enhancing fine-grained damage detection in cluttered and complex street scenes; (3) an Instance-Sensitive Refinement head that leverages panoptic cues and shape priors to disentangle overlapping or adjacent parts, enabling precise boundary alignment; (4) Cross-Channel Calibration through multi-scale channel attention that amplifies subtle damage signals such as scratches and dents while suppressing noise like reflections and decals; and (5) a Knowledge Fusion Module that integrates synthetic crash data, part geometry, and real-world insurance datasets to improve generalization and handle rare cases effectively. Experiments on large-scale automotive datasets demonstrate SLICK's superior segmentation performance, robustness, and practical applicability for insurance and automotive inspection workflows.
[49] ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025
Jing He,Yiqing Wang,Lingling Li,Kexin Zhang,Puhua Chen
Main category: cs.CV
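TL;DR: CR-CLIP is an efficient visual-textual multi-instance retrieval model whose cross-modal attention flow module provides bidirectional dynamic interaction and feature refinement, markedly improving retrieval performance.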
Details
Motivation: Visual-textual multi-instance retrieval suffers from insufficient feature interaction; the goal is better semantic alignment and retrieval accuracy.
Method: Builds on the dual-encoder AVION, adds a cross-modal attention flow module for dynamic feature interaction and refinement, and uses Symmetric Multi-Similarity Loss to optimize semantic alignment.
Result: Achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 dataset, significantly outperforming the baseline model.
Conclusion: Feature refinement in CR-CLIP improves cross-modal retrieval performance; the code will be open-sourced.
Abstract: This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval tasks. The approach is based on the dual-encoder AVION, on which we introduce a cross-modal attention flow module to achieve bidirectional dynamic interaction and refinement between visual and textual features to generate more context-aware joint representations. For soft-label relevance matrices provided in tasks such as EPIC-KITCHENS-100, CR-CLIP can work with Symmetric Multi-Similarity Loss to achieve more accurate semantic alignment and optimization using the refined features. Without using ensemble learning, the CR-CLIP model achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, which significantly outperforms the baseline model and fully validates its effectiveness in cross-modal retrieval. The code will be released open-source on https://github.com/delCayr/ContextRefine-Clip
[50] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations
Yutong Zhou,Masahiro Ryo
Main category: cs.CV
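TL;DR: Proposes an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights about its habitat preference.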
Details
Motivation: Understanding species habitat preferences is essential for ecological research and biodiversity conservation, but existing workflows are fragmented and hard for non-specialists to use.
Method: Integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction, then applies causal inference methods to estimate how environmental features influence species occurrence.
Result: Case studies on a bee and a flower species illustrate the framework's potential, producing statistically grounded, human-readable causal explanations.
Conclusion: The framework combines multimodal AI with ecological modeling practice and offers a new way to describe species habitats.
Abstract: Explaining why the species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of the multimodal AI assistant backed up by a recommended ecological modeling practice for describing species habitat in human-understandable language.
[51] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics
Imanol Solano,Julian Fierrez,Aythami Morales,Alejandro Peña,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
Main category: cs.CV
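TL;DR: Proposes CEI, a new metric for detecting subtle demographic bias in face recognition systems, particularly disparities in the tails of score distributions.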
Details
Motivation: Existing metrics struggle to detect subtle demographic bias in high-performance face recognition systems, especially in the tails of the score distribution.
Method: Introduces the CEI metric, which analyzes genuine and impostor score distributions separately, allows a configurable focus on tail probabilities, and also accounts for overall distribution shape. An automated version, CEI^A, is also proposed.
Result: Experiments show that CEI detects subtle biases better than existing methods across a range of datasets and models.
Conclusion: CEI is a sensitive and robust tool for assessing face recognition fairness, and the approach extends to other problems that involve analyzing distribution tails.
Abstract: Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI's superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.
[52] LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System
Hongbeen Park,Minjeong Park,Giljoo Nam,Jinkyu Kim
Main category: cs.CV
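TL;DR: LRSLAM is an efficient visual SLAM model based on low-rank tensor decomposition; it addresses the real-time, memory, and scalability problems of existing methods and outperforms them.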
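The memory argument behind low-rank tensor representations can be shown with the CP (CANDECOMP/PARAFAC) form that the abstract for this entry mentions. The sketch below is a generic CP reconstruction, not LRSLAM's implementation; the grid size, rank, and random factors are assumptions, and the paper's Six-axis decomposition is not covered.

```python
import numpy as np

def cp_reconstruct(a, b, c):
    """Rebuild a 3-D tensor from CP factors: T[i,j,k] = sum_r a[i,r] b[j,r] c[k,r]."""
    return np.einsum("ir,jr,kr->ijk", a, b, c)

rng = np.random.default_rng(0)
size_x, size_y, size_z, rank = 16, 16, 16, 4  # assumed toy dimensions

# One factor matrix per axis, each with `rank` columns.
a = rng.standard_normal((size_x, rank))
b = rng.standard_normal((size_y, rank))
c = rng.standard_normal((size_z, rank))
grid = cp_reconstruct(a, b, c)

dense_params = size_x * size_y * size_z       # parameters in a dense voxel grid
cp_params = rank * (size_x + size_y + size_z)  # parameters in the CP factors
print(grid.shape, dense_params, cp_params)
```

Parameter count grows with the sum of the axis lengths instead of their product, which is why a low-rank factorization can ease the memory growth that dense or plane-based grids face in larger scenes.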
Details
Motivation: Dense visual SLAM struggles with real-time performance, robustness, and scalability to large scenes; existing neural implicit representations are costly in computation and memory, and ESLAM's plane-based tensor decomposition still suffers from memory growth.
Method: LRSLAM uses low-rank tensor decomposition (the Six-axis and CP decompositions) to improve convergence rate, memory efficiency, and reconstruction and localization quality.
Result: Evaluated on a range of indoor RGB-D datasets, LRSLAM performs strongly in parameter efficiency, processing time, and accuracy while retaining reconstruction and localization quality.
Conclusion: Low-rank tensor decomposition gives LRSLAM an efficient, high-performing visual SLAM pipeline and a practical solution for the field.
Abstract: Simultaneous Localization and Mapping (SLAM) has been crucial across various domains, including autonomous driving, mobile robotics, and mixed reality. Dense visual SLAM, leveraging RGB-D camera systems, offers advantages but faces challenges in achieving real-time performance, robustness, and scalability for large-scale scenes. Recent approaches utilizing neural implicit scene representations show promise but suffer from high computational costs and memory requirements. ESLAM introduced a plane-based tensor decomposition but still struggled with memory growth. Addressing these challenges, we propose a more efficient visual SLAM model, called LRSLAM, utilizing low-rank tensor decomposition methods. Our approach, leveraging the Six-axis and CP decompositions, achieves better convergence rates, memory efficiency, and reconstruction/localization quality than existing state-of-the-art approaches. Evaluation across diverse indoor RGB-D datasets demonstrates LRSLAM's superior performance in terms of parameter efficiency, processing time, and accuracy, retaining reconstruction and localization quality. Our code will be publicly available upon publication.
[53] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Lizhen Wang,Zhurong Xia,Tianshu Hu,Pengrui Wang,Pengfei Wang,Zerong Zheng,Ming Zhou
Main category: cs.CV
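TL;DR: Proposes a Diffusion Transformer (DiT)-based framework for generating high-fidelity human-product demonstration videos, addressing the shortcomings of existing methods in identity preservation and spatial-relationship understanding.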
Details
Motivation: In e-commerce and digital marketing, high-quality human-product demonstration videos are essential for product presentation, but existing methods struggle to preserve both human and product identities together with their spatial relationships.
Method: Uses a DiT framework that injects paired human-product reference information with a masked cross-attention mechanism, provides motion guidance through a 3D body mesh template and product bounding boxes, and uses structured text encoding to enhance 3D consistency.
Result: Trained on a hybrid dataset with data augmentation strategies, the method outperforms existing techniques in maintaining identity integrity and generating realistic demonstration motions.
Conclusion: The proposed DiT framework effectively addresses identity preservation and spatial relationships in human-product demonstration video generation, with results better than existing methods.
Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
[54] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration
Jun Wang,Lixing Zhu,Xiaohan Yu,Abhir Bhalerao,Yulan He
Main category: cs.CV
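TL;DR: The PLACE framework improves multimodal learning from medical images and reports through pathological-level alignment and correlation exploration, without additional annotations.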
Details
Motivation: Addresses data scarcity in the medical domain while handling lengthy reports, complex semantics, and insufficient pathological-level consistency.
Method: Proposes a pathological-level cross-modal alignment (PCMA) approach, combined with a Visual Pathology Observation Extractor and a correlation proxy task.
Result: Achieves state-of-the-art performance on classification, retrieval, segmentation, detection, and report generation.
Conclusion: The PLACE framework effectively improves medical visual representation learning and is broadly applicable and robust.
Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.
[55] DanceChat: Large Language Model-Guided Music-to-Dance Generation
Qing Wang,Xiaohang Yang,Yilan Dong,Naveen Raj Govindaraj,Gregory Slabaugh,Shanxin Yuan
Main category: cs.CV
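TL;DR: DanceChat uses a large language model (LLM) to guide music-to-dance generation, using textual instructions to improve diversity and alignment with musical style.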
Details
Motivation: There is a semantic gap between music and dance: music provides only abstract cues, and the one-to-many mapping constrains dance generation. Data scarcity further amplifies the challenge.
Method: (1) LLM-based pseudo instruction generation; (2) multi-modal feature extraction and fusion; (3) diffusion-based motion synthesis with a multi-modal alignment loss.
Result: On the AIST++ dataset and in human evaluations, DanceChat outperforms existing methods both qualitatively and quantitatively.
Conclusion: With explicit guidance from an LLM, DanceChat substantially improves the diversity and musical alignment of generated dance.
Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model's ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues.
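The abstract does not define the multi-modal alignment loss in component (3). As a rough illustration only, one common way to encourage a motion embedding to agree with two conditioning signals is a weighted sum of cosine distances. The function names, the weights `w_music` and `w_text`, and the choice of cosine distance are all assumptions made for this sketch and are not taken from the DanceChat paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(motion, music, text, w_music=1.0, w_text=1.0):
    """Hypothetical alignment loss (not the paper's definition).

    Penalizes low cosine similarity between the motion embedding and
    each conditioning embedding; the loss is 0 when the motion
    embedding points in the same direction as both.
    """
    music_term = 1.0 - cosine(motion, music)
    text_term = 1.0 - cosine(motion, text)
    return w_music * music_term + w_text * text_term
```

In a real system the three vectors would come from learned encoders over a shared representation, and the loss would be added to the diffusion training objective; here they are plain Python lists so the arithmetic is easy to follow.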
Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.
[56] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning
Chun-Mei Feng,Kai Yu,Xinxing Xu,Salman Khan,Rick Siow Mong Goh,Wangmeng Zuo,Yong Liu
Main category: cs.CV
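TL;DR: T2I-PAL uses text-to-image generation to reduce the modality gap and combines heatmaps and prototypes to strengthen multi-label recognition, improving performance by 3.47%.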
Details
Motivation: Addresses the gap between text and image modalities in CLIP models to improve image recognition when only text is used for parameter-efficient fine-tuning.
Method: Uses a text-to-image generation model to produce diverse images, combines a class-wise heatmap and learnable prototypes to strengthen local feature representation, and fuses prompt tuning with adapter learning.
Result: Improves average performance by 3.47% across multiple benchmarks, outperforming existing methods.
Conclusion: T2I-PAL needs no fully semantically annotated images, reducing manual effort, and is compatible with existing CLIP frameworks.
Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models, e.g., CLIP, allow texts to be leveraged directly as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average above the top-ranked state-of-the-art methods.
[57] Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres
Muskan Dosi,Chiranjeev Chiranjeev,Kartik Thakral,Mayank Vatsa,Richa Singh
Main category: cs.CV
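TL;DR: The paper proposes HyperSphereDiff, which uses directional noise to align with hyperspherical data structure, addressing the weak performance of conventional diffusion models on non-Euclidean data such as hyperspheres.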
Details
Motivation: Conventional diffusion models rely on isotropic Gaussian noise, which suits Euclidean spaces but cannot effectively handle the angular geometry of non-Euclidean data such as hyperspheres, degrading generative performance.
Method: Proposes HyperSphereDiff, which uses directional noise to align with hyperspherical structure, preserving class geometry and capturing angular uncertainty.
Result: Theory and experiments show the method better preserves the intrinsic geometry of hyperspherical data, performing well on four object datasets and two face datasets.
Conclusion: Through geometric alignment and uncertainty modeling, HyperSphereDiff substantially improves generation for hyperspherical data.
Abstract: Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: https://github.com/IAB-IITJ/Harmonizing-Geometry-and-Uncertainty-Diffusion-with-Hyperspheres/
[58] Rethinking Random Masking in Self Distillation on ViT
Jihyeon Seong,Hyunkyung Han
Main category: cs.CV
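TL;DR: The study examines the role of random masking in the DINO self-distillation framework and proposes an asymmetric masking strategy that applies random masking only to the student's global view, improving model robustness and performance.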
Details
Motivation: Random masking may inadvertently remove critical semantic information, so more informed masking strategies are needed.
Method: Within the DINO framework, random masking is applied only to the student's global view, while the student's local views and the teacher's global view are kept in their original form.
Result: Experiments show this asymmetric masking strategy yields more robust and fine-grained attention maps and improves downstream performance.
Conclusion: Asymmetric random masking is effective in self-distillation frameworks and balances training efficiency with preservation of semantic information.
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student's global view, while preserving the student's local views and the teacher's global view in their original, unmasked forms. This design leverages DINO's multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
[59] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement
Jin Huang,Honghua Chen,Mingqiang Wei
Main category: cs.CV
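TL;DR: Proposes HEA-MM, a hierarchical error assessment framework for quality assessment of aircraft CAD models, analyzing errors at three levels: global, part, and feature.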
Details
Motivation: The high quality demanded of aviation equipment (high performance, stability, and reliability) requires accurate error assessment methods.
Method: Acquires 3D data of workpieces with structured light scanners, analyzes errors at global, part, and feature levels, and proposes an optimization-based primitive refinement method and a two-stage circular hole detection algorithm.
Result: Experiments show the HEA-MM framework is effective on a variety of aircraft CAD models.
Conclusion: HEA-MM provides a comprehensive error assessment method for aircraft CAD models, meeting the needs of high-quality manufacturing.
Abstract: The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on these patches underlying the point clouds. We propose a novel optimization-based primitive refinement method to obtain a set of meaningful patches of point clouds. Two basic operations, splitting and merging, are introduced to refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate it, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
[60] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection
Xinyuan Liu,Hang Xu,Yike Ma,Yucheng Zhang,Feng Dai
Main category: cs.CV
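TL;DR: The SSP framework uses semantic-decoupled spatial partitioning to address sample assignment and instance confusion in point-supervised object detection in dense scenes, substantially improving performance.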
Details
Motivation: Annotation for object detection in high-density scenes is costly, and existing point-supervised methods suffer from inadequate sample assignment and instance confusion.
Method: SSP combines rule-driven prior injection with data-driven label purification, optimizing sample assignment and bounding box extraction through pixel-level and semantic-level spatial partitioning.
Result: On DOTA-v1.0 and other datasets, SSP achieves 45.78% mAP under point supervision, outperforming the SOTA method by 4.10%, and reaches 47.86% and 48.50% mAP when combined with ORCNN and ReDet, respectively.
Conclusion: SSP offers an efficient, high-performing solution for object detection in dense scenes.
Abstract: Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, which is nevertheless hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.
[61] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model
Eshan Ramesh,Nishio Takayuki
Main category: cs.CV
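TL;DR: LatentCSI is a new method that uses a pretrained latent diffusion model (LDM) to generate images of the physical environment from WiFi CSI measurements; a lightweight neural network maps CSI amplitudes directly into the LDM's latent space, enabling efficient, high-quality image synthesis.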
Details
Motivation: Conventional methods rely on complex, computationally intensive techniques such as GANs; LatentCSI aims to bypass the challenges of pixel-space image generation and avoid the conventional image encoding stage, improving efficiency and image quality.
Method: A lightweight neural network maps CSI amplitudes into the LDM's latent space, followed by text-guided diffusion denoising and decoding with the LDM's decoder to produce a high-resolution image.
Result: Validated on two datasets, LatentCSI outperforms baselines in computational efficiency and perceptual quality and supports text-guided controllability.
Conclusion: LatentCSI offers an efficient, high-quality approach to image synthesis, with the distinctive advantage of text guidance.
Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM's denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM's pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
[62] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin,Xudong Xie,Zhang Li,Xiang Bai,Yuliang Liu
Main category: cs.CV
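TL;DR: MSTAR is a scene text retrieval method that needs no bounding box annotations; it unifies multiple query types through dynamic multi-grained text representations and style-aware instructions, substantially improving performance.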
Details
Motivation: Existing methods depend on costly bounding box annotations and struggle to unify multiple query types, limiting the flexibility and efficiency of scene text retrieval.
Method: Proposes MSTAR, which uses progressive vision embedding to dynamically capture multi-grained text representations, integrates style-aware instructions, and adds a multi-instance matching module to enhance vision-language alignment.
Result: Performs strongly on seven public datasets and the MQTR dataset, with a 6.4% MAP improvement and an average improvement of 8.5% on MQTR.
Conclusion: By eliminating annotation cost and supporting multiple query types, MSTAR substantially advances the practicality and performance of scene text retrieval.
Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
[63] VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Huaying Yuan,Zheng Liu,Junjie Zhou,Ji-Rong Wen,Zhicheng Dou
Main category: cs.CV
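TL;DR: VideoDeepResearch challenges the conventional view that long video understanding (LVU) requires multi-modal large language models (MLLMs), achieving substantial gains with only a text-only large reasoning model (LRM) and a modular multi-modal toolkit.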
Details
Motivation: Long video understanding is complex and constrained by context windows, and conventional approaches rely on multi-modal large language models. This paper challenges that assumption and proposes a more efficient solution.
Method: Proposes the VideoDeepResearch framework, which combines a text-only large reasoning model (LRM) with a multi-modal toolkit (such as retrievers and visual perceivers) and completes tasks by selectively accessing video content.
Result: On the MLVU, LVBench, and LongVideoBench benchmarks, performance improves by 9.6%, 6.6%, and 3.9%, respectively, surpassing existing methods.
Conclusion: Agentic systems show promise for long video understanding without relying on complex multi-modal large language models.
Abstract: Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
[64] TexTailor: Customized Text-aligned Texturing via Effective Resampling
Suin Lee,Dae-Shik Kim
Main category: cs.CV
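TL;DR: TexTailor proposes a new method that generates object textures consistent with a text description by improving the diffusion model and adaptively adjusting camera positions.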
Details
Motivation: Existing methods suffer from a gradual drift in texture properties across viewpoints during synthesis, and fixed camera positions limit the effective use of texture information.
Method: Uses a resampling scheme to integrate previously synthesized texture information and fine-tunes a depth-aware diffusion model; introduces a performance preservation loss and adaptive camera position adjustment.
Result: On the Objaverse and ShapeNet datasets, TexTailor outperforms existing methods at generating view-consistent textures.
Conclusion: By improving the texture synthesis process, TexTailor substantially improves texture consistency and generation quality.
Abstract: We present TexTailor, a novel method for generating consistent object textures from textual descriptions. Existing text-to-texture synthesis approaches utilize depth-aware diffusion models to progressively generate images and synthesize textures across predefined multiple viewpoints. However, these approaches lead to a gradual shift in texture properties across viewpoints due to (1) insufficient integration of previously synthesized textures at each viewpoint during the diffusion process and (2) the autoregressive nature of the texture synthesis process. Moreover, the predefined selection of camera positions, which does not account for the object's geometry, limits the effective use of texture information synthesized from different viewpoints, ultimately degrading overall texture consistency. In TexTailor, we address these issues by (1) applying a resampling scheme that repeatedly integrates information from previously synthesized textures within the diffusion process, and (2) fine-tuning a depth-aware diffusion model on these resampled textures. During this process, we observed that using only a few training images restricts the model's original ability to generate high-fidelity images aligned with the conditioning, and therefore propose a performance preservation loss to mitigate this issue. Additionally, we improve the synthesis of view-consistent textures by adaptively adjusting camera positions based on the object's geometry. Experiments on a subset of the Objaverse dataset and the ShapeNet car dataset demonstrate that TexTailor outperforms state-of-the-art methods in synthesizing view-consistent textures.
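The abstract does not say how camera positions are adapted to the object's geometry. To make the general idea concrete, the sketch below uses a generic greedy view-selection heuristic: from a set of candidate view directions, repeatedly pick the one that sees the most surface normals not yet covered. This is an assumed stand-in, not TexTailor's procedure; `visible`, `select_views`, and the facing-normal visibility test (which ignores occlusion) are hypothetical simplifications.

```python
def visible(normal, cam_dir, threshold=0.0):
    """Treat a surface point as visible if its normal faces the camera.

    `cam_dir` points from the object toward the camera. Occlusion is
    ignored, which is a deliberate simplification for this sketch.
    """
    return sum(n * c for n, c in zip(normal, cam_dir)) > threshold

def select_views(normals, candidates, num_views):
    """Greedily choose up to `num_views` camera directions.

    Each step picks the unused candidate that covers the largest number
    of still-uncovered surface normals, and stops early once no
    candidate adds coverage.
    """
    uncovered = set(range(len(normals)))
    chosen = []
    for _ in range(num_views):
        best, best_cover = None, set()
        for cam in candidates:
            if cam in chosen:
                continue
            cover = {i for i in uncovered if visible(normals[i], cam)}
            if len(cover) > len(best_cover):
                best, best_cover = cam, cover
        if best is None:  # nothing left to gain
            break
        chosen.append(best)
        uncovered -= best_cover
    return chosen
```

For example, with normals `[(1, 0, 0), (1, 0, 0), (0, 1, 0), (-1, 0, 0)]` and the four candidates `(1, 0, 0)`, `(0, 1, 0)`, `(-1, 0, 0)`, `(0, 0, 1)`, the heuristic selects the first three directions in that order and skips `(0, 0, 1)`, which sees no surface.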
The source code for TexTailor is available at https://github.com/Adios42/Textailor
[65] VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu,Feng Cheng,Ziyan Yang,Qi Zhao,Shanchuan Lin,Yichun Shi,Yicong Li,Wenjie Wang,Tat-Seng Chua,Lu Jiang
Main category: cs.CV
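TL;DR: This paper proposes learning an in-context image editing model directly from videos, using multimodal sequence annotation and a block-causal diffusion transformer to achieve efficient image editing.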
Details
Motivation: Existing methods depend on task-specific pipelines and expert models, limiting flexibility and scalability. This paper explores whether in-context image editing can be learned directly from videos.
Method: Proposes a scalable video annotation approach and designs a block-causal diffusion transformer trained on three proxy tasks (next-image prediction, current segmentation prediction, and next-segmentation prediction).
Result: The model performs strongly on in-context image editing, achieves state-of-the-art results on multi-turn image editing benchmarks, and shows potential for multi-concept composition, story generation, and more.
Conclusion: Learning in-context image editing directly from videos is feasible; the model performs well across multiple tasks and opens a new direction for future research.
Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
[66] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models
Konstantinos Vilouras,Ilias Stogiannidis,Junyu Yan,Alison Q. O'Neil,Sotirios A. Tsaftaris
Main category: cs.CV
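TL;DR: This paper proposes a fine-tuning framework for text-to-image latent diffusion models in medical imaging that addresses the alignment of clinical information with images and achieves the best performance on a standard dataset.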
Details
Motivation: Text-to-image latent diffusion models are underexplored in medical imaging because of data scarcity, and existing models fail to align clinical text with image information.
Method: Fine-tunes a pre-trained model to improve multi-modal alignment so that it can be used for downstream tasks such as phrase grounding.
Result: Sets a new state of the art on the MS-CXR dataset and performs robustly on VinDr-CXR data.
Conclusion: The proposed method effectively improves alignment between medical images and text and has potential for practical use.
Abstract: Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.
[67] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Yuxuan Luo,Yuhui Yuan,Junwen Chen,Haonan Cai,Ziyi Yue,Yuwei Yang,Fatima Zohra Daha,Ji Li,Zhouhui Lian
Main category: cs.CV
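TL;DR: The paper proposes the knowledge image generation task and the MMMG benchmark to evaluate the reasoning ability of image generation models, finds that existing models perform poorly, and releases FLUX-Reason as an open baseline.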
Details
Motivation: Knowledge images are central to human civilization and learning, but generating them requires multimodal reasoning that existing models lack.
Method: Proposes the MMMG benchmark, comprising 4,456 expert-validated image-prompt pairs with a unified knowledge graph representation, and designs the MMMG-Score evaluation metric.
Result: Evaluation of 16 state-of-the-art text-to-image generation models reveals serious reasoning deficits, with GPT-4o scoring only 50.20.
Conclusion: Releases FLUX-Reason as an open baseline to advance knowledge image generation.
Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning--a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits--low entity fidelity, weak relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
[68] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models
Francisco Caetano,Christiaan Viviers,Peter H. N. De With,Fons van der Sommen
Main category: cs.CV
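TL;DR: SymmFlow is a symmetrical flow matching framework that unifies semantic segmentation, classification, and image generation, achieving efficient sampling and semantic preservation through a bi-directionally consistent learning objective.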
Details
Motivation: Existing methods lack bi-directional consistency and semantic preservation in distribution transformations; SymmFlow aims to address this and support flexible pixel-level and image-level conditional generation.
Method: Uses a symmetric learning objective to jointly model forward and reverse transformations and introduces a new training objective that explicitly retains semantic information, supporting one-step segmentation and classification.
Result: Achieves FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff, and is competitive on segmentation and classification tasks.
Conclusion: SymmFlow offers an efficient, semantically consistent solution for multi-task generative modeling; the code will be open-sourced.
Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.
[69] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
Xiaoyi Bao,Jindi Lv,Xiaofeng Wang,Zheng Zhu,Xinze Chen,YuKun Zhou,Jiancheng Lv,Xingang Wang,Guan Huang
Main category: cs.CV
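TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that needs no human annotation and improves pre-trained models through automatic feedback.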
Details
Motivation: Existing video generation models still need fine-tuning in areas such as instance preservation and motion rationality, but conventional methods depend on human annotation and heavy compute.
Method: Proposes a prompt-driven data engine and a reward-guided training strategy, using pre-trained vision-language models for automatic optimization.
Result: On the VBench-2.0 benchmark, GigaVideo-1 improves by about 4% on average using only 4 GPU-hours.
Conclusion: GigaVideo-1 is efficient and effective, needs no human annotation, and is suited to practical use.
Abstract: Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
[70] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis
Marzieh Oghbaie,Teresa Araújo,Hrvoje Bogunović
Main category: cs.CV
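TL;DR: PiPViT is a vision-transformer-based prototype model that learns interpretable prototypes to make medical image classification more transparent and clinically relevant.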
Details
Motivation: Existing prototype methods learn overly fine-grained prototypes for medical images, and their visualizations do not align with human-understandable biomarkers; PiPViT aims to address both problems. Method: PiPViT uses a vision transformer (ViT) to capture long-range dependencies and combines contrastive learning with multi-resolution input processing to learn clinically meaningful prototypes. Result: On retinal OCT image classification across four datasets, PiPViT achieves competitive performance, and its prototypes are clinically relevant and give more meaningful explanations. Conclusion: PiPViT can explain its decisions transparently and help clinicians understand diagnostic outcomes. Abstract: Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. GitHub page: https://github.com/marziehoghbaie/PiPViT
[71] Enhancing Deepfake Detection using SE Block Attention with CNN
Subhram Dasgupta,Janelle Mason,Xiaohong Yuan,Olusola Odeyomi,Kaushik Roy
Main category: cs.CV
TL;DR: Proposes a lightweight CNN with an SE block for deepfake detection that reaches high accuracy while saving computational resources.
Details
Motivation: Deepfake technology threatens the authenticity of information, and existing detection models are large and resource-hungry. Method: A lightweight CNN combined with an SE block that dynamically recalibrates channel-wise feature weights. Result: 94.14% classification accuracy and an AUC-ROC score of 0.985 on the Style GAN dataset. Conclusion: The approach points toward efficient, scalable deepfake detection. Abstract: In the digital age, Deepfakes present a formidable challenge by using advanced artificial intelligence to create highly convincing manipulated content, undermining information authenticity and security. These sophisticated fabrications surpass traditional detection methods in complexity and realism. To address this issue, we aim to harness cutting-edge deep learning methodologies to engineer an innovative deepfake detection model. However, most of the models designed for deepfake detection are large, causing heavy storage and memory consumption. In this research, we propose a lightweight convolutional neural network (CNN) with squeeze and excitation block attention (SE) for Deepfake detection. The SE block module is designed to perform dynamic channel-wise feature recalibration. The SE block allows the network to emphasize informative features and suppress less useful ones, which leads to a more efficient and effective learning module. This module is integrated with a simple sequential model to perform Deepfake detection. The model is smaller in size and it achieves competitive accuracy with the existing models for deepfake detection tasks. The model achieved an overall classification accuracy of 94.14% and AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse Fake Face Dataset. Our proposed approach presents a promising avenue for combating the Deepfake challenge with minimal computational resources, developing efficient and scalable solutions for digital content verification.
[72] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework
Xia Du,Xiaoyuan Liu,Jizhe Zhou,Zheng Lin,Chi-man Pun,Zhe Chen,Wei Ni,Jun Luo
Main category: cs.CV
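TL;DR: Proposes UAC, a new framework that generates high-fidelity adversarial examples from text prompts, increases CAPTCHA diversity, and supports both targeted and untargeted attacks.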
Details
Motivation: Advances in deep learning leave traditional CAPTCHA schemes vulnerable to automated attacks, and existing adversarial attack methods depend on original image features, which causes distortion and limits applicability. Method: UAC uses a large language model in generating adversarial examples; targeted attacks use the EDICT method to optimize a diffusion model, while untargeted attacks use the BP-UAC strategy, which combines multimodal gradients with bi-path optimization. Result: Experiments show BP-UAC achieves high attack success rates across diverse systems, and the generated CAPTCHAs are indistinguishable to both humans and DNNs. Conclusion: The UAC framework overcomes limitations of existing adversarial attack methods and improves CAPTCHA security and diversity. Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on original image characteristics, resulting in distortions that hinder human interpretation and limit applicability in scenarios lacking initial input images. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel framework generating high-fidelity adversarial examples guided by attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC enhances CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for superior image quality. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-UAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show BP-UAC achieves high attack success rates across diverse systems, generating natural CAPTCHAs indistinguishable to humans and DNNs.
[73] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery
Christopher Gaul,Eduardo Fidalgo,Enrique Alegre,Rocío Alaiz Rodríguez,Eri Pérez Corral
Main category: cs.CV
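TL;DR: Proposes a multi-task architecture that pairs a frozen FaRL vision-language backbone with a compact MLP for automatic screening of minors, improving performance through a modified loss function and sampling strategy.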
Details
Motivation: Address the under-representation of minors in public data and the problem of distribution shift, to improve the accuracy and robustness of automatic screening. Method: A multi-task architecture combining age regression with several binary classification tasks, trained with an α-reweighted focal loss and age-balanced sampling. Result: On the ASORES-39k test set, RMSE drops from 5.733 to 5.656 and the F2 score rises from 0.801 to 0.857; on the ASWIFT-20k test set, recall stays near 0.99 and F2 rises from 0.742 to 0.833. Conclusion: The proposed method significantly improves the accuracy and robustness of minor screening, particularly under distribution shift and extreme conditions. Abstract: Accurate automatic screening of minors in unconstrained images demands models that are robust to distribution shift and resilient to the under-representation of children in publicly available data. To overcome these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing on the legally critical age range. To address the severe class imbalance, we introduce an α-reweighted focal-style loss and age-balanced mini-batch sampling, which equalizes twelve age bins during stochastic optimization. Further improvement is achieved with an age gap that removes edge cases from the loss. Moreover, we set a rigorous evaluation by proposing the Overall Under-Age Benchmark, with 303k cleaned training images and 110k test images, defining both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild shifts test "ASWIFT-20k" of 20k images, stressing extreme pose (>45°), expression, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" lowers the root-mean-square-error on the ASORES-39k restricted test from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline, demonstrating strong generalization under distribution shift. For the under-12 and under-15 tasks, the boosts in F2 are from 0.666 to 0.955 and from 0.689 to 0.916, respectively.
[74] Continual Hyperbolic Learning of Instances and Classes
Melika Ayoughi,Mina Ghadimi Atigh,Mohammad Mahdi Derakhshani,Cees G. M. Snoek,Pascal Mettes,Paul Groth
Main category: cs.CV
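TL;DR: Proposes HyperCLIC, a continual learning algorithm that handles instance-level and class-level continual learning simultaneously, uses hyperbolic space to model hierarchical relations, and is validated in dynamic real-world environments.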
Details
Motivation: Real-world applications such as robotics and self-driving cars need models that continually learn instances and classes at the same time, yet current methods address only one level. The paper aims to fill this gap. Method: Proposes HyperCLIC, which models hierarchical relations in hyperbolic space and combines hyperbolic classification and distillation objectives. Result: Validated on the EgoObjects dataset, HyperCLIC operates effectively across multiple granularities and improves hierarchical generalization. Conclusion: HyperCLIC offers an effective way to learn instances and classes simultaneously and suits dynamic real-world scenarios. Abstract: Continual learning has traditionally focused on classifying either instances or classes, but real-world applications, such as robotics and self-driving cars, require models to handle both simultaneously. To mirror real-life scenarios, we introduce the task of continual learning of instances and classes, at the same time. This task challenges models to adapt to multiple levels of granularity over time, which requires balancing fine-grained instance recognition with coarse-grained class generalization. In this paper, we identify that classes and instances naturally form a hierarchical structure. To model these hierarchical relationships, we propose HyperCLIC, a continual learning algorithm that leverages hyperbolic space, which is uniquely suited for hierarchical data due to its ability to represent tree-like structures with low distortion and compact embeddings. Our framework incorporates hyperbolic classification and distillation objectives, enabling the continual embedding of hierarchical relations. To evaluate performance across multiple granularities, we introduce continual hierarchical metrics. We validate our approach on EgoObjects, the only dataset that captures the complexity of hierarchical object recognition in dynamic real-world environments. Empirical results show that HyperCLIC operates effectively at multiple granularities with improved hierarchical generalization.
[75] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement
Yuqi Shen,Fengyang Xiao,Sujie Hu,Youwei Pang,Yifan Pu,Chengyu Fang,Xiu Li,Chunming He
Main category: cs.CV
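TL;DR: Proposes UMBD, a generative refinement framework for camouflaged object detection (COD) that uses an uncertainty-guided masking mechanism and Bernoulli diffusion to selectively refine regions with poor segmentation quality.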
Details
Motivation: Existing COD methods leave the potential of post-processing refinement underexplored, especially for regions with poor segmentation quality. Method: Proposes the UMBD model, which combines an uncertainty-guided masking mechanism with Bernoulli diffusion, and designs the HUQNet network for multi-source uncertainty quantification. Result: Consistent gains on multiple COD benchmarks, averaging 5.5% in MAE and 3.2% in weighted F-measure. Conclusion: The UMBD framework integrates seamlessly with existing COD models and improves performance with only modest computational overhead. Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.
[76] Deep Learning-based Multi Project InP Wafer Simulation for Unsupervised Surface Defect Detection
Emílio Dolgener Cantú,Rolf Klemens Wittmann,Oliver Abdeen,Patrick Wagner,Wojciech Samek,Moritz Baier,Sebastian Lapuschkin
Main category: cs.CV
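TL;DR: Proposes a deep-neural-network method for generating a synthetic golden standard, addressing the reliance on manual inspection in InP wafer manufacturing caused by the lack of golden standards. By simulating realistic wafer images, the method makes defect detection markedly more efficient.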
Details
Motivation: In InP wafer manufacturing, low production scale and high design variability mean golden-standard templates are typically unavailable, so defect detection is manual and inefficient. Method: Proposes a deep-neural-network approach that trains models to generate photo-realistic InP wafer images from CAD data, simulating a golden standard. Result: The deep learning method outperforms a decision-tree-based baseline on both synthetic data and real wafer photographs and efficiently produces simulated golden standards for defect detection. Conclusion: By generating a simulated golden standard, the method makes InP wafer defect detection more efficient and practical. Abstract: Quality management in semiconductor manufacturing often relies on template matching with known golden standards. For Indium-Phosphide (InP) multi-project wafer manufacturing, low production scale and high design variability lead to such golden standards being typically unavailable. Defect detection, in turn, is manual and labor-intensive. This work addresses this challenge by proposing a methodology to generate a synthetic golden standard using Deep Neural Networks, trained to simulate photo-realistic InP wafer images from CAD data. We evaluate various training objectives and assess the quality of the simulated images on both synthetic data and InP wafer photographs. Our deep-learning-based method outperforms a baseline decision-tree-based approach, enabling the use of a 'simulated golden die' from CAD plans in any user-defined region of a wafer for more efficient defect detection. We apply our method to a template matching procedure, to demonstrate its practical utility in surface defect detection.
[77] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
Hong Huang,Weixiang Sun,Zhijian Wu,Jingwen Niu,Donghuan Lu,Xian Wu,Yefeng Zheng
Main category: cs.CV
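TL;DR: IQE-CLIP is a CLIP-based zero-shot and few-shot anomaly detection framework designed for the medical domain; it combines textual and visual information to produce anomaly-sensitive embeddings and significantly improves performance.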
Details
Motivation: Existing CLIP-based methods rely on scenario-specific prompts, struggle to separate normal from anomalous instances, and have seen limited exploration in the medical domain. Method: Proposes the IQE-CLIP framework, which introduces class-based and learnable prompting tokens and an instance-aware query module that extracts multimodal region-level information. Result: On six medical datasets, IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Conclusion: IQE-CLIP addresses anomaly detection in the medical domain effectively by fusing multimodal information. Abstract: Recent advances in vision-language models, such as CLIP, have significantly improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based methods assume prior knowledge of categories and rely on carefully designed prompts tailored to specific scenarios. While these text prompts capture semantic information in the textual space, they often fail to distinguish normal and anomalous instances in the joint embedding space. Moreover, most ZFSAD approaches focus on industrial domains, with limited exploration in medical tasks. To address these limitations, we propose IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query embeddings integrating both textual and instance-aware visual information serve as more effective indicators of anomalies. Specifically, we introduce class-based and learnable prompting tokens to better adapt CLIP to the medical setting. Furthermore, we design an instance-aware query module that extracts region-level contextual information from both modalities, enabling the generation of anomaly-sensitive embeddings. Extensive experiments on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Code and data are available at https://github.com/hongh0/IQE-CLIP/.
[78] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
SiXiang Chen,Jianyu Lai,Jialin Gao,Tian Ye,Haoyu Chen,Hengyu Shi,Shitong Shao,Yunlong Lin,Song Fei,Zhaohu Xing,Yeying Jin,Junfeng Luo,Xiaoming Wei,Lei Zhu
Main category: cs.CV
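TL;DR: PosterCraft is a unified framework for generating high-quality posters that improves text rendering, layout, and aesthetics through a multi-stage optimization pipeline.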
Details
Motivation: Generating artistic, visually appealing posters is harder than producing simple designs and requires solving text rendering, layout, and style integration together. Method: PosterCraft uses a cascaded workflow: text-rendering optimization, region-aware fine-tuning, aesthetic-text reinforcement learning, and vision-language feedback refinement. Result: PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and visual appeal, approaching the level of commercial systems. Conclusion: Through automated data construction and multi-stage optimization, PosterCraft achieves high-quality poster generation and offers a new approach for the field. Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated on multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found in the Project page: https://ephemeral182.github.io/PosterCraft
[79] Stroke-based Cyclic Amplifier: Image Super-Resolution at Arbitrary Ultra-Large Scales
Wenhao Guo,Peng Lu,Xujun Peng,Zhaoran Zhao,Sheng Li
Main category: cs.CV
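TL;DR: Proposes SbCA, a unified model based on stroke-vector amplification for ultra-large-scale image super-resolution, addressing the performance drop of conventional methods beyond their training range.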
Details
Motivation: Conventional arbitrary-scale image super-resolution methods degrade sharply beyond their training range, producing blurry images. Method: A stroke vector amplifier decomposes the image into vector graphics for magnification, a detail completion module restores missing details, and a cyclic strategy refines the result iteratively. Result: The method performs strongly on ultra-large upsampling tasks (e.g., ×100), significantly outperforming existing methods and producing high-quality images. Conclusion: The SbCA model effectively addresses distribution drift, eliminates artifacts and blurring, and suits ultra-large-scale image super-resolution. Abstract: Prior Arbitrary-Scale Image Super-Resolution (ASISR) methods often experience a significant performance decline when the upsampling factor exceeds the range covered by the training data, introducing substantial blurring. To address this issue, we propose a unified model, Stroke-based Cyclic Amplifier (SbCA), for ultra-large upsampling tasks. The key to SbCA is the stroke vector amplifier, which decomposes the image into a series of strokes represented as vector graphics for magnification. The detail completion module then restores missing details, ensuring high-fidelity image reconstruction. Our cyclic strategy achieves ultra-large upsampling by iteratively refining details with this unified SbCA model, trained only once for all, while keeping sub-scales within the training range. Our approach effectively addresses the distribution drift issue and eliminates artifacts, noise and blurring, producing high-quality, high-resolution super-resolved images. Experimental validations on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing methods in ultra-large upsampling tasks (e.g., ×100), delivering visual quality far superior to state-of-the-art techniques.
[80] SlotPi: Physics-informed Object-centric Reasoning Models
Jian Li,Wan Han,Ning Lin,Yu-Liang Zhan,Ruizhi Chengze,Haining Wang,Yi Zhang,Hongsheng Liu,Zidong Wang,Fan Yu,Hao Sun
Main category: cs.CV
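TL;DR: SlotPi is a physics-informed, object-centric dynamics reasoning model that addresses two gaps in existing methods: the lack of integrated physical knowledge and the lack of validation of model adaptability.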
Details
Motivation: Humans acquire physical knowledge by observing the world and apply it to reasoning about dynamic scenes, whereas existing methods neither integrate physical knowledge nor validate across diverse scenarios. Method: SlotPi combines a physics module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Result: The model performs well on prediction and visual question answering tasks on benchmark and fluid datasets, and its adaptability is validated on a real-world dataset. Conclusion: SlotPi's strong adaptability lays a foundation for developing more advanced world models. Abstract: Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical insights by observing the world and apply this knowledge to accurately reason about various dynamic scenarios; 2) the validation of model adaptability across diverse scenarios. Real-world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot-based physics-informed object-centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Our experiments highlight the model's strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model's capabilities. The model's robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.
[81] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning
Ignacio Bugueno-Cordova,Javier Ruiz-del-Solar,Rodrigo Verschae
Main category: cs.CV
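TL;DR: Proposes a robot navigation controller that combines event cameras with reinforcement learning for real-time human-centered navigation and obstacle avoidance.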
Details
Motivation: Conventional image-based controllers suffer from fixed rates, motion blur, and latency; the asynchronous nature of event cameras can address these problems. Method: Integrates event-based perception, additional range sensing, and Deep Deterministic Policy Gradient optimization, with imitation learning to improve sample efficiency. Result: Demonstrates robust navigation, pedestrian following, and obstacle avoidance in simulated environments. Conclusion: Combining event cameras with reinforcement learning provides an effective approach to real-time navigation. Abstract: This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.
[82] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization
Mario Barbara,Alaa Maalouf
Main category: cs.CV
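TL;DR: Proposes Prompts-to-Summaries, a zero-shot video summarization method that combines video-language models with large language models to produce user-controllable summaries without any training data, outperforming unsupervised methods and matching supervised ones.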
Details
Motivation: The explosive growth of video data calls for flexible, user-controllable summarization tools; existing methods either depend on datasets or cannot incorporate user intent expressed in natural language. Method: The pipeline comprises video segmentation, scene description generation, LLM scoring, and score propagation, refining frame importance with consistency and uniqueness metrics. Result: Surpasses unsupervised methods on SumMe and TVSum, is competitive on the QFVS benchmark, and releases a new dataset, VidSum-Reason. Conclusion: Pre-trained multimodal models, with principled prompting and score propagation, provide a strong foundation for general text-queryable video summarization. Abstract: The explosive growth of video data has intensified the need for flexible user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without the use of training data at all, beating all unsupervised and matching supervised methods. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally, (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data and the competing methods requiring supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
[83] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing
Hang Zhang,Xiang Chen,Renjiu Hu,Rongguang Wang,Jinwei Zhang,Min Liu,Yaonan Wang,Gaolei Li,Xinxing Cheng,Jinming Duan
Main category: cs.CV
TL;DR: SmoothProper是一种即插即用的神经模块,通过结合对偶优化层和定制交互项,解决了稀疏特征和大位移挑战,显著降低了配准误差。
Details
Motivation: Conventional unsupervised DIR methods perform poorly with sparse features and large displacements; SmoothProper aims to address this by enforcing smoothness and message passing within the network's forward pass. Method: SmoothProper integrates a duality-based optimization layer with tailored interaction terms to propagate flow signals spatially, enforce smoothness, and preserve structural consistency, without regularizer hyperparameter tuning. Result: On a retinal vessel dataset, SmoothProper reduces registration error to 1.88 pixels and is the first unsupervised DIR approach to handle both the sparse-feature and large-displacement challenges effectively. Conclusion: SmoothProper is a model-agnostic, efficient module for unsupervised DIR that significantly improves registration accuracy. Abstract: Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network's forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.
[84] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders
Hui Yang,Wei Sun,Jian Liu,Jin Zheng,Jian Xiao,Ajmal Mian
Main category: cs.CV
TL;DR: 论文提出了一种基于掩码自编码器的遮挡感知手-物体姿态估计方法(HOMAE),通过目标聚焦掩码策略和多尺度特征融合,显著提升了遮挡情况下的估计性能。
Details
Motivation: Existing methods do not sufficiently explore global structural perception and reasoning when handling occlusion in hand-object interactions, which limits their effectiveness. Method: Proposes a target-focused masking strategy and fuses multi-scale features, a signed distance field (SDF), and a point cloud to improve robustness in occluded regions. Result: HOMAE achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks. Conclusion: By combining global context with local geometric information, HOMAE handles occlusion effectively and offers a new direction for hand-object pose estimation. Abstract: Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.
[85] Post-Training Quantization for Video Matting
Tianrui Zhu,Houyuan Chen,Ruihao Gong,Michele Magno,Haotong Qin,Kai Zhang
Main category: cs.CV
TL;DR: 本文提出了一种针对视频抠图模型的PTQ框架,通过两阶段量化策略、全局仿射校准和光流辅助组件,显著提升了量化后的精度和时序一致性,实现了接近全精度的性能。
Details
Motivation: Deploying video matting on resource-constrained devices is challenging, and existing PTQ methods fall short in accuracy and temporal coherence. Method: Proposes a two-stage PTQ strategy (block-reconstruction optimization followed by global calibration), a Global Affine Calibration (GAC) method, and an Optical Flow Assistance (OFA) component. Result: PTQ4VM achieves state-of-the-art accuracy across bit-widths; 4-bit quantization performs close to full precision with 8x FLOP savings. Conclusion: The framework offers an efficient solution for quantizing video matting models and markedly improves performance. Abstract: Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, even reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model's ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to the existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8x FLOP savings.
[86] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu,Yue Wu,Meng Chu,Zhifei Ren,Zizheng Huang,Pei Chu,Ruijie Zhang,Yinan He,Qirui Li,Songze Li,Zhenxiang Li,Zhongying Tu,Conghui He,Yu Qiao,Yali Wang,Yi Wang,Limin Wang
Main category: cs.CV
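TL;DR: VRBench is a long narrative video benchmark for the multi-step reasoning abilities of large models, filling the gap left by existing evaluations in temporal reasoning and procedural validity.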
Details
Motivation: Existing evaluations overlook temporal reasoning and procedural validity; VRBench aims to provide a more comprehensive tool for evaluating multi-step reasoning. Method: VRBench contains 1,010 long videos with extensive annotations, uses multi-stage filtering and expert review to ensure video quality, and develops a human-AI collaborative framework to generate multi-step reasoning chains. Result: Evaluating 12 LLMs and 16 VLMs, VRBench provides in-depth analysis and insights into multi-step reasoning. Conclusion: VRBench sets a new standard for multi-step reasoning evaluation and advances the field. Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
[87] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
Zhao Zhang,Yutao Cheng,Dexiang Hong,Maoke Yang,Gonglei Shi,Lei Ma,Hui Zhang,Jie Shao,Xinglong Wu
Main category: cs.CV
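TL;DR: CreatiPoster is a framework for generating editable multi-layer graphic designs from natural-language instructions or asset inputs, combining a protocol model with a background model to produce high-quality, editable designs.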
Details
Motivation: Current AI tools for graphic design struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal, while commercial systems rely on template libraries that are hard to replicate.
Method: A protocol model produces a JSON specification describing the content of every layer, and a conditional background model then generates a coherent background.
Result: CreatiPoster surpasses open-source and commercial systems on automated metrics, and the authors release a copyright-free corpus of 100,000 designs.
Conclusion: CreatiPoster advances the democratization of AI-assisted graphic design and supports a variety of applications.
Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on these rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: https://github.com/graphic-design-ai/creatiposter
[88] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement
Guimeng Liu,Milad Abdollahzadeh,Ngai-Man Cheung
Main category: cs.CV
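TL;DR: The paper improves zero-shot generative model adaptation (ZSGM) by analyzing offset misalignment in the CLIP embedding space and proposing Adaptation with Iterative Refinement (AIR) to raise generated image quality.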
Details
Motivation: Existing ZSGM methods assume that the image offset and the text offset are fully aligned in the CLIP embedding space, which degrades generated image quality. The paper aims to address this problem.
Method: An empirical study analyzes offset misalignment in the CLIP embedding space, and the proposed Adaptation with Iterative Refinement (AIR) uses the resulting insight to improve generation quality.
Result: Across 26 experiment setups, AIR achieves state-of-the-art performance in qualitative and quantitative evaluations and in a user study.
Conclusion: The paper reveals the relationship between offset misalignment and concept distance, and AIR significantly improves zero-shot generative model adaptation.
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches is the directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets between these two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in quality degradation in generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative and quantitative evaluations and a user study across 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance.
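The directional loss described in this abstract can be illustrated with a minimal sketch. It assumes the common formulation of one minus the cosine similarity between the image offset and the text offset; the abstract does not give the exact objective, and the toy three-dimensional vectors below are hypothetical stand-ins for real CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def offset(src, tgt):
    """Offset (difference vector) from a source embedding to a target embedding."""
    return [t - s for s, t in zip(src, tgt)]

def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """Assumed form: 1 - cos(image offset, text offset).

    The loss is 0 when the two offsets point in the same direction and
    grows as they become misaligned.
    """
    return 1.0 - cosine(offset(img_src, img_tgt), offset(txt_src, txt_tgt))

# Hypothetical toy embeddings, not produced by CLIP.
txt_src, txt_tgt = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]  # text offset points along axis 1
img_src = [0.5, 0.2, 0.1]
img_aligned = [0.5, 0.9, 0.1]     # image offset parallel to the text offset
img_misaligned = [0.5, 0.2, 0.8]  # image offset orthogonal to the text offset

loss_aligned = directional_loss(img_src, img_aligned, txt_src, txt_tgt)
loss_misaligned = directional_loss(img_src, img_misaligned, txt_src, txt_tgt)
print(loss_aligned, loss_misaligned)
```

The same quantity, computed between embeddings of real images and their domain labels rather than generated ones, is one plausible way to measure the offset misalignment the empirical study refers to; the authors' actual measurement may differ.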
Additional experiments are in the supplementary material.
[89] M4V: Multi-Modal Mamba for Text-to-Video Generation
Jiancheng Huang,Gengwei Zhang,Zequn Jie,Siyu Jiao,Yinlong Qian,Ling Chen,Yunchao Wei,Lin Ma
Main category: cs.CV
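TL;DR: M4V is a Mamba-based multi-modal text-to-video generation framework that uses multi-modal diffusion Mamba blocks and a reward learning strategy to substantially cut computational cost and improve video quality.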
Details
Motivation: Transformers incur high computational complexity in text-to-video generation; the paper aims to address this while making multi-modal and spatiotemporal modeling more efficient.
Method: The paper proposes a multi-modal diffusion Mamba (MM-DiM) block, combined with a multi-modal token re-composition design and a reward learning strategy.
Result: When generating videos at 768×1280 resolution, the Mamba blocks in M4V reduce FLOPs by 45% relative to the attention-based alternative, and experiments demonstrate high-quality video generation.
Conclusion: Through efficient multi-modal and spatiotemporal modeling, M4V significantly lowers computational cost while improving video generation quality.
Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768×1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.
[90] SpectralAR: Spectral Autoregressive Visual Generation
Yuanhui Huang,Weiliang Chen,Wenzhao Zheng,Yueqi Duan,Jie Zhou,Jiwen Lu
Main category: cs.CV
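TL;DR: The paper proposes SpectralAR, an autoregressive visual generation framework built on a spectral perspective, which achieves causality for visual sequences through nested spectral tokenization and improves efficiency through coarse-to-fine generation.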
Details
Motivation: Existing methods construct visual sequences as spatial patches for autoregressive generation, but image patches are inherently parallel, which contradicts the causal nature of autoregressive modeling.
Method: Nested spectral tokenization first converts an image into ordered spectral tokens representing lower to higher frequency components, and autoregressive generation is then performed on the spectral token sequence.
Result: Experiments on ImageNet-1K show that SpectralAR reaches 3.02 gFID with only 64 tokens and 310M parameters.
Conclusion: SpectralAR achieves both causality and efficiency for visual sequences from a spectral perspective, offering a new direction for autoregressive visual generation.
Abstract: Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
[91] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang,Mengzhen Liu,Lichen Li,Ming Lu,Yuan Zhang,Junwen Pan,Qi She,Shanghang Zhang
Main category: cs.CV
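TL;DR: CDPruner is a new visual token pruning method that maximizes conditional diversity to cut the inference cost of multimodal large language models (MLLMs), substantially reducing computation while maintaining strong performance.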
Details
Motivation: Current visual token pruning methods either retain redundant tokens or ignore instruction relevance, leading to suboptimal performance.
Method: The paper proposes CDPruner, which selects tokens using conditional diversity and a determinantal point process (DPP).
Result: CDPruner achieves state-of-the-art performance across multiple MLLMs and significantly reduces FLOPs and latency (for LLaVA, 95% fewer FLOPs and 78% lower latency) while retaining 94% of the original accuracy.
Conclusion: By maximizing conditional diversity, CDPruner prunes visual tokens efficiently, applies to a variety of MLLMs, and significantly improves inference efficiency.
Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
[92] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos
Weiliang Chen,Wenzhao Zheng,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu,Yueqi Duan
Main category: cs.CV
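TL;DR: GenWorld is a large-scale, high-quality real-world simulation dataset for AI-generated video detection, accompanied by SpannDetector, a detection method based on multi-view consistency.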
Details
Motivation: The spread of video generation technology threatens the credibility of real-world information, and existing detectors are limited by the lack of high-quality real-world datasets.
Method: The authors build the GenWorld dataset, which covers real-world scenario simulation, high-quality forged videos, and diverse prompt modalities, and propose SpannDetector, a model that uses multi-view consistency to detect AI-generated videos.
Result: Experiments show that SpannDetector performs strongly at detecting high-quality AI-generated videos and point to an explainable detection direction based on physical plausibility.
Conclusion: The GenWorld dataset and the SpannDetector method provide an important push for the field of AI-generated video detection.
Abstract: The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (e.g., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld
[93] QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction
Sicheng Zuo,Wenzhao Zheng,Xiaoyong Han,Longchao Yang,Yong Pan,Jiwen Lu
Main category: cs.CV
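TL;DR: The paper proposes QuadricFormer, an efficient 3D occupancy prediction method based on superquadrics, which improves modeling efficiency through geometric diversity and a probabilistic mixture model.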
Details
Motivation: Existing 3D occupancy prediction methods are inefficient and cannot model complex geometric structures effectively, so a more efficient representation is needed.
Method: The paper uses superquadrics as scene primitives, develops a probabilistic superquadric mixture model, and proposes the QuadricFormer model together with a pruning-and-splitting module.
Result: On the nuScenes dataset, QuadricFormer achieves state-of-the-art performance with high efficiency.
Conclusion: Superquadric primitives represent complex structures efficiently, and QuadricFormer performs well in both accuracy and efficiency.
Abstract: 3D occupancy prediction is crucial for robust autonomous driving systems as it enables comprehensive perception of environmental structures and semantics. Most existing methods employ dense voxel-based scene representations, ignoring the sparsity of driving scenes and resulting in inefficiency. Recent works explore object-centric representations based on sparse Gaussians, but their ellipsoidal shape prior limits the modeling of diverse structures. In real-world driving scenes, objects exhibit rich geometries (e.g., cuboids, cylinders, and irregular shapes), necessitating excessive ellipsoidal Gaussians densely packed for accurate modeling, which leads to inefficient representations. To address this, we propose to use geometrically expressive superquadrics as scene primitives, enabling efficient representation of complex structures with fewer primitives through their inherent shape diversity. We develop a probabilistic superquadric mixture model, which interprets each superquadric as an occupancy probability distribution with a corresponding geometry prior, and calculates semantics through probabilistic mixture. Building on this, we present QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction, and introduce a pruning-and-splitting module to further enhance modeling efficiency by concentrating superquadrics in occupied regions. Extensive experiments on the nuScenes dataset demonstrate that QuadricFormer achieves state-of-the-art performance while maintaining superior efficiency.
[94] Fine-Grained Perturbation Guidance via Attention Head Selection
Donghoon Ahn,Jiwon Kang,Sanghyun Lee,Minjae Kim,Jaewon Min,Wooseok Jang,Saungwu Lee,Sayak Paul,Susung Hong,Seungryong Kim
Main category: cs.CV
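TL;DR: The paper proposes HeadHunter, a systematic framework that selects attention heads at fine granularity to improve generation quality and control visual attributes in diffusion models, and introduces SoftPAG to tune perturbation strength.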
Details
Motivation: Existing attention perturbation methods lack a principled way to decide where perturbations should be applied, especially in DiT architectures, where quality-relevant computations are distributed across layers.
Method: The paper studies the granularity of attention perturbation, from the layer level down to individual attention heads, finds that specific heads govern distinct visual concepts, and proposes the HeadHunter framework and the SoftPAG method.
Result: Validated on models including Stable Diffusion 3 and FLUX.1, the approach shows superior performance in both general quality enhancement and style-specific guidance.
Conclusion: The work provides the first head-level analysis of attention perturbation in diffusion models, reveals interpretable specialization within attention layers, and offers a practical way to design effective perturbation strategies.
Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance.
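The SoftPAG interpolation mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming the perturbed map is (1 - alpha) * A + alpha * I for a single head's square attention map A; the function name `soft_perturb`, the parameter name `alpha`, and the toy matrix are hypothetical and not taken from the paper.

```python
def soft_perturb(attn, alpha):
    """Linearly interpolate a square attention map toward the identity matrix.

    alpha = 0 leaves the map unchanged; alpha = 1 replaces it with the
    identity, so each token attends only to itself. Intermediate values
    act as a continuous knob on perturbation strength.
    """
    n = len(attn)
    return [
        [(1.0 - alpha) * attn[i][j] + (alpha if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 3-token attention map for one head; each row sums to 1.
attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]

half = soft_perturb(attn, 0.5)
for row in half:
    print(row)
```

Because both the attention map and the identity have rows that sum to 1, every interpolated row also sums to 1, so the perturbed map remains a valid attention distribution for any alpha between 0 and 1.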
Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
[95] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model
Junqi You,Chieh Hubert Lin,Weijie Lyu,Zhengbo Zhang,Ming-Hsuan Yang
Main category: cs.CV
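TL;DR: InstaInpaint is a reference-based feed-forward framework that completes 3D scene inpainting within 0.4 seconds, a 1000x speed-up over prior methods, while maintaining strong performance.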
Details
Motivation: Existing 3D scene inpainting methods depend on time-consuming computation and cannot meet the needs of real-time or online applications.
Method: The paper proposes a self-supervised masked-finetuning strategy to train a custom large reconstruction model (LRM), and identifies key designs that improve generalization, textural consistency, and geometric correctness.
Result: InstaInpaint maintains leading performance on two standard benchmarks and is applied successfully to flexible downstream tasks such as object insertion and multi-region inpainting.
Conclusion: InstaInpaint clearly outperforms existing methods in speed while keeping state-of-the-art quality, making it suitable for real-time 3D scene inpainting.
Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on a large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000x speed-up over prior methods while maintaining state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting. More video results are available at our project page: https://dhmbb2.github.io/InstaInpaint_page/.
[96] SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis
Weiliang Chen,Jiayi Bi,Yuanhui Huang,Wenzhao Zheng,Yueqi Duan
Main category: cs.CV
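TL;DR: SceneCompleter is a new framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion, addressing the geometric and visual shortcomings of existing methods.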
Details
Motivation: Existing generative approaches complete missing areas in 2D and then rely on 3D recovery techniques, which leads to distorted geometry and overly smooth surfaces. SceneCompleter aims to achieve 3D-consistent generative novel view synthesis directly.
Method: A geometry-appearance dual-stream diffusion model jointly synthesizes novel views in RGBD space, and a scene embedder extracts a more holistic scene understanding from the reference image.
Result: The method shows superior coherence and plausibility in generative novel view synthesis across diverse datasets.
Conclusion: By fusing structural and textural information, SceneCompleter significantly improves the visual coherence and 3D consistency of generated novel views.
Abstract: Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: https://chen-wl20.github.io/SceneCompleter
cs.GR [Back]
[97] Learning-based density-equalizing map
Yanwen Huang,Lok Ming Lui,Gary P. T. Choi
Main category: cs.GR
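TL;DR: The paper proposes a learning-based density-equalizing mapping framework (LDEM) built on deep learning, addressing the limited accuracy, overlapping artifacts, and difficult 2D-to-3D extension of traditional methods.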
Details
Motivation: Traditional density-equalizing mapping methods have limited accuracy, produce overlapping artifacts in extreme cases, and require substantial algorithmic redesign when extended from 2D to 3D.
Method: The paper proposes a framework based on deep neural networks, introducing a loss function that enforces density uniformity and geometric regularity and using a hierarchical approach to predict transformations at both coarse and dense levels.
Result: Across a range of simple and complex density distributions, the method shows better density-equalizing and bijectivity properties than existing methods and can be applied easily to surface remeshing.
Conclusion: LDEM opens up new possibilities for scalable and robust computation of density-equalizing maps.
Abstract: Density-equalizing map (DEM) serves as a powerful technique for creating shape deformations with the area changes reflecting an underlying density function. In recent decades, DEM has found widespread applications in fields such as data visualization, geometry processing, and medical imaging. Traditional approaches to DEM primarily rely on iterative numerical solvers for diffusion equations or optimization-based methods that minimize handcrafted energy functionals. However, these conventional techniques often face several challenges: they may suffer from limited accuracy, produce overlapping artifacts in extreme cases, and require substantial algorithmic redesign when extended from 2D to 3D, due to the derivative-dependent nature of their energy formulations. In this work, we propose a novel learning-based density-equalizing mapping framework (LDEM) using deep neural networks. Specifically, we introduce a loss function that enforces density uniformity and geometric regularity, and utilize a hierarchical approach to predict the transformations at both the coarse and dense levels. Our method demonstrates superior density-equalizing and bijectivity properties compared to prior methods for a wide range of simple and complex density distributions, and can be easily applied to surface remeshing with different effects. Also, it generalizes seamlessly from 2D to 3D domains without structural changes to the model architecture or loss formulation. Altogether, our work opens up new possibilities for scalable and robust computation of density-equalizing maps for practical applications.
[98] FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training
Fuhan Cai,Yong Guo,Jie Li,Wenbo Li,Xiangzhong Fang,Jian Chen
Main category: cs.GR
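TL;DR: FastFLUX is an architecture-level pruning framework that improves the inference efficiency of FLUX through the BRLL method and the ST strategy while preserving image quality.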
Details
Motivation: Existing T2I generation models such as FLUX have massive parameter counts, which leads to slow inference, high memory usage, and difficult deployment, while existing acceleration methods degrade performance noticeably and are costly to train.
Method: The paper proposes Block-wise Replacement with Linear Layers (BRLL) to replace complex residual branches, and introduces Sandwich Training (ST), a localized fine-tuning strategy.
Result: Experiments show that FastFLUX still generates high-quality images with 20% of the hierarchy pruned, while significantly improving inference speed.
Conclusion: FastFLUX effectively addresses the efficiency problem of FLUX and offers an efficient solution for T2I generation.
Abstract: Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20% of the hierarchy pruned. Our code will be available soon.
[99] Token Perturbation Guidance for Diffusion Models
Javad Rajabi,Soroush Mehraban,Seyedmorteza Sadat,Babak Taati
Main category: cs.GR
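TL;DR: The paper proposes Token Perturbation Guidance (TPG), a new method that improves generation quality by directly perturbing intermediate token representations in the diffusion network; it requires no specific training and applies to both conditional and unconditional generation.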
Details
Motivation: Classifier-free guidance (CFG) requires specific training procedures and applies only to conditional generation, which limits its scope. TPG aims to overcome these limitations.
Method: TPG applies a norm-preserving shuffling operation directly to intermediate token representations, providing a stable and effective guidance signal.
Result: Experiments show that TPG achieves nearly a 2× improvement in FID for unconditional generation while closely matching CFG in prompt alignment.
Conclusion: TPG is a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models.
Abstract: Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2× improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models. The code is available at https://github.com/TaatiTeam/Token-Perturbation-Guidance
[100] Ambient Diffusion Omni: Training Good Models with Bad Data
Giannis Daras,Adrian Rodriguez-Munoz,Adam Klivans,Antonio Torralba,Constantinos Daskalakis
Main category: cs.GR
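TL;DR: The paper proposes Ambient Diffusion Omni, a framework that uses low-quality, synthetic, and out-of-distribution images to improve diffusion models by exploiting the spectral power law decay and locality of natural images, yielding clear gains in generation quality.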
Details
Motivation: Conventional diffusion models depend on high-quality datasets, yet low-quality images still carry valuable information; the work asks how this discarded data can be used to improve model performance.
Method: The paper proposes the Ambient Diffusion Omni framework, which exploits the spectral power law decay and locality of images to extract signal from the mixed distribution, so that all available images can be used during training.
Result: The framework achieves state-of-the-art FID on ImageNet and significantly improves both quality and diversity in text-to-image generation.
Conclusion: Noise dampens the skew between the high-quality distribution and the mixed distribution, and the theoretical analysis supports the approach, offering a new way to make use of low-quality data.
Abstract: We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from all available images during training. Our framework exploits two properties of natural images: spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We then use our framework to achieve state-of-the-art ImageNet FID, and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.
[101] Low-Barrier Dataset Collection with Real Human Body for Interactive Per-Garment Virtual Try-On
Zaiqiang Wu,Yechen Li,Jingyuan Liu,Yuki Shibata,Takayuki Hori,I-Chao Shen,Takeo Igarashi
Main category: cs.GR
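TL;DR: The paper proposes collecting garment data with real human bodies, addressing the high cost and inaccurate alignment of existing virtual try-on techniques, and improves try-on results with a hybrid person representation.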
Details
Motivation: Existing per-garment virtual try-on methods depend on an expensive robotic mannequin that cannot accurately replicate human body deformation, and the synthesized garments often misalign with the body.
Method: Garment data is collected using real human bodies, and the intermediate representation is enhanced with a simplified DensePose map, enabling low-cost and accurate virtual try-on.
Result: The method outperforms existing approaches in image quality and temporal consistency, and a user study indicates that the system helps with garment purchasing decisions.
Conclusion: The approach lowers the cost of data collection and improves the accuracy and practicality of virtual try-on.
Abstract: Existing image-based virtual try-on methods are often limited to the front view and lack real-time performance. While per-garment virtual try-on methods have tackled these issues by capturing per-garment datasets and training per-garment neural networks, they still encounter practical limitations: (1) the robotic mannequin used to capture per-garment datasets is prohibitively expensive for widespread adoption and fails to accurately replicate natural human body deformation; (2) the synthesized garments often misalign with the human body. To address these challenges, we propose a low-barrier approach for collecting per-garment datasets using real human bodies, eliminating the necessity for a customized robotic mannequin. We also introduce a hybrid person representation that enhances the existing intermediate representation with a simplified DensePose map. This ensures accurate alignment of synthesized garment images with the human body and enables human-garment interaction without the need for customized wearable devices. We performed qualitative and quantitative evaluations against other state-of-the-art image-based virtual try-on methods and conducted ablation studies to demonstrate the superiority of our method regarding image quality and temporal consistency. Finally, our user study results indicated that most participants found our virtual try-on system helpful for making garment purchasing decisions.
[102] Edit360: 2D Image Edits to 3D Assets from Any Angle
Junchao Huang,Xinting Hu,Zhuotao Tian,Shaoshuai Shi,Li Jiang
Main category: cs.GR
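TL;DR: Edit360 is a tuning-free framework that extends 2D image editing to multi-view consistent 3D editing, addressing the viewpoint restrictions and consistency problems of existing methods.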
Details
Motivation: Diffusion models have made significant progress in image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods usually restrict the editing angles and lack flexibility.
Method: Built on video diffusion models, Edit360 selects anchor views for 2D editing and uses an Anchor-View Editing Propagation mechanism to propagate edits within the latent and attention spaces, ensuring multi-view consistency.
Result: Edit360 supports user-specific editing from arbitrary viewpoints and reconstructs high-quality 3D assets, enabling customizable 3D content creation.
Conclusion: Edit360 offers a flexible and consistent solution for 3D editing and extends the use of diffusion models in the 3D domain.
Abstract: Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.
[103] Transformer IMU Calibrator: Dynamic On-body IMU Calibration for Inertial Motion Capture
Chengxu Zuo,Jiawei Huang,Xiao Jiang,Yuan Yao,Xiangren Shi,Rui Cao,Xinyu Yi,Feng Xu,Shihui Guo,Yipeng Qin
Main category: cs.GR
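TL;DR: The paper proposes a novel dynamic calibration method that breaks the absolute static assumption in IMU calibration and estimates RG'G and RBS in real time, significantly expanding the range of application scenarios.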
Details
Motivation: Traditional IMU calibration relies on the absolute static assumption (RG'G and RBS stay constant), which restricts application scenarios. The paper aims to remove this restriction with dynamic calibration.
Method: Under two relaxed assumptions (the matrices change negligibly within a short time window, and the motion within that window is diverse), a Transformer model learns the mapping between RG'G, RBS, and IMU readings, and a calibration trigger ensures that the diversity assumption holds.
Result: The work is the first to achieve implicit IMU calibration (no explicit calibration process) and to support long-term, accurate motion capture with sparse IMUs. The code and dataset are publicly available.
Conclusion: The method overcomes the traditional restriction and provides a more flexible, efficient, and accurate dynamic calibration scheme for sparse IMU systems.
Abstract: In this paper, we propose a novel dynamic calibration method for sparse inertial motion capture systems, which is the first to break the restrictive absolute static assumption in IMU calibration, i.e., the coordinate drift RG'G and measurement offset RBS remain constant during the entire motion, thereby significantly expanding their application scenarios. Specifically, we achieve real-time estimation of RG'G and RBS under two relaxed assumptions: i) the matrices change negligibly in a short time window; ii) the human movements/IMU readings are diverse in such a time window. Intuitively, the first assumption reduces the number of candidate matrices, and the second assumption provides diverse constraints, which greatly reduces the solution space and allows for accurate estimation of RG'G and RBS from a short history of IMU readings in real time. To achieve this, we created synthetic datasets of paired RG'G, RBS matrices and IMU readings, and learned their mappings using a Transformer-based model. We also designed a calibration trigger based on the diversity of IMU readings to ensure that assumption ii) is met before applying our method. To our knowledge, we are the first to achieve implicit IMU calibration (i.e., seamlessly putting IMUs into use without the need for an explicit calibration process), as well as the first to enable long-term and accurate motion capture using sparse IMUs. The code and dataset are available at https://github.com/ZuoCX1996/TIC.
cs.CL [Back]
[104] A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations
Tian Lan,Yang-Hao Zhou,Zi-Ao Ma,Fanshu Sun,Rui-Qing Sun,Junyu Luo,Rong-Cheng Tu,Heyan Huang,Chen Xu,Zhijing Wu,Xian-Ling Mao
Main category: cs.CL
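TL;DR: This paper surveys automatic evaluation methods for generated content (text, images, audio), proposes a unified taxonomy, and discusses future research directions in cross-modal evaluation.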
Details
Motivation: There is currently no systematic framework that comprehensively organizes evaluation methods for generated content across modalities (text, visual, audio).
Method: Reviews existing methods, identifies five fundamental paradigms, and extends them to image and audio generation.
Result: Establishes a unified taxonomy and demonstrates its applicability across modalities.
Conclusion: Future research should focus on further developing cross-modal evaluation methods.
Abstract: Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.
[105] TaskCraft: Automated Generation of Agentic Tasks
Dingfeng Shi,Jingyi Cao,Qianben Chen,Weichen Sun,Weizhen Li,Hongxuan Lu,Fangchen Dong,Tianrui Qin,King Zhu,Minghao Yang,Jian Yang,Ge Zhang,Jiaheng Liu,Changwang Zhang,Jun Wang,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
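TL;DR: TaskCraft is an automated workflow for generating scalable, multi-tool, and verifiable agentic tasks, addressing the lack of tool interaction in existing instruction data and the high cost of human annotation.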
Details
Motivation: Agentic tasks are increasingly important in NLP and AI, but existing data lacks tool interaction and depends on costly human annotation, which limits scalability.
Method: TaskCraft expands atomic tasks through depth-based and width-based extensions to produce structurally and hierarchically complex challenges, and supports large-scale synthetic datasets.
Result: Experiments show that tasks generated by TaskCraft improve prompt optimization and supervised fine-tuning; a dataset of about 36,000 tasks is provided.
Conclusion: TaskCraft offers a scalable solution for tuning and evaluating agents and supports future research.
Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.
[106] A quantum semantic framework for natural language processing
Christopher J. Agostino,Quan Le Thien,Molly Apsel,Denizhan Pak,Elina Lesyk,Ashabari Majumdar
Main category: cs.CL
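TL;DR: The paper examines how semantic degeneracy affects natural language understanding, argues that meaning is actualized through an observer-dependent interpretive act, and experimentally tests the non-classical contextuality of linguistic interpretation.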
Details
Motivation: To expose the limits of large language models (LLMs) and other NLP systems in recovering a single intended meaning from complex semantic expressions, and to challenge the traditional view that linguistic forms carry meaning in themselves.
Method: Analyzes semantic degeneracy with Kolmogorov complexity theory and runs a semantic Bell inequality test in which LLM agents interpret ambiguous word pairs under varied contextual settings.
Result: CHSH expectation values range from 1.2 to 2.8, and some runs significantly violate the classical bound (|S| ≤ 2), indicating that linguistic interpretation under ambiguity exhibits non-classical contextuality.
Conclusion: Classical frequentist analysis is lossy for natural language; Bayesian-style repeated sampling is suggested as a more practical way to characterize linguistic meaning in context.
Abstract: Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. Large Language Models (LLMs) and other modern NLP systems face inherent limitations precisely because they operate within natural language itself, making them subject to the same interpretive constraints imposed by semantic degeneracy. In this work, we argue using Kolmogorov complexity that as an expression's complexity grows, the likelihood of any interpreting agent (human or LLM-powered AI) recovering the single intended meaning vanishes. This computational intractability suggests the classical view that linguistic forms possess meaning in and of themselves is flawed. We alternatively posit that meaning is instead actualized through an observer-dependent interpretive act. To test this, we conducted a semantic Bell inequality test using diverse LLM agents as "computational cognitive systems" to interpret ambiguous word pairs under varied contextual settings. Across several independent experiments, we found average CHSH expectation values ranging from 1.2 to 2.8, with several runs yielding values (e.g., 2.3-2.4) that significantly violate the classical boundary ($|S| \leq 2$). This demonstrates that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy.
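The CHSH statistic reported in this abstract can be illustrated with a minimal sketch. It assumes each interpretation is coded as +1 or -1 and uses the common convention S = E(a,b) - E(a,b') + E(a',b) + E(a',b'). The function names and the outcome coding are illustrative assumptions, not the authors' code or experimental protocol.

```python
from typing import Sequence, Tuple

# One trial: the two interpreters' outcomes, each coded as +1 or -1 (assumed coding).
Pair = Tuple[int, int]


def correlation(pairs: Sequence[Pair]) -> float:
    """Estimate E(x, y) as the mean product of paired +/-1 outcomes."""
    if not pairs:
        raise ValueError("need at least one outcome pair")
    return sum(a * b for a, b in pairs) / len(pairs)


def chsh(e_ab: float, e_ab2: float, e_a2b: float, e_a2b2: float) -> float:
    """Combine four correlations into S = E(a,b) - E(a,b') + E(a',b) + E(a',b')."""
    return e_ab - e_ab2 + e_a2b + e_a2b2


def violates_classical_bound(s: float, bound: float = 2.0) -> bool:
    """True when |S| exceeds the classical (local hidden variable) bound of 2."""
    return abs(s) > bound
```

Under this convention a value such as 2.3 would count as a violation and 1.2 would not, which matches how the abstract separates its runs.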
Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.
[107] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information
Christodoulos Constantinides,Shuxin Lin,Nianjun Zhou,Dhaval Patel
Main category: cs.CL
TL;DR: This paper proposes Chat-of-Thought, a multi-agent system for improving the generation and validation of FMEA documents for industrial assets.
Details
Motivation: Industrial equipment monitoring needs efficient generation and validation of FMEA documents, and traditional methods are inefficient, which motivates a multi-agent solution.
Method: Uses collaborating LLM agents with distinct roles, combined with dynamic task routing and a Chat of Thought mechanism, to refine content iteratively.
Result: Template-driven workflows and context-aware agent collaboration address the challenges of industrial equipment monitoring.
Conclusion: Chat-of-Thought shows the potential of multi-agent systems in industrial settings and offers a new approach to FMEA document generation.
Abstract: This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.
[108] When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs
Xiao Li,Joel Kreuzwieser,Alan Peters
Main category: cs.CL
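TL;DR: The paper studies how prompts with the same meaning but different wording affect large language model (LLM) behavior and proposes the Prompt-Based Semantic Shift (PBSS) framework for measuring behavioral drift under semantically equivalent prompts.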
Details
Motivation: To examine whether LLMs behave differently under prompts that mean the same thing but are worded differently, exposing a dimension of evaluation stability that may be overlooked.
Method: Proposes the PBSS framework and applies it to ten constrained tasks to analyze behavioral drift under semantically equivalent prompts.
Result: Models show consistent behavioral drift under semantically equivalent prompts, linked to tokenization and decoding strategies.
Conclusion: Prompt tokenization and decoding dynamics may affect quality-of-service stability and deserve attention in model evaluation.
Abstract: We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.
[109] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering
Caijun Jia,Nan Xu,Jingxuan Wei,Qingli Wang,Lei Wang,Bihui Yu,Junnan Zhu
Main category: cs.CL
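TL;DR: The paper proposes ChartReasoner, a code-driven two-stage framework that addresses information loss in visual reasoning by converting charts into structured code and generating high-quality reasoning data, enabling precise and interpretable chart reasoning.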
Details
Motivation: Existing methods that convert visual reasoning tasks into textual reasoning tasks often lose key visual information, especially in tasks that need a lot of visual detail, such as chart question answering.
Method: ChartReasoner first converts chart images into structured ECharts code, then uses a data synthesis pipeline to generate high-quality reasoning trajectories, and finally trains a multimodal model with supervised fine-tuning and reinforcement learning.
Result: On four public benchmarks, ChartReasoner performs strongly and approaches proprietary systems such as GPT-4o while using fewer parameters.
Conclusion: ChartReasoner addresses information loss in visual reasoning and provides an efficient, scalable solution for chart reasoning tasks.
Abstract: Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.
[110] Unsupervised Elicitation of Language Models
Jiaxin Wen,Zachary Ankner,Arushi Somani,Peter Hase,Samuel Marks,Jacob Goldman-Wetzler,Linda Petrini,Henry Sleight,Collin Burns,He He,Shi Feng,Ethan Perez,Jan Leike
Main category: cs.CL
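TL;DR: The paper proposes ICM, an unsupervised algorithm for fine-tuning pretrained language models without external supervision, which outperforms training on human supervision.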
Details
Motivation: High-quality human supervision is hard to obtain for models with superhuman capabilities, which motivates an unsupervised method.
Method: Introduces Internal Coherence Maximization (ICM), which fine-tunes a model on labels it generates itself.
Result: Across several tasks, ICM matches golden supervision and outperforms crowdsourced human supervision; it does even better on tasks where models are superhuman.
Conclusion: ICM can improve the training of frontier language models and outperforms human-supervised approaches.
Abstract: To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
[111] When Large Language Models are Reliable for Judging Empathic Communication
Aakriti Kumar,Nalin Poungpeth,Diyi Yang,Erina Farrell,Bruce Lambert,Matthew Groh
Main category: cs.CL
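TL;DR: The study compares how reliably experts, crowdworkers, and large language models (LLMs) annotate empathic communication under four frameworks from psychology, natural language processing, and communications. LLMs approach expert-level reliability and exceed crowdworkers.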
Details
Motivation: To examine how reliable LLMs are at judging empathic communication, as a basis for transparency and oversight in emotionally sensitive applications.
Method: Uses four evaluative frameworks on 200 real-world conversations and compares annotation agreement and reliability across experts, crowdworkers, and LLMs.
Result: Expert agreement is high but varies across framework sub-components; LLMs approach expert-level reliability and exceed crowdworkers.
Conclusion: LLMs are reliable on specific tasks and can be used in emotionally sensitive applications such as conversational companions, given appropriate validation and benchmarks.
Abstract: Large language models (LLMs) excel at generating empathic responses in text-based conversations. But how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks' sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert-level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.
[112] Analyzing Emotions in Bangla Social Media Comments Using Machine Learning and LIME
Bidyarthi Paul,SM Musfiqur Rahman,Dipta Biswas,Md. Ziaul Hasan,Md. Zahid Hossain
Main category: cs.CL
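TL;DR: The study applies several machine learning models (Linear SVM, KNN, Random Forest, BiLSTM, and AdaBoost) to emotion analysis of Bangla social media comments and examines the effects of PCA dimensionality reduction and LIME explanations of model predictions.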
Details
Motivation: To advance emotion analysis for low-resource languages such as Bangla, particularly with respect to their distinctive regional expressions and cultural features.
Method: Uses 22,698 social media comments from the EmoNoBa dataset with a TF-IDF vectorizer and n-gram features, applies several machine learning models (Linear SVM, KNN, Random Forest, BiLSTM, and AdaBoost), and studies PCA dimensionality reduction and LIME explanations of predictions.
Result: By comparing models and techniques, the study identifies efficient methods for Bangla emotion analysis.
Conclusion: The study provides effective techniques for emotion analysis in low-resource languages and shows the potential of explanation tools such as LIME to make models easier to understand.
Abstract: Research on understanding emotions in written language continues to expand, especially for understudied languages with distinctive regional expressions and cultural features, such as Bangla. This study examines emotion analysis using 22,698 social media comments from the EmoNoBa dataset. For language analysis, we employ machine learning models: Linear SVM, KNN, and Random Forest with n-gram data from a TF-IDF vectorizer. We additionally investigated how PCA affects the reduction of dimensionality. Moreover, we utilized a BiLSTM model and AdaBoost to improve decision trees. To make our machine learning models easier to understand, we used LIME to explain the predictions of the AdaBoost classifier, which uses decision trees. With the goal of advancing sentiment analysis in languages with limited resources, our work examines various approaches to find efficient techniques for emotion identification in Bangla.
[113] Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities
Elizabeth Demers,Victor Xiaoqi Wang,Kean Wu
Main category: cs.CL
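TL;DR: The paper presents a machine-learning-based lexicon development method for measuring corporate disclosures about the multidimensional management of human capital (HC).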
Details
Motivation: Human capital is increasingly important to corporate value creation, but there are no well-defined measurement or disclosure rules for it.
Method: Trains the word2vec algorithm on HC disclosure data to develop a lexicon of HC-related keywords in five subcategories.
Result: Provides the HC lexicon, corporate disclosure data, and Python code to support further research and BERT fine-tuning.
Conclusion: Future research can use the lexicon to explore further questions about HC management and disclosure.
Abstract: Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. We share our lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and we provide detailed examples of using our data and code, including for fine-tuning a BERT model. Researchers can use our HC lexicon (or modify the code to capture another construct of interest) with their samples of corporate communications to address pertinent HC questions. We close with a discussion of future research opportunities related to HC management and disclosure.
[114] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective
Yi Wang,Max Kreminski
Main category: cs.CL
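TL;DR: The paper examines the story generation ability of large language models (LLMs) by evaluating them on narrative planning problems, finding that GPT-4 tier models produce causally sound stories at small scales while character intentionality and dramatic conflict remain challenging.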
Details
Motivation: Understanding of LLMs' ability to generate high-quality stories is limited by the shortcomings of automatic evaluation and the cost and subjectivity of manual evaluation; the study aims to deepen that understanding through narrative planning problems.
Method: Uses LLMs to solve narrative planning problems and builds an evaluation benchmark from literature examples, focusing on causal soundness, character intentionality, and dramatic conflict.
Result: GPT-4 tier LLMs can generate causally sound stories at small scales, but character intentionality and dramatic conflict still require LLMs trained with reinforcement learning.
Conclusion: The study maps the limits of LLM story generation and offers challenges and considerations for applying LLM narrative planning in game environments.
Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs' ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs' story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.
[115] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
Shubhashis Roy Dipta,Francis Ferraro
Main category: cs.CL
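TL;DR: Q2E is a zero-shot multilingual text-to-video retrieval method built on LLMs and VLMs that decomposes queries to improve the understanding and retrieval of complex events.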
Details
Motivation: To improve the identification and retrieval of videos about complex real-world events by using the latent parametric knowledge of LLMs and VLMs.
Method: Proposes Q2E, which decomposes queries using knowledge from LLMs and VLMs, supports multimodal inputs (visual and speech), and uses entropy-based fusion scoring.
Result: Outperforms existing baselines across multiple datasets and retrieval metrics; integrating audio information significantly improves retrieval.
Conclusion: Q2E improves retrieval of complex event videos and supports multimodal inputs; code and data have been released.
Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
[116] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Prakamya Mishra,Jiang Liu,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum
Main category: cs.CL
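TL;DR: TTT-Bench is a new benchmark for evaluating large reasoning models on simple strategic, spatial, and logical reasoning tasks; it finds that these models perform poorly on game tasks that humans solve with ease.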
Details
Motivation: Large reasoning models do well in STEM domains, but their reasoning ability in broader task domains remains underexplored.
Method: Designs four two-player Tic-Tac-Toe-style game tasks, generates problems with a scalable programmatic approach, and evaluates model reasoning.
Result: Large reasoning models score significantly lower on TTT-Bench than on math tasks and do worse still on long-term strategic reasoning.
Conclusion: TTT-Bench exposes the limits of large reasoning models on tasks that are simple but require strategic and spatial reasoning.
Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board's spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench compared to MATH 500 and AIME 2024, respectively, with larger models achieving higher performance using shorter reasoning traces, where most of the models struggle on long-term strategic reasoning situations on simple and new TTT-Bench tasks.
[117] Classifying Unreliable Narrators with Large Language Models
Anneliese Brei,Katharine Henry,Abhisheik Sharma,Shashank Srivastava,Snigdha Chaturvedi
Main category: cs.CL
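TL;DR: The paper proposes a computational approach to identifying unreliable narrators, builds TUNa, a multi-domain human-annotated dataset, and tests several LLMs on the task.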
Details
Motivation: To use computational methods to determine whether a narrator is reliable, especially narrators who unintentionally misrepresent information.
Method: Defines types of unreliable narrators using narratology, builds the TUNa dataset, and tests LLMs on the classification tasks in few-shot, fine-tuning, and curriculum learning settings.
Result: The task is very challenging, but LLMs show potential for identifying unreliable narrators.
Conclusion: The annotated dataset and code are released to encourage future research in this area.
Abstract: Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.
[118] ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese
Iago Alves Brito,Julia Soares Dollis,Fernanda Bufon Färber,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho
Main category: cs.CL
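TL;DR: ToxSyn-PT is the first large-scale Portuguese corpus that supports fine-grained hate-speech classification across nine legally protected minority groups. It is generated through a four-stage pipeline, and experiments show strong results on multiple classification tasks.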
Details
Motivation: To address the scarcity of Portuguese hate-speech data by providing a diverse and balanced dataset.
Method: A four-stage pipeline: a manual seed, few-shot expansion, paraphrase-based augmentation, and enrichment with neutral texts.
Result: The corpus yields strong results on multiple classification tasks and generalizes across domains.
Conclusion: ToxSyn-PT is an important resource for hate-speech detection research in low-resource settings.
Abstract: We present ToxSyn-PT, the first large-scale Portuguese corpus that enables fine-grained hate-speech classification across nine legally protected minority groups. The dataset contains 53,274 synthetic sentences equally distributed between minority groups and toxicity labels. ToxSyn-PT is created through a novel four-stage pipeline: (1) a compact, manually curated seed; (2) few-shot expansion with an instruction-tuned LLM; (3) paraphrase-based augmentation; and (4) enrichment, plus additional neutral texts to curb overfitting to group-specific cues. The resulting corpus is class-balanced, stylistically diverse, and free from the social-media domain that dominates existing Portuguese datasets. Despite domain differences with traditional benchmarks, experiments on both binary and multi-label classification on the corpus yield strong results across five public Portuguese hate-speech datasets, demonstrating robust generalization even across domain boundaries. The dataset is publicly released to advance research on synthetic data and hate-speech detection in low-resource settings.
[119] Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models
Andrea Yaoyun Cui,Pengfei Yu
Main category: cs.CL
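TL;DR: Under certain conditions language models can show near-deterministic decision behavior, which challenges the usual sampling assumption and reveals that simulated Gibbs sampling can produce a "false prior."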
Details
Motivation: To examine whether language models have Bayesian brains and to test how stochastic or deterministic their decisions are.
Method: Analyzes the decision patterns of several large language models under different conditions and proposes a way to distinguish stochastic from deterministic decisions.
Result: Language models can produce maximum likelihood estimates even at a non-zero sampling temperature, showing deterministic behavior, and simulated Gibbs sampling can then converge to a false prior.
Conclusion: The study reveals the complexity of language model decision behavior, offers a new perspective on its mechanisms, and proposes a way to avoid inferring misleading priors.
Abstract: Language models are essentially probability distributions over token sequences. Auto-regressive models generate sentences by iteratively computing and sampling from the distribution of the next token. This iterative sampling introduces stochasticity, leading to the assumption that language models make probabilistic decisions, similar to sampling from unknown distributions. Building on this assumption, prior research has used simulated Gibbs sampling, inspired by experiments designed to elicit human priors, to infer the priors of language models. In this paper, we revisit a critical question: Do language models possess Bayesian brains? Our findings show that under certain conditions, language models can exhibit near-deterministic decision-making, such as producing maximum likelihood estimations, even with a non-zero sampling temperature. This challenges the sampling assumption and undermines previous methods for eliciting human-like priors. Furthermore, we demonstrate that without proper scrutiny, a system with deterministic behavior undergoing simulated Gibbs sampling can converge to a "false prior." To address this, we propose a straightforward approach to distinguish between stochastic and deterministic decision patterns in Gibbs sampling, helping to prevent the inference of misleading language model priors. We experiment on a variety of large language models to identify their decision patterns under various circumstances. Our results provide key insights into the decision making of large language models.
[120] ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs
Zige Wang,Qi Zhu,Fei Mi,Minghui Xu,Ruochun Jin,Wenjing Yang
Main category: cs.CL
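TL;DR: Proposes ClusterUCB, an efficient gradient-based data selection framework built on clustering and a modified UCB algorithm, which greatly reduces computing consumption.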
Details
Motivation: Traditional gradient-based data selection consumes too many computing resources to be practical.
Method: Groups the data by clustering and uses a modified UCB algorithm to solve the budget allocation problem, balancing exploration and exploitation.
Result: Experiments show that ClusterUCB achieves results close to the original methods while greatly reducing computing consumption.
Conclusion: ClusterUCB offers a feasible approach to efficient data selection.
Abstract: Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many resources to be feasible in practice. In this paper, we propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition that data samples with similar gradient features will have similar influences, we first perform clustering on the training data pool. Then, we frame the inter-cluster data selection as a constrained computing budget allocation problem and consider it a multi-armed bandit problem. A modified UCB algorithm is leveraged to solve this problem. Specifically, during the iterative sampling process, historical data influence information is recorded to directly estimate the distributions of each cluster, and a cold start is adopted to balance exploration and exploitation. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods while greatly reducing computing consumption.
[121] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages
Ali Almutairi,Abdullah Alsuhaibani,Shoaib Jameel,Usman Naseem,Gelareh Mohammadi,Imran Razzak
Main category: cs.CL
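TL;DR: The paper proposes Flick, a new method for few-label text classification in low-resource languages that improves model performance by improving pseudo-label quality.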
Details
Motivation: To reduce reliance on large amounts of labelled data and to address the pseudo-label noise and domain adaptation problems that existing methods face in low-resource languages.
Method: Flick distils high-confidence pseudo-labels from a broad set of initial clusters and adds a pseudo-label refinement component based on single-cluster cohesion and an adaptive top-k selection mechanism.
Result: Flick shows superior performance and adaptability on 14 diverse datasets, including low-resource languages such as Arabic and Urdu.
Conclusion: By improving pseudo-label quality, Flick significantly improves few-label classification in low-resource languages.
Abstract: Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels.
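The abstract does not say how cluster cohesion is measured or how k adapts, so the following is only one illustrative reading, not the authors' implementation. It assumes cohesion is the mean cosine similarity of a cluster's members to their centroid, and that the "adaptive" k keeps every cluster at or above the mean cohesion, capped at a hypothetical `max_k`. All function names are made up for this sketch.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cohesion(vectors: List[Vector]) -> float:
    """Assumed cohesion score: mean cosine similarity of members to the centroid."""
    c = centroid(vectors)
    return sum(cosine(v, c) for v in vectors) / len(vectors)


def select_top_clusters(clusters: Dict[str, List[Vector]], max_k: int) -> List[str]:
    """Keep clusters whose cohesion is at least the mean cohesion, at most max_k of them."""
    scores = {name: cohesion(vs) for name, vs in clusters.items() if vs}
    if not scores:
        return []
    mean_score = sum(scores.values()) / len(scores)
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    kept = [name for name in ranked if scores[name] >= mean_score][:max_k]
    # Always keep the single most cohesive cluster so the result is never empty.
    return kept or ranked[:1]
```

Pseudo-labels would then be drawn only from the kept clusters; how those labels feed into fine-tuning is outside what this sketch covers.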
We demonstrate Flick's efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.
[122] "Check My Work?": Measuring Sycophancy in a Simulated Educational Context
Chuck Arvin
Main category: cs.CL
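TL;DR: The study examines how user suggestions affect large language models (LLMs) in a simulated educational setting, finding that answer quality depends strongly on how the question is framed and that smaller models are more prone to sycophancy.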
Details
Motivation: To examine the sycophantic behavior that user suggestions trigger in LLMs in educational settings and its effect on educational equity.
Method: Tests five LLMs of different sizes (from the OpenAI GPT-4o and GPT-4.1 families) and evaluates changes in answer quality across five experimental conditions.
Result: When the student mentions an incorrect answer, LLM accuracy drops by up to 15 percentage points; mentioning the correct answer raises it by the same margin. Smaller models are more sycophantic (up to 30% for GPT-4.1-nano versus 8% for GPT-4o).
Conclusion: LLM sycophancy may widen educational inequality, and its mechanisms and mitigations need further study.
Abstract: This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs "flip" their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanisms behind such bias in the educational context, and ways to mitigate it.
[123] Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
Hayato Futami,Emiru Tsunoo,Yosuke Kashiwagi,Yuki Ito,Hassan Shahmohammadi,Siddhant Arora,Shinji Watanabe
Main category: cs.CL
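TL;DR: The paper proposes training on interleaved speech and text units while gradually reducing the share of text, easing the adaptation of LLMs to the speech modality and consistently improving speech-to-speech translation performance.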
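A minimal sketch of what a decreasing text-ratio schedule with word-level interleaving might look like follows. The linear decay, the rule that each word is represented either by its text token or by its speech units, and the names `text_ratio` and `interleave` are all assumptions for illustration; the paper's actual schedule and interleaving scheme may differ.

```python
import random


def text_ratio(step, total_steps, start=1.0, end=0.0):
    """Assumed linear decay of the text share from `start` to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def interleave(aligned_words, ratio, rng):
    """Build one training sequence from word-aligned (text_token, speech_units) pairs.

    Each word is kept as text with probability `ratio`; otherwise its
    discrete speech units are emitted instead.
    """
    sequence = []
    for text_token, speech_units in aligned_words:
        if rng.random() < ratio:
            sequence.append(text_token)
        else:
            sequence.extend(speech_units)
    return sequence


# Hypothetical word-level alignment; the unit IDs are made up.
words = [("hello", ["<u12>", "<u7>"]), ("world", ["<u3>", "<u3>", "<u41>"])]
rng = random.Random(0)
for step in (0, 500, 1000):
    print(step, interleave(words, text_ratio(step, 1000), rng))
```

At step 0 the sequence is all text and at the final step it is all speech units, so the model sees a gradual shift between modalities rather than an abrupt switch.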
Details
Motivation: LLMs are trained on text data and are hard to adapt to the speech modality, especially when speech data is limited.
Method: Train on interleaved speech-text units and gradually decrease the text ratio, giving a progressive modality adaptation from text to speech.
Result: Fine-tuning LLaMA3.2-1B on the CVSS dataset consistently improves translation performance, especially for languages with limited training data.
Conclusion: The proposed method addresses the difficulty of adapting LLMs to the speech modality and improves speech-to-speech translation.
Abstract: Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which presents challenges to adapt them to speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech-text training in this study. We use interleaved speech-text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves translation performance, especially for languages with limited training data.
[124] Code Execution as Grounded Supervision for LLM Reasoning
Dongwon Jung,Wenxuan Zhou,Muhao Chen
Main category: cs.CL
TL;DR: Proposes a method that uses the determinism of program execution to generate high-quality CoT supervision data and improve LLM reasoning.
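To make the idea of execution-grounded reasoning traces concrete, the sketch below records the local variables of a small function line by line with Python's `sys.settrace` and turns the changes into plain sentences. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `trace_execution` and `verbalize`, the sentence template, and the example function are hypothetical.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args), snapshotting its local variables before each executed line."""
    steps = []

    def tracer(frame, event, arg):
        # Record only line events that belong to the traced function itself.
        if frame.f_code is func.__code__ and event == "line":
            steps.append(dict(frame.f_locals))
        return tracer

    previous_tracer = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous_tracer)  # restore whatever tracer was active before
    return steps, result


def verbalize(steps, result):
    """Turn variable changes between snapshots into simple natural-language steps."""
    sentences, previous = [], {}
    for snapshot in steps:
        for name, value in snapshot.items():
            if name not in previous or previous[name] != value:
                sentences.append(f"{name} is now {value!r}.")
        previous = snapshot
    sentences.append(f"The final answer is {result!r}.")
    return sentences


def total_price(unit_price, quantity):
    subtotal = unit_price * quantity
    tax = subtotal // 10
    return subtotal + tax


steps, result = trace_execution(total_price, 20, 3)
print("\n".join(verbalize(steps, result)))
```

Because every sentence is derived from values the interpreter actually computed, the trace cannot contain an arithmetic slip, which is the property that makes execution a useful source of supervision.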
Details
Motivation: Existing methods for generating CoT supervision data rely on costly human annotation or error-prone LLM generation, so reliability is hard to guarantee.
Method: Extract verifiable, step-by-step reasoning traces from code execution and transform them into natural-language CoT reasoning.
Result: Experiments show the method improves LLM reasoning across tasks and reduces meaningless repetition at inference time.
Conclusion: The method yields highly accurate reasoning data, improves inference efficiency, and is scalable.
Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
[125] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Xiaohan Yu,Pu Jian,Chong Chen
Main category: cs.CL
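TL;DR: TableRAG is a hybrid framework that addresses the limitations of existing RAG methods on heterogeneous documents by combining text retrieval with table operations, improving question-answering performance.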
Details
Motivation: Existing RAG methods lose information and reason poorly on heterogeneous documents that mix text and tables; TableRAG aims to address this.
Method: TableRAG iterates over four steps: query decomposition, text retrieval, SQL programming and execution, and intermediate answer generation. The HeteQA benchmark is developed for evaluation.
Result: TableRAG outperforms existing baselines on both public datasets and HeteQA, setting a new state of the art for heterogeneous document question answering.
Conclusion: By unifying text and table processing, TableRAG improves heterogeneous document question answering; the framework is open-sourced.
Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
[126] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Yuhua Jiang,Yuwen Xiong,Yufeng Yuan,Chao Xin,Wenyuan Xu,Yu Yue,Qianchuan Zhao,Lin Yan
Main category: cs.CL
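TL;DR: The paper proposes PAG, a framework in which an LLM alternates between policy and verifier roles within a unified multi-turn reinforcement learning paradigm to correct its own outputs.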
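The control flow of selective revision can be sketched at inference time as below: an answer is revised only when the verification step rejects it. The three callables are stand-ins for prompts to a single model acting in different roles, and the name `verify_then_revise` and the `max_turns` parameter are assumptions for this example. The reinforcement learning that trains the roles is not shown.

```python
def verify_then_revise(question, generate, verify, revise, max_turns=2):
    """Answer a question, revising only when verification detects an error.

    generate(question) -> answer
    verify(question, answer) -> True if the answer is judged correct
    revise(question, answer) -> new answer
    """
    answer = generate(question)
    for _ in range(max_turns - 1):
        if verify(question, answer):
            break  # judged correct: keep the answer, no second attempt
        answer = revise(question, answer)
    return answer


# Toy stand-ins for the model in its policy and verifier roles.
def toy_generate(question):
    return "5"


def toy_verify(question, answer):
    return answer == "4"


def toy_revise(question, answer):
    return "4"


print(verify_then_revise("What is 2 + 2?", toy_generate, toy_verify, toy_revise))
```

Skipping the revision when the verifier accepts the first attempt is what distinguishes this flow from pipelines that always produce a second attempt.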
Details
Motivation: LLMs do well on complex reasoning tasks but cannot reliably verify their own outputs; existing solutions depend on separate verifier models or multi-stage training, which limits scalability.
Method: PAG combines policy and verifier roles, using a selective revision mechanism (revising only when an error is detected) and multi-turn reinforcement learning for self-verification and self-correction.
Result: PAG improves both reasoning and verification: as a policy it raises direct generation and self-correction accuracy; as a verifier its self-verification outperforms self-consistency.
Conclusion: PAG's unified multi-turn RL paradigm improves LLM self-verification and self-correction while alleviating model collapse.
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG's dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
[127] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Yingjin Song,Yupei Du,Denis Paperno,Albert Gatt
Main category: cs.CL
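TL;DR: TempVS is a benchmark for the temporal grounding and reasoning abilities of multimodal large language models (MLLMs) on image sequences. It comprises three main tests, each with a basic grounding test; an evaluation of 38 state-of-the-art models shows performance far below human level.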
Details
Motivation: To study how MLLMs perform on temporal reasoning and grounding and to expose the gap to human ability.
Method: Design the TempVS benchmark with three tasks (event relation inference, sentence ordering, and image ordering) and evaluate 38 MLLMs.
Result: MLLMs perform poorly on TempVS, with a substantial gap to human performance.
Conclusion: TempVS exposes the weaknesses of MLLMs in temporal reasoning and suggests directions for future research.
Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
[128] Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting
Avneet Kaur,Arnav Arora
Main category: cs.CL
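TL;DR: The study analyzes war and peace journalism framing in news coverage of the Israel-Palestine conflict, finding a stronger focus on war-based reporting and differences among US, UK, and Middle Eastern outlets in attributing responsibility for the conflict.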
Details
Motivation: Existing studies of conflict framing are mostly qualitative or limited to surface-level frames and do not examine war and peace journalism framing in depth.
Method: Combine frame semantics with large language models to computationally identify indicators of war and peace journalism in a news corpus.
Result: War-based reporting dominates, and US, UK, and Middle Eastern outlets differ substantially in whom they frame as the assailant and the victims.
Conclusion: Media framing strongly shapes conflict coverage, and the analysis surfaces biases within the media.
Abstract: Framing used by news media, especially in times of conflict, can have substantial impact on readers' opinion, potentially aggravating the conflict itself. Current studies on the topic of conflict framing have limited insights due to their qualitative nature or only look at surface level generic frames without going deeper. In this work, we identify indicators of war and peace journalism, as outlined by prior work in conflict studies, in a corpus of news articles reporting on the Israel-Palestine war. For our analysis, we use computational approaches, using a combination of frame semantics and large language models to identify both communicative framing and its connection to linguistic framing. Our analysis reveals a higher focus on war based reporting rather than peace based. We also show substantial differences in reporting across the US, UK, and Middle Eastern news outlets in framing who the assailant and victims of the conflict are, surfacing biases within the media.
[129] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty
Zehui Ling,Deshu Chen,Hongwei Zhang,Yifeng Jiao,Xin Guo,Yuan Cheng
Main category: cs.CL
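TL;DR: The study improves the reasoning efficiency of large language models (LLMs) by adapting the output-length penalty to problem difficulty: shorter outputs for simple problems and sufficient reasoning for complex ones, improving overall performance.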
Details
Motivation: Techniques such as Chain-of-Thought prompting improve reasoning but produce longer outputs and higher latency; reinforcement learning methods that shorten reasoning apply uniform length penalties that ignore problem complexity, giving suboptimal results.
Method: Divide the reward function and introduce a new output-length penalty to manage the model's reasoning efficiency.
Result: Output lengths are shortened on GSM8K and MATH500 with accuracy preserved or improved; accuracy improves on the harder AIME2024 dataset.
Conclusion: The method balances output length against reasoning accuracy and improves overall LLM performance.
Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem's complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones for accuracy, thus improving the model's overall performance. Specifically, we manage the model's reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.
[130] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers
Xanh Ho,Sunisth Kumar,Yun-Ang Wu,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa
Main category: cs.CL
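TL;DR: The paper reframes table-text alignment as an explanation task in which models must identify the table cells essential for verifying a scientific claim, and builds a dataset with human annotations. Incorporating table alignment information improves verification performance, but most LLMs fail to recover human-aligned rationales.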
Details
Motivation: Scientific claim verification currently predicts only a label, which offers little interpretability or insight into the model's reasoning.
Method: Reframe table-text alignment as an explanation task, build a dataset with human-annotated cells, and propose a taxonomy for handling ambiguous cases.
Result: Table alignment information improves verification performance, but most LLMs fail to recover human-aligned rationales.
Conclusion: Correct predicted labels do not necessarily stem from faithful reasoning; interpretability needs further improvement.
Abstract: Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
[131] Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models
Aleksandra Sorokovikova,Pavel Chizhov,Iuliia Eremenko,Ivan P. Yamshchikov
Main category: cs.CL
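TL;DR: The paper examines bias in large language models (LLMs) and finds that different tasks (such as grading and salary negotiation advice) reveal pronounced bias, a concern that grows when the model already knows the user's socio-demographics.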
Details
Motivation: To study the biased content that inevitably appears in the training data of modern language models and how it surfaces in model outputs.
Method: Evaluate bias through pre-prompted personae, a multi-subject benchmark (MMLU), and task reformulations (grading and salary negotiation advice).
Result: Pre-prompted personae cause negligible, mostly random score differences, whereas the reformulated tasks show more significant signs of bias.
Conclusion: With the trend toward LLM assistant memory and personalization, models already know users' socio-demographics, so bias becomes a more pressing problem.
Abstract: Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user's answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.
[132] Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models
Sangmin Song,Juhwan Choi,JungMin Yun,YoungBin Kim
Main category: cs.CL
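TL;DR: The paper studies large language models (LLMs) on multi-user dialogue state tracking (DST), finds a significant performance drop in multi-user settings, and points to directions for improvement.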
Details
Motivation: Existing DST benchmarks focus on structured user-agent conversations and miss the complexity of real multi-user interactions, so the robustness of LLMs in multi-user DST needs to be assessed.
Method: Extend an existing DST dataset by generating a second user's utterances based on speech act theory, systematically introducing multi-user interaction to evaluate LLMs.
Result: LLM performance drops significantly in multi-user settings compared with single-user DST, exposing the limitations of current models.
Conclusion: Future work should improve LLMs for multi-user DST to enable more realistic and robust DST models.
Abstract: Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user's utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.
[133] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs
Yilin Xiao,Chuang Zhou,Qinggang Zhang,Bo Li,Qing Li,Xiao Huang
Main category: cs.CL
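TL;DR: The paper proposes the RRP framework, which combines knowledge graphs with large language models to generate better reasoning paths and improve complex question answering.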
Details
Motivation: LLMs struggle with knowledge-intensive tasks because they lack background knowledge and tend to hallucinate; existing KG-enhanced methods still have difficulty with complex questions.
Method: RRP combines semantic and structural information through relation embedding and bidirectional distribution learning, and adds a rethinking module that refines reasoning paths.
Result: On two public datasets RRP outperforms existing baselines and can be integrated easily into different LLMs.
Conclusion: By generating high-quality reasoning paths, RRP gives LLMs effective guidance and improves their reasoning.
Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is equally important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses the following challenges: the complexity of graph structures and the existence of multiple generated paths, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.
[134] Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search
Promise Dodzi Kpoglu
Main category: cs.CL
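TL;DR: Proposes an unsupervised method that combines data-driven inference with rule-based heuristics to reconstruct protoforms, substantially outperforming established baselines.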
Details
Motivation: Existing methods rely mainly on data-driven probabilistic models, which limits their performance.
Method: Combine data-driven inference with rule-based heuristics within an evolutionary optimization framework.
Result: On Latin protoform reconstruction, both character-level accuracy and phonological plausibility metrics improve substantially.
Conclusion: The hybrid approach outperforms purely data-driven methods and offers a new direction for language reconstruction.
Abstract: We propose an unsupervised method for the reconstruction of protoforms, i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics.
[135] SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis
Sergio Burdisso,Esaú Villatoro-Tello,Petr Motlicek
Main category: cs.CL
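TL;DR: SDialog is a modular, extensible Python toolkit for generating and analyzing synthetic dialogues; it uses instruction-tuned large language models (LLMs) and supports multi-agent simulation and scenario-driven generation.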
Details
Motivation: To provide high-quality, flexible, and reproducible synthetic dialogue data for training, evaluating, and benchmarking conversational AI systems.
Method: Using instruction-tuned LLMs, SDialog offers abstractions for personas, orchestration, and scenario management to generate diverse, controllable dialogue data.
Result: SDialog supports multi-agent simulation and scenario-driven generation and advances the standardization of tools and frameworks for synthetic data generation.
Conclusion: SDialog gives a fast-moving research field a reproducible tool for conversational AI research.
Abstract: The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today's fast-evolving research landscape.
[136] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
Numaan Naeem,Sarfraz Ahmad,Momina Ahsan,Hasan Iqbal
Main category: cs.CL
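TL;DR: The paper describes a mistake-identification system for the BEA 2025 Shared Task; of the several approaches explored, a retrieval-augmented prompting system performs best.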
Details
Motivation: To evaluate whether an AI tutor correctly identifies mistakes in a student's mathematical reasoning, in support of better pedagogical feedback.
Method: Four approaches are explored: (1) an ensemble over multiple pretrained language models; (2) a frozen sentence-transformer; (3) a history-aware model; (4) a retrieval-augmented few-shot prompting system.
Result: The final system outperforms the baselines, showing the value of combining example-driven prompting with LLM reasoning.
Conclusion: The retrieval-augmented prompting system performs best on mistake identification; the code is open-sourced.
Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
[137] Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters
Tatsuya Hiraoka,Kentaro Inui
Main category: cs.CL
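TL;DR: LLMs can spell out tokens character by character but struggle with more complex character-level tasks; the study finds that character-level information is reconstructed in intermediate and higher layers.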
Details
Motivation: To investigate how LLMs internally represent and use character-level information during spelling-out.
Method: Validate the mechanism through three analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.
Result: The embedding layer does not fully encode character-level information; LLMs rely on intermediate and higher layers to reconstruct character-level knowledge.
Conclusion: Spelling behavior in LLMs depends on a "breakthrough" in intermediate and higher layers rather than on simple character-level encoding.
Abstract: Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.
[138] Large Language Models for Detection of Life-Threatening Texts
Thanh Thi Nguyen,Campbell Wilson,Janis Dalins
Main category: cs.CL
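TL;DR: The paper presents an approach to detecting life-threatening language with large language models (LLMs) and compares it with traditional methods. LLMs perform strongly, with Mistral and Llama-2 the top performers.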
Details
Motivation: Detecting life-threatening language is essential for safeguarding individuals, promoting mental health, and preventing harm.
Method: Fine-tune three open-source LLMs (Gemma, Mistral, Llama-2) and compare them with traditional methods (bag of words, word embedding, topic modeling, and BERT).
Result: LLMs outperform traditional methods in both balanced and imbalanced data scenarios, with Mistral and Llama-2 performing best. Upsampling benefits traditional methods but has less impact on LLMs.
Conclusion: LLMs show great potential for real-world life-threatening language detection.
Abstract: Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares them with traditional methods such as bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs including Gemma, Mistral, and Llama-2 using their 7B parameter variants on different datasets, which are constructed with class balance, imbalance, and extreme imbalance scenarios. Experimental results demonstrate a strong performance of LLMs against traditional methods. More specifically, Mistral and Llama-2 models are top performers in both balanced and imbalanced data scenarios while Gemma is slightly behind. We employ the upsampling technique to deal with the imbalanced data scenarios and demonstrate that while this method benefits traditional approaches, it does not have as much impact on LLMs. This study demonstrates a great potential of LLMs for real-world life-threatening language detection problems.
[139] Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet
Lorenzo Augello,John P. McCrae
Main category: cs.CL
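TL;DR: The paper examines how to establish hypernymy between adjectives, presenting a new resource and fine-tuning large language models to predict adjective hypernymy.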
Details
Motivation: Open English Wordnet is missing many links, notably hypernymy between adjectives; this paper aims to fill that gap.
Method: Discuss theoretically how hypernymy differs for adjectives, develop a new resource, and fine-tune large language models following the TaxoLLaMa methodology to predict adjective hypernymy.
Result: A new resource for adjective hypernymy is developed, and the TaxoLLaMa methodology is shown to adapt to this task.
Conclusion: The paper provides theoretical and methodological support for adjective hypernymy and extends the applicability of TaxoLLaMa.
Abstract: Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.
[140] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models
Ye Yu,Yaoning Yu,Haohan Wang
Main category: cs.CL
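TL;DR: PREMISE uses prompt optimization to cut redundant computation in large reasoning models, preserving accuracy while substantially reducing token usage and cost.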
Details
Motivation: The chain-of-thought traces of large reasoning models are verbose, which inflates token usage and cost and limits deployment in latency-sensitive or API-constrained settings.
Method: PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization, balancing token length and answer validity through a multi-objective textual search.
Result: On several math benchmarks PREMISE matches or exceeds baseline accuracy (Claude 96% → 96%, Gemini 91% → 92%) while reducing reasoning tokens by up to 87.5% and cost by 69-82%.
Conclusion: Prompt-level optimization is a scalable way to make large reasoning models more efficient without compromising reasoning quality.
Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy (96% → 96% with Claude, 91% → 92% with Gemini) while reducing reasoning tokens by up to 87.5% and cutting dollar cost by 69-82%. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.
[141] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims
Priyanka Kargupta,Runchu Tian,Jiawei Han
Main category: cs.CL
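TL;DR: ClaimSpect uses hierarchical decomposition and retrieval-augmented generation to build a multi-aspect analysis of nuanced claims automatically, enriched with corpus-specific perspectives.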
Details
Motivation: Real-world claims (such as scientific or political ones) are often nuanced, cannot simply be labeled "true" or "false", and need validation from multiple angles.
Method: ClaimSpect hierarchically decomposes a claim with retrieval-augmented generation, retrieving relevant corpus segments to discover sub-aspects and differing perspectives.
Result: On a dataset of real-world scientific and political claims, the framework proves robust and accurate and outperforms multiple baselines.
Conclusion: ClaimSpect deconstructs nuanced claims and provides multi-aspect analysis for scientific and political domains.
Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely "true" or "false", as is frequently the case with scientific and political claims. However, a claim (e.g., "vaccine A is better than vaccine B") can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., "how many biomedical papers believe vaccine A is more transportable than B?"). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.
[142] TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
Priyanka Kargupta,Nan Zhang,Yunyi Zhang,Rui Zhang,Prasenjit Mitra,Jiawei Han
Main category: cs.CL
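TL;DR: TaxoAdapt dynamically adapts an LLM-generated taxonomy to a corpus across multiple dimensions, improving the organization and retrieval of scientific literature.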
Details
Motivation: Scientific fields evolve quickly; expert-curated taxonomies are time-consuming and expensive, and existing automatic methods lack generalizability and dynamic adaptability.
Method: TaxoAdapt performs iterative hierarchical classification, expanding taxonomy width and depth to fit multidimensional scientific literature.
Result: On computer science conference data, its taxonomies are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines.
Conclusion: TaxoAdapt captures the evolution of scientific fields and outperforms existing baselines.
Abstract: The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus' topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines judged by LLMs.
[143] One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan,Alejandro R. Salamanca,Andres Felipe Cruz-Salinas,Kris Cao,Hangyu Lin,Acyr Locatelli,Marzieh Fadaee,Ahmet Üstün,Sara Hooker
Main category: cs.CL
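TL;DR: The study examines how low-cost interventions early in pretraining, such as a universal tokenizer, improve the language plasticity of multilingual LLMs so that they adapt more efficiently to new languages after training.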
Details
Motivation: Pretraining multilingual LLMs is constrained by limited model capacity, scarce high-quality data, and compute, and poor tokenizer coverage of new languages hampers post-training adaptation.
Method: Use a universal tokenizer trained on more languages than the primary pretraining languages, in place of a tokenizer specific to the pretraining languages.
Result: A universal tokenizer yields significantly higher language adaptation, with win rates up to 20.2% higher, and a gain of up to 5% in win rate on languages entirely unseen by the tokenizer and in pretraining.
Conclusion: A universal tokenizer is a low-cost, effective way to improve adaptation to new languages with minimal compromise in performance on the pretraining languages.
Abstract: Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.
[144] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs
Alberto Testoni,Iacer Calixto
Main category: cs.CL
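TL;DR: The paper evaluates uncertainty estimation methods for clinical multiple-choice question answering across ten open-source LLMs, finding that lightweight single-generation methods approach the performance of Semantic Entropy and that model selection should account for both the question type and model-specific strengths.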
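As a rough illustration of the sampling-based reference point named in this entry: for multiple-choice questions, sampled answers cluster trivially by the chosen option, so an entropy over those clusters can be computed from answer frequencies. The sketch below is a simplified, frequency-based stand-in and not the paper's implementation; the function name `answer_entropy` and the sample data are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers.

    Assumption: each sampled generation has already been reduced to its
    chosen option letter, so answers with the same letter form one cluster.
    """
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples: one question answered consistently, one inconsistently.
consistent = ["B"] * 10
scattered = ["A", "B", "C", "D", "A", "B", "C", "D"]

print(answer_entropy(consistent))  # lowest possible value: all samples agree
print(answer_entropy(scattered))   # highest value for four options: uniform spread
```

Higher entropy signals that the model's samples disagree, at the cost of several generations per question; this is the overhead the single-pass estimators in this entry aim to avoid.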
Details
Motivation: High-stakes domains such as clinical decision support need accurate, well-calibrated uncertainty estimates before LLMs can be deployed reliably. Method: Across two datasets, eleven medical specialties, and six question types, the paper compares standard single-generation and sampling-based methods and explores lightweight single-pass estimators based on behavioral signals in reasoning traces. Result: The lightweight methods approach the performance of Semantic Entropy, and model performance varies by specialty and question type. Conclusion: Model selection should consider both the question type and model-specific strengths; lightweight single-generation methods show promise for uncertainty estimation. Abstract: Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.
[145] Improving Named Entity Transcription with Contextual LLM-based Revision
Viet Anh Trinh,Xinlu He,Jacob Whitehill
Main category: cs.CL
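TL;DR: The paper proposes using a large language model (LLM) to correct misrecognized named entities in ASR predictions, substantially lowering the word error rate (WER) on named entities.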
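The metric this entry reports can be made concrete with a short sketch: WER is the word-level edit distance between reference and hypothesis divided by the reference length, and a relative reduction compares WER before and after revision. The code computes plain WER over whole strings; restricting it to named-entity words, as the paper does, would require entity annotations that are not modeled here. All function names and example strings are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())


def relative_reduction(wer_before, wer_after):
    """Relative WER reduction, e.g. 0.3 for a 30% relative improvement."""
    return (wer_before - wer_after) / wer_before


# Hypothetical example: one misrecognized entity word out of three.
print(wer("the fourier transform", "the furrier transform"))
```

Note that a relative reduction of 30% means the error rate shrinks to 70% of its previous value, not that it drops by 30 percentage points.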
Details
Motivation: Current ASR systems perform poorly on named entities, which carry key information, and misrecognizing them harms downstream applications. Method: Revise named entities in ASR predictions by combining the LLM's reasoning ability with local context (e.g., lecture notes). Result: On the NER-MIT-OpenCourseWare dataset, WER on named entities falls by up to 30% relative. Conclusion: The method effectively improves ASR accuracy on named entities. Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM's reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.
[146] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints
Wei Sun,Tingyu Qu,Mingxiao Li,Jesse Davis,Marie-Francine Moens
Main category: cs.CL
TL;DR: LangEdit is a novel null-space constrained framework for efficiently updating knowledge in multilingual LLMs while avoiding parameter interference.
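The core operation named in this entry, projecting a new update onto the orthogonal complement of the subspace spanned by earlier updates, can be sketched on toy vectors. This is a minimal illustration under strong assumptions: updates are treated as flat vectors and the subspace is built with Gram-Schmidt, whereas the paper works with model weight updates; the function names are hypothetical and not taken from the LangEdit code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, previous_updates):
    """Remove from `update` every component lying in the span of earlier updates."""
    projected = list(update)
    for b in orthonormal_basis(previous_updates):
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected


# Hypothetical earlier updates (e.g. for two languages) and a new update.
earlier = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [2.0, 3.0, 4.0]
safe_update = project_to_complement(new_update, earlier)
print(safe_update)
print([dot(safe_update, e) for e in earlier])  # each inner product is (near) zero
```

Because the projected update has zero inner product with every earlier update direction, applying it cannot move the parameters along those directions, which is the sense in which sequential edits stay independent.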
Details
Motivation: Efficiently updating knowledge in multilingual LLMs while keeping facts consistent across languages remains an open challenge; existing approaches are costly and prone to parameter interference. Method: LangEdit projects each language's parameter updates onto the orthogonal complement of the subspace of previous updates, ensuring update independence and preserving multilingual generalization. Result: Evaluations across three model architectures, six languages, and four downstream tasks show that LangEdit effectively reduces parameter interference and outperforms existing methods. Conclusion: LangEdit offers an efficient and accurate solution for knowledge updates in multilingual LLMs. Abstract: Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previously updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.
[147] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization
Zhensheng Jin,Xinze Li,Yifan Ji,Chunyi Peng,Zhenghao Liu,Qi Shi,Yukun Yan,Shuo Wang,Furong Peng,Ge Yu
Main category: cs.CL
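TL;DR: The paper proposes ReCUT, which uses stepwise exploration and a long-short switched sampling strategy to balance the length and accuracy of LLM reasoning, cutting reasoning length by 30-50% while maintaining or improving accuracy.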
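The final merging step in this entry, interpolating the parameters of an accuracy-oriented model and a brevity-oriented model, can be sketched as a weighted average of matching tensors. This is a minimal sketch assuming plain linear interpolation with a single global weight; the paper's exact scheme may differ, and the parameter names and values below are hypothetical.

```python
def interpolate_parameters(params_a, params_b, weight):
    """Linear interpolation: (1 - weight) * params_a + weight * params_b.

    Both arguments map parameter names to flat lists of floats and must
    describe the same architecture (same names, same sizes).
    """
    if params_a.keys() != params_b.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in params_a:
        a, b = params_a[name], params_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1 - weight) * x + weight * y for x, y in zip(a, b)]
    return merged


# Hypothetical two-parameter "models": one tuned for accuracy, one for brevity.
accurate = {"layer.weight": [1.0, 2.0]}
concise = {"layer.weight": [3.0, 6.0]}
print(interpolate_parameters(accurate, concise, weight=0.5))
```

Sweeping `weight` from 0 to 1 moves the merged model from the accuracy-oriented endpoint to the brevity-oriented one, which gives a single knob for trading reasoning length against accuracy.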
Details
Motivation: Existing CoT prompting methods suffer from overthinking, which yields lengthy or redundant reasoning traces. Prior approaches train LLMs on multiple curated reasoning chains but are limited by the quality of the generated data and by overfitting. Method: ReCUT uses a stepwise exploration mechanism and a long-short switched sampling strategy to generate diverse reasoning paths, builds preference pairs from them to train two specialized models (one optimized for accuracy, one for shorter reasoning), and merges the two by parameter interpolation. Result: Across multiple math reasoning datasets and backbone models, ReCUT reduces reasoning length by 30-50% while maintaining or improving accuracy. Conclusion: ReCUT effectively balances reasoning length and accuracy and outperforms existing baselines. Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models, one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via https://github.com/NEUIR/ReCUT.
[148] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training
Alireza Salemi,Mukta Maddipatla,Hamed Zamani
Main category: cs.CL
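TL;DR: mRAG is a multi-agent retrieval-augmented generation framework that divides work among specialized agents and outperformed conventional RAG baselines in the SIGIR 2025 LiveRAG competition.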
Details
Motivation: Address the limitations of conventional RAG on complex tasks by improving generation through multi-agent collaboration. Method: A self-training paradigm with reward-guided trajectory sampling optimizes inter-agent collaboration. Result: mRAG performed strongly in the SIGIR 2025 LiveRAG competition, surpassing conventional RAG baselines. Conclusion: mRAG handles complex tasks well and is suited to real-world scenarios. Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework's strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.
[149] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles
Qingyan Wei,Yaojie Zhang,Zhiyuan Liu,Dongrui Liu,Linfeng Zhang
Main category: cs.CL
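TL;DR: Proposes SlowFast Sampling, a dynamic sampling strategy that alternates between exploratory and accelerated decoding stages to substantially improve the generation efficiency of diffusion language models.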
Details
Motivation: Existing sampling strategies for diffusion language models behave statically, leading to inefficiency and limited flexibility. Method: SlowFast Sampling dynamically adjusts decoding stages based on three golden principles (certainty, convergence, and position) and integrates dLLM-Cache to reduce redundant computation. Result: Experiments show up to 15.63× speedup on LLaDA, and up to 34.22× when combined with caching, with minimal accuracy drop. Conclusion: Well-designed sampling can unlock the full potential of diffusion language models for fast, high-quality generation. Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.
[150] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models
Michele Gubian,Ioana Krehan,Oli Liu,James Kirby,Sharon Goldwater
Main category: cs.CL
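TL;DR: The study analyzes how wav2vec2 models encode speech information across four languages and finds that the structure of their representations is largely independent of the pretraining language.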
Details
Motivation: Examine how self-supervised speech models represent information across languages, filling the gap left by analyses that focus almost exclusively on English. Method: Probing classifiers and geometric analyses are used to examine how phones, lexical tones, and speaker information are represented. Result: The subspaces encoding phones, tones, and speakers are largely orthogonal, and layerwise probing accuracy follows similar patterns, with a small advantage for matched-language phone and tone probes in the later layers. Conclusion: The structure of the representations learned by wav2vec2 is largely independent of the speech material used in pretraining. Abstract: Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
[151] Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment
Hongda Sun,Jiaren Peng,Wenzhong Yang,Liang He,Bo Du,Rui Yan
Main category: cs.CL
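TL;DR: MedRef is a new medical dialogue system that addresses shortcomings of existing systems through knowledge refining and dynamic prompt adjustment, outperforming prior methods in generation quality and medical entity accuracy.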
Details
Motivation: Existing medical dialogue systems struggle to identify relevant medical knowledge and to generate personalized, accurate responses. Method: A knowledge refining mechanism filters out irrelevant data, a comprehensive prompt structure is designed, and two modules, Triplet Filter and Demo Selector, adapt dynamically to patient needs. Result: MedRef outperforms existing methods on the MedDG and KaMed benchmarks. Conclusion: MedRef is effective and reliable for real-world healthcare applications. Abstract: Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients. However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses. To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment. First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses. Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details. To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt. Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications.
[152] Slimming Down LLMs Without Losing Their Minds
Qingda,Mai
Main category: cs.CL
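TL;DR: This paper studies how parameter-efficient fine-tuning methods (LoRA and QLoRA) affect LLM performance, validating them on commonsense reasoning, mathematical reasoning, and multi-domain knowledge tasks.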
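For readers unfamiliar with the method evaluated here: LoRA freezes a weight matrix W and learns a low-rank correction, so the effective weight is W + (alpha / r) * B A, with B of shape d_out × r and A of shape r × d_in. The sketch below illustrates that formula and the resulting parameter savings in plain Python; it is not the paper's code, and the helper names and toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the LoRA rank."""
    rank = len(lora_a)  # A has shape (r, d_in)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_parameters(d_out, d_in, rank):
    """Trainable parameters in B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Toy rank-1 adapter on a 2x2 frozen weight.
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]  # shape (2, 1)
a = [[3.0, 4.0]]    # shape (1, 2)
print(lora_effective_weight(frozen_w, b, a, alpha=1.0))

# Parameter count for a 4096x4096 layer: full fine-tuning versus rank 8.
print(4096 * 4096, lora_trainable_parameters(4096, 4096, rank=8))
```

QLoRA applies the same low-rank update on top of a quantized frozen base model, which reduces memory further; the quantization step is not shown here.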
Details
Motivation: Explore how much parameter-efficient methods improve model performance under limited resources, and offer practical guidance to developers. Method: LoRA and QLoRA are evaluated on three benchmarks: HellaSwag, GSM8K, and MMLU-CS. Result: LoRA-based methods effectively improve task performance while remaining computationally efficient, and performance depends strongly on how well the fine-tuning dataset aligns with the benchmark task. Conclusion: The study provides theoretical insight into parameter-efficient mechanisms and practical advice for resource-constrained developers. Abstract: This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.
[153] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang,Hanlin Zhu,Tianyu Guo,Jiantao Jiao,Somayeh Sojoudi,Michael I. Jordan,Stuart Russell,Song Mei
Main category: cs.CL
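TL;DR: The paper examines the duality that large language models (LLMs) exhibit during fine-tuning, generalizing from new facts yet also hallucinating, and argues that both behaviors stem from a single mechanism: out-of-context reasoning (OCR).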
Details
Motivation: Understand the root cause of the dual behavior (generalization and hallucination) that LLMs exhibit when acquiring knowledge. Method: Experiments verify the OCR mechanism; OCR is then formalized as a synthetic factual recall task, and one-layer single-head attention-only models are analyzed. Result: OCR explains both generalization and hallucination; matrix factorization is crucial to model performance, and the implicit bias of gradient descent underlies the OCR capability. Conclusion: The work provides a theoretical foundation for understanding OCR and a new lens for analyzing and mitigating undesirable behaviors from knowledge injection. Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
[154] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
Thomas Sounack,Joshua Davis,Brigitte Durieux,Antoine Chaffin,Tom J. Pollard,Eric Lehman,Alistair E. W. Johnson,Matthew McDermott,Tristan Naumann,Charlotta Lindvall
Main category: cs.CL
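TL;DR: BioClinical ModernBERT is an encoder model optimized for biomedical and clinical NLP that markedly improves performance and speed through large-scale continued pretraining on multi-source data.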
Details
Motivation: Encoder models for biomedical and clinical NLP have developed slowly, and existing models are limited in domain adaptation and performance. Method: Building on ModernBERT, the model undergoes continued pretraining on a large biomedical and clinical corpus of over 53.5 billion tokens that draws on 20 diverse datasets. Result: It outperforms existing biomedical and clinical encoders on four downstream tasks. Conclusion: BioClinical ModernBERT offers an efficient solution for biomedical and clinical NLP, and the models and training checkpoints are released. Abstract: Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.
[155] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning
Lan Zhang,Marco Valentino,Andre Freitas
Main category: cs.CL
TL;DR: The paper proposes a systematic LLM-based automatic evaluation method (EFG) to improve the assessment of autoformalization in advanced mathematics.
Details
Motivation: Current methods for evaluating autoformalization are of limited use in complex mathematical domains, and human evaluation is time-consuming and requires domain expertise, so a more efficient and transparent automatic method is needed. Method: An ensemble of LLM judges (EFG) assesses multiple dimensions: logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ). Result: EFG correlates more strongly with human assessments than a coarse-grained model, especially when assessing formal quality. Conclusion: EFG offers scalable, interpretable, and reliable automatic support for evaluating formal mathematical reasoning. Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.
[156] Magistral
Mistral-AI,Abhinav Rastogi,Albert Q. Jiang,Andy Lo,Gabrielle Berrada,Guillaume Lample,Jason Rute,Joep Barmentlo,Karmesh Yadav,Kartik Khandelwal,Khyathi Raghavi Chandu,Léonard Blier,Lucile Saulnier,Matthieu Dinot,Maxime Darrin,Neha Gupta,Roman Soletskyi,Sagar Vaze,Teven Le Scao,Yihan Wang,Adam Yang,Alexander H. Liu,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Andy Ehrenberg,Anmol Agarwal,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Darius Dabert,Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jean-Hadrien Chabran,Jean-Malo Delignon,Joachim Studnia,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Kush Jain,Lingxiao Zhao,Louis Martin,Luyu Gao,Lélio Renard Lavaud,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Maximilian Augustin,Mickaël Seznec,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patrick von Platen,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Pavankumar Reddy Muddireddy,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Romain Sauvestre,Rémi Delacourt,Sanchit Gandhi,Sandeep Subramanian,Shashwat Dalal,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Thibault Schueller,Thibaut Lavril,Thomas Robert,Thomas Wang,Timothée Lacroix,Valeriia Nemychnikova,Victor Paltz,Virgile Richard,Wen-Ding Li,William Marshall,Xuanyu Zhang,Yunhao Tang
Main category: cs.CL
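TL;DR: Magistral is Mistral's first reasoning model, trained with an in-house reinforcement learning (RL) pipeline that explores the limits of pure RL training of LLMs and includes a simple method to force the model's reasoning language. RL on text alone preserves most of the initial checkpoint's capabilities and maintains or improves multimodal understanding, instruction following, and function calling.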
Details
Motivation: Explore the potential of pure reinforcement learning (RL) for training large language models (LLMs) without relying on existing implementations or RL traces distilled from prior models, using only in-house models and infrastructure. Method: An in-house RL pipeline built from the ground up, a simple method to force the model's reasoning language, and an analysis of how pure RL training preserves and improves model capabilities. Result: RL on text maintains or improves multimodal understanding, instruction following, and function calling; Magistral Medium and Magistral Small demonstrate the outcome of this training. Conclusion: Pure RL training of LLMs is feasible and can maintain or improve model capabilities, and the Magistral models offer a new direction for applying RL to LLMs. Abstract: We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.
[157] Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
Or Shafran,Atticus Geiger,Mor Geva
Main category: cs.CL
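TL;DR: The paper proposes decomposing MLP activations in LLMs with semi-nonnegative matrix factorization (SNMF) to identify interpretable feature directions, outperforming existing methods such as SAEs.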
Details
Motivation: Existing methods such as SAEs perform poorly in causal evaluations and lack intrinsic interpretability, so a more direct way to decompose the model's computation is needed. Method: Semi-nonnegative matrix factorization (SNMF) directly decomposes MLP activations into features that are sparse linear combinations of co-activated neurons and maps them to their activating inputs. Result: SNMF outperforms SAEs and a supervised baseline on causal steering, and its features align with human-interpretable concepts. Conclusion: SNMF is a simple and effective tool for identifying interpretable features and concept representations in LLMs. Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
[158] Dynamic Epistemic Friction in Dialogue
Timothy Obiso,Kenneth Lai,Abhijnan Nath,Nikhil Krishnaswamy,James Pustejovsky
Main category: cs.CL
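TL;DR: The paper examines "epistemic friction" in aligning large language models (LLMs) with human preferences, proposes a model of dynamic epistemic friction, and analyzes its ability to predict belief updates in dialogue within the framework of Dynamic Epistemic Logic.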
Details
Motivation: Existing approaches to aligning LLMs with human preferences neglect epistemic friction, the resistance to updating beliefs in light of new information. Method: Dynamic epistemic friction is defined and analyzed within the framework of Dynamic Epistemic Logic on a dialogue task. Result: The model effectively predicts belief updates in dialogue and can be made more sophisticated to accommodate real-world dialogue scenarios. Conclusion: The dynamic epistemic friction model offers a new perspective on belief alignment in human-LLM collaboration and can be refined further for complex scenarios. Abstract: Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of "epistemic friction," or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent's current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.
[159] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Mozhi Zhang,Howe Tissue,Lu Wang,Xipeng Qiu
Main category: cs.CL
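TL;DR: Domain2Vec decomposes datasets into combinations of meta-domains to optimize the data mixture for language model pretraining without any training, substantially improving efficiency and performance.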
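The training-free selection idea in this entry can be illustrated with a toy example: if each dataset is represented by a domain vector (a distribution over meta-domains), a mixture's vector is the weighted average of its parts, and one can pick the weights whose mixture best matches the validation set's vector. The sketch below assumes two candidate datasets, an L1 distance, and a simple grid search; these choices, the function names, and all numbers are illustrative assumptions and not the paper's procedure.

```python
def mix(domain_vectors, weights):
    """Domain vector of a mixture: weighted average of the datasets' domain vectors."""
    dims = len(domain_vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, domain_vectors))
            for k in range(dims)]


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def best_two_way_mixture(vec_a, vec_b, target, steps=100):
    """Grid-search the weight on dataset A that brings the mixture closest to `target`."""
    best_weight, best_dist = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        dist = l1_distance(mix([vec_a, vec_b], [w, 1 - w]), target)
        if dist < best_dist:
            best_weight, best_dist = w, dist
    return best_weight, best_dist


# Hypothetical domain vectors over three meta-domains.
dataset_a = [0.8, 0.2, 0.0]
dataset_b = [0.0, 0.2, 0.8]
validation = [0.2, 0.2, 0.6]
print(best_two_way_mixture(dataset_a, dataset_b, validation))
```

No model is trained at any point in this search, which is what makes the approach cheap; whether a closer match actually yields lower validation loss is exactly what the Distribution Alignment Assumption posits.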
Details
Motivation: Existing methods for optimizing the pretraining data mixture of language models are computationally expensive; Domain2Vec aims to cut this cost and improve performance through meta-domain decomposition and a distribution alignment assumption. Method: Domain2Vec decomposes each dataset into a linear combination of meta-domains, uses a classifier to produce domain vectors, and optimizes the data mixture under the Distribution Alignment Assumption. Result: Domain2Vec reaches the same validation loss with only 51.5% of the computation and, under the same compute budget, improves downstream task performance by 2.83% on average. Conclusion: Domain2Vec provides an efficient and scalable way to optimize data mixtures for language model pretraining. Abstract: We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
[160] ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Kangwei Liu,Siyuan Cheng,Bozhong Tian,Xiaozhuan Liang,Yuyang Yin,Meng Han,Ningyu Zhang,Bryan Hooi,Xi Chen,Shumin Deng
Main category: cs.CL
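TL;DR: The paper presents a benchmark dataset for Chinese harmful content detection and combines expert knowledge rules with large language models to raise the detection performance of smaller models.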
Details
Motivation: Existing harmful content detection resources focus mainly on English, while Chinese datasets are scarce and limited in scope, so a comprehensive, professionally built benchmark is needed. Method: A real-world dataset covering six categories of harmful content is constructed and expert knowledge rules are extracted from the annotation process; a knowledge-augmented baseline combines human-annotated rules with knowledge from large language models. Result: The proposed method brings smaller models close to the performance of state-of-the-art LLMs. Conclusion: The benchmark and the knowledge-augmented method provide effective tools for Chinese harmful content detection and fill a resource gap. Abstract: Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
[161] AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Yixin Ou,Yujie Luo,Jingsheng Zheng,Lanning Wei,Shuofei Qiao,Jintian Zhang,Da Zheng,Huajun Chen,Ningyu Zhang
Main category: cs.CL
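TL;DR: AutoMind is an adaptive large language model (LLM) agent framework that uses an expert knowledge base, a tree search algorithm, and a dynamic code generation strategy to overcome the limits of existing frameworks on complex tasks, and it performs strongly on data science benchmarks.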
Details
Motivation: Existing LLM-driven data science agents rely on fixed workflows and coding strategies, so they suit only simple tasks and cannot handle complex, innovative ones. Method: AutoMind introduces three key advances: an expert knowledge base, a tree search algorithm, and a self-adaptive code generation strategy. Result: On data science benchmarks, AutoMind outperforms existing methods and shows good efficiency and robustness. Conclusion: AutoMind is an efficient and robust step toward fully automated data science. Abstract: Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.
[162] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang,Sang-Woo Lee,Nora Kassner,Daniela Gottesman,Sebastian Riedel,Mor Geva
Main category: cs.CL
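TL;DR: The study finds that reasoning models can identify unhelpful thoughts but struggle to recover from them, with larger models faring worse, and that better self-evaluation is needed to improve safety and reasoning.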