cs.CV

[1] LLM-Enabled Style and Content Regularization for Personalized Text-to-Image Generation

Anran Yu, Wei Feng, Yaochen Zhang, Xiang Li, Lei Meng, Lei Wu, Xiangxu Meng

Main category: cs.CV

TL;DR: Proposes a method that combines style refinement and content preservation strategies to improve the quality and consistency of personalized text-to-image generation.

Details Motivation: Existing methods that fine-tune the model suffer from insufficient stylization and inaccurate image content, mainly because textual controllability is reduced. Method: A style refinement strategy (optimizing style embeddings using the semantic information of visual reasoning prompts and reference images) and a content preservation strategy (maintaining the model's generalization ability). Result: Experiments show that the method performs well at generating consistent, personalized text-to-image outputs. Conclusion: The proposed strategies effectively balance stylization against content accuracy and improve generation quality. Abstract: Personalized text-to-image generation has rapidly advanced with the emergence of Stable Diffusion. Existing methods, which typically fine-tune models using embedded identifiers, often struggle with insufficient stylization and inaccurate image content due to reduced textual controllability. In this paper, we propose style refinement and content preservation strategies. The style refinement strategy leverages the semantic information of visual reasoning prompts and reference images to optimize style embeddings, allowing a more precise and consistent representation of style information. The content preservation strategy addresses the content bias problem by preserving the model's generalization capabilities, ensuring enhanced textual controllability without compromising stylization. Experimental results verify that our approach achieves superior performance in generating consistent and personalized text-to-image outputs.

[2] LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

Yuan-Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, David Acuna

Main category: cs.CV

TL;DR: Introduces the LongPerceptualThoughts dataset, which uses a three-stage synthesis framework to generate long chains of thought and improve reasoning performance on perceptual tasks.

Details Motivation: Explore the potential of long chains of thought for perceptual tasks and address the shortfall of existing models in system-2 reasoning. Method: A three-stage data synthesis framework: 1) generate verifiable multiple-choice questions from dense image descriptions; 2) extract simple chains of thought from vision-language models; 3) expand them into long chains of thought with frontier reasoning models. Result: An average gain of 3.4 points across 5 vision benchmarks, including 11.8 points on V$^*$ Bench; the text reasoning benchmark MMLU-Pro also improves by 2 points. Conclusion: Long chains of thought are useful not only for math and code tasks but can also substantially improve reasoning on perceptual tasks. Abstract: Recent reasoning models through test-time scaling have demonstrated that long chains of thought can unlock substantial performance boosts in hard reasoning tasks such as math and code. However, the benefit of such long thoughts for system-2 reasoning is relatively less explored in other domains such as perceptual tasks where shallower, system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts to elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4 points improvement over 5 vision-centric benchmarks, including +11.8 points on V$^*$ Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark, MMLU-Pro, by +2 points.

[3] Model-based Metric 3D Shape and Motion Reconstruction of Wild Bottlenose Dolphins in Drone-Shot Videos

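Daniele Baieri, Riccardo Cicciarella, Michael Krützen, Emanuele Rodolà, Silvia Zuffi

Main category: cs.CV

TL;DR: Proposes a model-based method that estimates the 3D shape and motion of wild dolphins from monocular video in order to assess their body condition.

Details Motivation: Aquatic animals are hard to observe in their natural underwater environment, so their 3D reconstruction has received little study, while assessing dolphin body condition requires accurate 3D data. Method: A model-based approach that incorporates a transmission model to handle water-induced occlusion, applied to video captured under different sea conditions. Result: Dolphin mass and volume are estimated and compared with a method based on manual 2D measurements. Conclusion: The method offers a workable approach to 3D reconstruction of aquatic animals and supports body condition assessment. Abstract: We address the problem of estimating the metric 3D shape and motion of wild dolphins from monocular video, with the aim of assessing their body condition. While considerable progress has been made in reconstructing 3D models of terrestrial quadrupeds, aquatic animals remain unexplored due to the difficulty of observing them in their natural underwater environment. To address this, we propose a model-based approach that incorporates a transmission model to account for water-induced occlusion. We apply our method to video captured under different sea conditions. We estimate mass and volume, and compare our results to a method based on manual 2D measurements.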

[4] Event2Vec: Processing neuromorphic events directly by representations in vector space

Wei Fang, Priyadarshini Panda

Main category: cs.CV

TL;DR: Proposes event2vec, a new method that converts the asynchronous, sparse, irregular events output by event cameras into vector representations, addressing their incompatibility with conventional computer vision and deep learning methods.

Details Motivation: Event cameras beat traditional cameras in temporal resolution, power efficiency, and dynamic range, but their event output is incompatible with mainstream methods, and existing solutions involve long preprocessing, lose temporal resolution, or cannot be computed in parallel. Method: Inspired by word2vec, the authors summarize the similarities between events and words and propose event2vec, which converts events into vector representations. Result: On ASL-DVS classification, event2vec shows higher parameter efficiency, accuracy, and speed than previous graph-, image-, and voxel-based representations. Conclusion: Beyond task performance, event2vec aligns event data with the natural language processing domain, opening the way to integrating events into large language and multimodal models. Abstract: Neuromorphic event cameras have overwhelming advantages in temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, event cameras output asynchronous, sparse, and irregular events, which are not compatible with mainstream computer vision and deep learning methods. Various methods have been proposed to solve this issue but at the cost of long preprocessing procedures, losing temporal resolution, or being incompatible with massively parallel computation. Inspired by the great success of word to vector, we summarize the similarities between words and events, then propose the first event to vector (event2vec) representation. We validate event2vec on classifying the ASL-DVS dataset, showing better parameter efficiency, accuracy, and speed than previous graph/image/voxel-based representations. Beyond task performance, the most attractive advantage of event2vec is that it aligns events to the domain of natural language processing, showing the promising prospect of integrating events into large language and multimodal models. Our codes, models, and training logs are available at https://github.com/fangwei123456/event2vec.
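The word/event analogy in the abstract above can be pictured with a toy lookup table. This is only a loose illustration of the analogy, not the authors' architecture: the sensor size, the embedding width, the `event_token` indexing scheme, and the random table are all hypothetical, and a real model would learn the table and also encode timestamps.

```python
import random

# Hypothetical sensor size and embedding width (assumptions for illustration).
WIDTH, HEIGHT, DIM = 4, 3, 8


def event_token(x, y, polarity):
    """Map an event's (x, y, polarity) to an integer index, like a word id."""
    return (polarity * HEIGHT + y) * WIDTH + x


# A fixed random table stands in for a learned embedding matrix.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(DIM)]
         for _ in range(2 * HEIGHT * WIDTH)]


def embed_events(events):
    """Turn (x, y, polarity, timestamp) events into (timestamp, vector) pairs."""
    return [(t, table[event_token(x, y, p)]) for (x, y, p, t) in events]


# Three toy events; the first and third share a pixel and polarity.
events = [(0, 0, 1, 10), (3, 2, 0, 12), (0, 0, 1, 15)]
embedded = embed_events(events)
```

Each event is handled independently here, so the lookup preserves the sparse, asynchronous order of the stream while producing fixed-width vectors.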

[5] Towards Understanding Camera Motions in Any Video

Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan

Main category: cs.CV

TL;DR: CameraBench is a large-scale dataset and benchmark for assessing and improving camera motion understanding, with about 3,000 diverse videos and a taxonomy of camera motions. The study finds that expertise and training significantly improve annotation accuracy, evaluates SfM and VLM models, and demonstrates applications of a generative VLM.

Details Motivation: Existing methods are limited in understanding camera motion, especially in distinguishing semantic from geometric motion, so a more comprehensive dataset and benchmark is needed. Method: Builds the CameraBench dataset with expert-annotated videos and a camera motion taxonomy, measures annotation performance through a human study, and evaluates SfM and VLM models. Result: SfM models struggle to capture semantic motions that depend on scene content, while VLMs perform poorly on geometric motions. A fine-tuned generative VLM combines semantic and geometric motion understanding. Conclusion: CameraBench's taxonomy, benchmark, and tutorials provide a foundation for future work toward understanding camera motion in any video. Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
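The zoom-in versus translate-forward distinction mentioned in the abstract can be checked with a small pinhole-camera sketch. It assumes an ideal pinhole model with the camera looking along +z; the `project` and `magnification` helpers and the two sample points are illustrative and do not come from CameraBench.

```python
def project(point, focal, cam_z=0.0):
    """Pinhole projection of a 3D point (x, y, z) for a camera at depth cam_z."""
    x, y, z = point
    depth = z - cam_z
    return (focal * x / depth, focal * y / depth)


def magnification(point, focal=1.0, cam_z=0.0):
    """How much the point's image x-coordinate grows relative to the base view."""
    base_u, _ = project(point, focal=1.0, cam_z=0.0)
    new_u, _ = project(point, focal=focal, cam_z=cam_z)
    return new_u / base_u


near = (1.0, 0.0, 2.0)   # a point 2 units in front of the camera
far = (1.0, 0.0, 10.0)   # a point 10 units in front of the camera

# Zoom-in: the focal length (an intrinsic) doubles; the camera stays put.
zoom_near = magnification(near, focal=2.0)
zoom_far = magnification(far, focal=2.0)

# Translate forward: the camera (an extrinsic) moves 1 unit along +z.
dolly_near = magnification(near, cam_z=1.0)
dolly_far = magnification(far, cam_z=1.0)
```

Under this model, zooming magnifies both points by the same factor regardless of depth, whereas moving forward magnifies the near point more than the far one. That depth-dependent parallax is the geometric cue separating the two motions.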

[6] Physics Driven Image Simulation from Commercial Satellite Imagery

Scott Sorensen, Wayne Treible, Robert Wagner, Andrew D. Gilliam, Todd Rovito, Joseph L. Mundy

Main category: cs.CV

TL;DR: Uses satellite imagery to automatically generate physically realistic scene simulations without lidar, improving efficiency and fidelity.

Details Motivation: Use physics-driven image simulation to go beyond the limits of typical rendering pipelines, generating realistic scenes automatically with less manual work. Method: Builds scene geometry from a Digital Surface Model (DSM), and uses satellite imagery to estimate materials and populate the scene with dynamic elements. Result: High-fidelity scene simulation that supports algorithm development and image processing from UV to LWIR (200 nm to 20 μm). Conclusion: The method provides an efficient, automated way to simulate scenes at new locations on Earth. Abstract: Physics driven image simulation allows for the modeling and creation of realistic imagery beyond what is afforded by typical rendering pipelines. We aim to automatically generate a physically realistic scene for simulation of a given region using satellite imagery to model the scene geometry, drive material estimates, and populate the scene with dynamic elements. We present automated techniques to utilize satellite imagery throughout the simulated scene to expedite scene construction and decrease manual overhead. Our technique does not use lidar, enabling simulations that could not be constructed previously. To develop a 3D scene, we model the various components of the real location, addressing the terrain, modelling man-made structures, and populating the scene with smaller elements such as vegetation and vehicles. To create the scene we begin with a Digital Surface Model, which serves as the basis for scene geometry, and allows us to reason about the real location in a common 3D frame of reference. These simulated scenes can provide increased fidelity with less manual intervention for novel locations on earth, and can facilitate algorithm development and processing pipelines for imagery ranging from UV to LWIR $(200nm-20\mu m)$.

[7] Plug-and-Play Versatile Compressed Video Enhancement

Huimin Zeng, Jiacheng Li, Zhiwei Xiong

Main category: cs.CV

TL;DR: Proposes a video enhancement framework that uses codec information to adaptively enhance compressed video and improve the robustness of downstream vision tasks.

Details Motivation: Video compression reduces file size but degrades visual quality, hurting downstream vision models. Method: A codec-aware enhancement framework consisting of a compression-aware adaptation (CAA) network and a bitstream-aware enhancement (BAE) network, which use codec information to adaptively enhance video quality. Result: Experiments show the framework outperforms existing methods in quality enhancement and in assisting downstream tasks. Conclusion: The framework works as a plug-and-play module that improves compressed-video quality and downstream task performance. Abstract: As a widely adopted technique in data transmission, video compression effectively reduces the size of files, making real-time cloud computing possible. However, it comes at the cost of visual quality, posing challenges to the robustness of downstream vision models. In this work, we present a versatile codec-aware enhancement framework that reuses codec information to adaptively enhance videos under different compression settings, assisting various downstream vision tasks without introducing a computation bottleneck. Specifically, the proposed codec-aware framework consists of a compression-aware adaptation (CAA) network that employs a hierarchical adaptation mechanism to estimate parameters of the frame-wise enhancement network, namely the bitstream-aware enhancement (BAE) network. The BAE network further leverages temporal and spatial priors embedded in the bitstream to effectively improve the quality of compressed input frames. Extensive experimental results demonstrate the superior quality enhancement performance of our framework over existing enhancement methods, as well as its versatility in assisting multiple downstream tasks on compressed videos as a plug-and-play module. Code and models are available at https://huimin-zeng.github.io/PnP-VCVE/.

[8] ICGM-FRAX: Iterative Cross Graph Matching for Hip Fracture Risk Assessment using Dual-energy X-ray Absorptiometry Images

Chen Zhao, Anjum Shaik, Joyce H. Keyak, Nancy E. Lane, Jeffrey D. Deng, Kuan-Jui Su, Qiuying Sha, Hui Shen, Hong-Wen Deng, Weihua Zhou

Main category: cs.CV

TL;DR: Proposes ICGM-FRAX, an iterative cross graph matching method based on dual-energy X-ray absorptiometry (DXA) images, for predicting hip fracture risk. The method converts DXA images into graphs and matches them against template graphs, achieving high sensitivity.

Details Motivation: Hip fractures seriously affect the health of older adults, and early, accurate identification of high-risk individuals is essential for intervention. Method: DXA images are segmented into multiple regions of interest (RoIs), radiomic features are extracted to build a graph, and fracture risk is assessed by iteratively matching the test graph against template graphs. Result: On 547 subjects, ICGM-FRAX reached a sensitivity of 0.9869. Conclusion: ICGM-FRAX is an effective hip fracture risk prediction method with potential for clinical use. Abstract: Hip fractures represent a major health concern, particularly among the elderly, often leading to decreased mobility and increased mortality. Early and accurate detection of at-risk individuals is crucial for effective intervention. In this study, we propose Iterative Cross Graph Matching for Hip Fracture Risk Assessment (ICGM-FRAX), a novel approach for predicting hip fractures using Dual-energy X-ray Absorptiometry (DXA) images. ICGM-FRAX involves iteratively comparing a test (subject) graph with multiple template graphs representing the characteristics of hip fracture subjects to assess the similarity and accurately predict hip fracture risk. These graphs are obtained as follows. The DXA images are separated into multiple regions of interest (RoIs), such as the femoral head, shaft, and lesser trochanter. Radiomic features are then calculated for each RoI, with the central coordinates used as nodes in a graph. The connectivity between nodes is established according to the Euclidean distance between these coordinates. This process transforms each DXA image into a graph, where each node represents a RoI, and edges derived from the centroids of RoIs capture the spatial relationships between them. If the test graph closely matches a set of template graphs representing subjects with incident hip fractures, it is classified as indicating high hip fracture risk. We evaluated our method using 547 subjects from the UK Biobank dataset, and experimental results show that ICGM-FRAX achieved a sensitivity of 0.9869, demonstrating high accuracy in predicting hip fractures.
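The graph construction described in the abstract (RoI centroids as nodes, connectivity from Euclidean distance) might be sketched as follows. The abstract does not say how distances become edges, so the distance threshold `max_dist` is an assumption, and the centroid coordinates and the counts passed to `sensitivity` are toy values rather than data from the study.

```python
import math


def build_roi_graph(centroids, max_dist):
    """Connect RoI centroids whose Euclidean distance is at most max_dist.

    The threshold rule is an assumption; the abstract only states that
    connectivity follows the Euclidean distance between centroids.
    """
    names = sorted(centroids)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(centroids[a], centroids[b])
            if d <= max_dist:
                edges[(a, b)] = d
    return edges


def sensitivity(true_positives, false_negatives):
    """Sensitivity (recall): the share of actual fracture cases that are flagged."""
    return true_positives / (true_positives + false_negatives)


# Toy centroid coordinates for three RoIs named in the abstract.
centroids = {
    "femoral_head": (0.0, 0.0),
    "lesser_trochanter": (3.0, 4.0),
    "shaft": (3.0, 12.0),
}
edges = build_roi_graph(centroids, max_dist=9.0)
toy_sensitivity = sensitivity(true_positives=9, false_negatives=1)
```

With these toy coordinates the femoral head connects to the lesser trochanter and the lesser trochanter to the shaft, while the head-to-shaft pair exceeds the threshold. In the paper's setting each node would also carry the radiomic features of its RoI.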

[9] MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World

Ankit Dhiman, Manan Shah, R Venkatesh Babu

Main category: cs.CV

TL;DR: Proposes a method for improving how diffusion models generate realistic mirror reflections, using data augmentation and multi-stage training.

Details Motivation: Existing diffusion models struggle to follow physical laws when generating mirror reflections, especially when object position and orientation vary. Method: Adds synthetic data augmentations (random positioning, rotation, and object pairing) and a three-stage training curriculum to develop the MirrorFusion 2.0 model. Result: With the augmented dataset and multi-stage training, the model performs better in complex scenes, supporting the approach. Conclusion: The method improves the realism and generalization of generated mirror reflections, but further work is needed to adapt it to real-world scenes. Abstract: Diffusion models have become central to various image editing tasks, yet they often fail to fully adhere to physical laws, particularly with effects like shadows, reflections, and occlusions. In this work, we address the challenge of generating photorealistic mirror reflections using diffusion-based generative models. Despite extensive training data, existing diffusion models frequently overlook the nuanced details crucial to authentic mirror reflections. Recent approaches have attempted to resolve this by creating synthetic datasets and framing reflection generation as an inpainting task; however, they struggle to generalize across different object orientations and positions relative to the mirror. Our method overcomes these limitations by introducing key augmentations into the synthetic data pipeline: (1) random object positioning, (2) randomized rotations, and (3) grounding of objects, significantly enhancing generalization across poses and placements. To further address spatial relationships and occlusions in scenes with multiple objects, we implement a strategy to pair objects during dataset generation, resulting in a dataset robust enough to handle these complex scenarios. Achieving generalization to real-world scenes remains a challenge, so we introduce a three-stage training curriculum to develop the MirrorFusion 2.0 model to improve real-world performance. We provide extensive qualitative and quantitative evaluations to support our approach. The project page is available at: https://mirror-verse.github.io/.

[10] Context Aware Grounded Teacher for Source Free Object Detection

Tajamul Ashraf, Rajes Manna, Partha Sarathi Purkayastha, Tavaheed Tariq, Janibul Bashir

Main category: cs.CV

TL;DR: Proposes Grounded Teacher (GT), a framework that addresses context bias in source-free object detection (SFOD), using a relational context module and an expert foundational branch to improve performance.

Details Motivation: In medical imaging, when source data is unavailable, context imbalance and domain shift bias the teacher model in existing methods, which degrades the student model. Method: A relational context module models contextual relationships, an expert foundational branch supervises the student model, and augmentations applied to closely related classes reduce bias. Result: Experiments on three medical datasets confirm that GT reduces context bias and improves performance. Conclusion: Through relational modeling and expert supervision, GT improves model performance in the SFOD setting; the associated resources are open source. Abstract: We focus on the Source Free Object Detection (SFOD) problem, where source data is unavailable during adaptation, and the model must adapt to the unlabeled target domain. In medical imaging, several approaches have leveraged a semi-supervised student-teacher architecture to bridge domain discrepancy. Context imbalance in labeled training data and significant domain shifts between domains can lead to biased teacher models that produce inaccurate pseudolabels, degrading the student model's performance and causing a mode collapse. Class imbalance, particularly when one class significantly outnumbers another, leads to contextual bias. To tackle the problem of context bias and the significant performance drop of the student model in the SFOD setting, we introduce Grounded Teacher (GT) as a standard framework. In this study, we model contextual relationships using a dedicated relational context module and leverage it to mitigate inherent biases in the model. This approach enables us to apply augmentations to closely related classes, across and within domains, enhancing the performance of underrepresented classes while keeping the effect on dominant classes minimal. We further improve the quality of predictions by implementing an expert foundational branch to supervise the student model. We validate the effectiveness of our approach in mitigating context bias under the SFOD setting through experiments on three medical datasets supported by comprehensive ablation studies. All relevant resources, including preprocessed data, trained model weights, and code, are publicly available at https://github.com/Tajamul21/Grounded_Teacher.

[11] IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin

Main category: cs.CV

TL;DR: IV-Bench is the first benchmark focused on the role of image context in video understanding, with 967 videos and 2,585 image-text queries across 13 tasks. Current MLLMs perform poorly on it, reaching at most 28.9% accuracy.

Details Motivation: Existing MLLM evaluation frameworks overlook the importance of image context in video understanding; IV-Bench aims to fill that gap. Method: Proposes the IV-Bench benchmark, with diverse videos and image-text query tasks, and evaluates open-source and closed-source MLLMs. Result: Current MLLMs perform poorly on image-grounded video understanding, with at most 28.9% accuracy; key factors include inference pattern, frame number, and resolution. Conclusion: IV-Bench exposes the weaknesses of MLLMs in image-grounded video understanding and points to directions for future research. Abstract: Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video Perception and Reasoning, merely achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our codes and data are released in https://github.com/multimodal-art-projection/IV-Bench.

[12] Manifold Induced Biases for Zero-shot and Few-shot Detection of Generated Images

Jonathan Brokman, Amit Giloni, Omer Hofman, Roman Vainshtein, Hisashi Kojima, Guy Gilboa

Main category: cs.CV

TL;DR: Proposes a zero-shot and few-shot method for detecting generated images based on probability manifold analysis, addressing the lack of theoretical grounding and limited performance of existing methods.

Details Motivation: Distinguishing real from AI-generated images is increasingly urgent, but existing zero-shot and few-shot methods lack theoretical grounding and have limited performance. Method: Analyzes the biases of the probability manifold of a pre-trained diffusion model, uses the score function to approximate curvature, gradient, and bias, and extends to the few-shot setting. Result: Experiments across 20 generative models show the method outperforms existing approaches in both zero-shot and few-shot settings. Conclusion: Through manifold analysis, the work advances both the theoretical understanding and the practical use of generated-content biases. Abstract: Distinguishing between real and AI-generated images, commonly referred to as 'image detection', presents a timely and significant challenge. Despite extensive research in the (semi-)supervised regime, zero-shot and few-shot solutions have only recently emerged as promising alternatives. Their main advantage is in alleviating the ongoing data maintenance, which quickly becomes outdated due to advances in generative technologies. We identify two main gaps: (1) a lack of theoretical grounding for the methods, and (2) significant room for performance improvements in zero-shot and few-shot regimes. Our approach is founded on understanding and quantifying the biases inherent in generated content, where we use these quantities as criteria for characterizing generated images. Specifically, we explore the biases of the implicit probability manifold, captured by a pre-trained diffusion model. Through score-function analysis, we approximate the curvature, gradient, and bias towards points on the probability manifold, establishing criteria for detection in the zero-shot regime. We further extend our contribution to the few-shot setting by employing a mixture-of-experts methodology. Empirical results across 20 generative models demonstrate that our method outperforms current approaches in both zero-shot and few-shot settings. This work advances the theoretical understanding and practical usage of generated content biases through the lens of manifold analysis.

[13] Emergence and Evolution of Interpretable Concepts in Diffusion Models

Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi

Main category: cs.CV

TL;DR: Uses sparse autoencoders (SAEs) to probe the internals of text-to-image diffusion models, finds interpretable concepts, and shows that they causally affect generation and can be used to steer it.

Details Motivation: Diffusion models excel at text-to-image generation, but their inner workings remain opaque. Revealing how they work with mechanistic interpretability techniques such as SAEs helps in understanding and controlling generation. Method: Uses the SAE framework to analyze the activations of a popular text-to-image diffusion model, identifies interpretable concepts, and verifies their causal effect with intervention techniques. Result: The activations contain interpretable concepts that can be used to predict and steer generation: composition can be controlled in the early stages, style adjusted in the middle stages, and only minor details changed in the final stages. Conclusion: SAEs offer a new lens on diffusion models, demonstrating that concepts are interpretable and controllable and laying groundwork for precise steering of generation. Abstract: Diffusion models have become the go-to method for text-to-image generation, producing high-quality images from noise through a process called reverse diffusion. Understanding the dynamics of the reverse diffusion process is crucial in steering the generation and achieving high sample quality. However, the inner workings of diffusion models are still largely a mystery due to their black-box nature and complex, multi-step generation process. Mechanistic Interpretability (MI) techniques, such as Sparse Autoencoders (SAEs), aim at uncovering the operating principles of models through granular analysis of their internal representations. These MI techniques have been successful in understanding and steering the behavior of large language models at scale. However, the great potential of SAEs has not yet been applied toward gaining insight into the intricate generative process of diffusion models. In this work, we leverage the SAE framework to probe the inner workings of a popular text-to-image diffusion model, and uncover a variety of human-interpretable concepts in its activations. Interestingly, we find that even before the first reverse diffusion step is completed, the final composition of the scene can be predicted surprisingly well by looking at the spatial distribution of activated concepts. Moreover, going beyond correlational analysis, we show that the discovered concepts have a causal effect on the model output and can be leveraged to steer the generative process.
We design intervention techniques aimed at manipulating image composition and style, and demonstrate that (1) in the early stages of diffusion, image composition can be effectively controlled, (2) in the middle stages of diffusion, image composition is finalized but stylistic interventions are effective, and (3) in the final stages of diffusion, only minor textural details are subject to change.

[14] CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

Main category: cs.CV

TL;DR: Introduces CAPTURe, a new task that tests how well vision-language models (VLMs) recognize and reason about occluded objects; existing models perform poorly under occlusion, while humans do very well.

Details Motivation: Occluded objects are common in real scenes, but current VLMs handle them poorly, which calls for testing and improvement. Method: Designs the CAPTURe task, with a real-image part (CAPTURe-real) and a synthetic part (CAPTURe-synthetic), and evaluates four VLMs. Result: Models struggle on both occluded and unoccluded patterns and do worse with occlusion, while humans do very well. Providing the locations of occluded objects improves model performance. Conclusion: VLMs fall short in occlusion reasoning and counting and need further improvement. Abstract: Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe.
We also find that providing auxiliary information about occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion and from difficulty counting in images.
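The kind of inference CAPTURe asks for can be illustrated on a toy grid. This is a minimal sketch, not the benchmark's data format or metric: it assumes objects sit on a regular rows-by-columns grid and that the occluder never hides an entire row or column, and `amodal_grid_count` is a hypothetical helper.

```python
def amodal_grid_count(visible):
    """Infer the full count of a regular grid pattern from its visible cells.

    Assumes every row and every column still has at least one visible object,
    so the grid extent can be read off the visible positions.
    """
    rows = {r for r, _ in visible}
    cols = {c for _, c in visible}
    return len(rows) * len(cols)


# A 3 x 4 grid of objects with a 2 x 2 block hidden by an occluder.
full_pattern = {(r, c) for r in range(3) for c in range(4)}
occluded = {(1, 1), (1, 2), (2, 1), (2, 2)}
visible = full_pattern - occluded

visible_count = len(visible)              # what naive counting would report
inferred_total = amodal_grid_count(visible)  # what amodal counting should report
```

Counting only what is visible undercounts the pattern, whereas continuing the grid behind the occluder recovers the full total. That gap is what the task measures.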

[15] InstaRevive: One-Step Image Enhancement via Dynamic Score Matching

Yixuan Zhu, Haolin Wang, Ao Li, Wenliang Zhao, Yansong Tang, Jingxuan Niu, Lei Chen, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: InstaRevive is an image enhancement framework based on diffusion distillation that uses dynamic control and textual prompts to reduce sampling steps and improve efficiency.

Details Motivation: Address the high computational cost and many sampling steps of existing diffusion methods. Method: A diffusion distillation pipeline with dynamic control, with textual prompts added to strengthen generation. Result: Efficient across a range of tasks and datasets, producing high-quality images. Conclusion: InstaRevive delivers efficiency and high-quality results in image enhancement. Abstract: Image enhancement finds wide-ranging applications in real-world scenarios due to complex environments and the inherent limitations of imaging devices. Recent diffusion-based methods yield promising outcomes but necessitate prolonged and computationally intensive iterative sampling. In response, we propose InstaRevive, a straightforward yet powerful image enhancement framework that employs score-based diffusion distillation to harness potent generative capability and minimize the sampling steps. To fully exploit the potential of the pre-trained diffusion model, we devise a practical and effective diffusion distillation pipeline using dynamic control to address inaccuracies in updating direction during score matching. Our control strategy enables a dynamic diffusing scope, facilitating precise learning of denoising trajectories within the diffusion model and ensuring accurate distribution matching gradients during training. Additionally, to enrich guidance for the generative power, we incorporate textual prompts via image captioning as auxiliary conditions, fostering further exploration of the diffusion model. Extensive experiments substantiate the efficacy of our framework across a diverse array of challenging tasks and datasets, unveiling the compelling efficacy and efficiency of InstaRevive in delivering high-quality and visually appealing results. Code is available at https://github.com/EternalEvan/InstaRevive.

Shichen Li, Chenhui Shao

Main category: cs.CV

TL;DR: Proposes a multi-modal data fusion framework for real-time forecasting of food drying readiness, substantially improving prediction accuracy and efficiency.

Details Motivation: Real-time forecasting of food drying is essential for energy savings, productivity, and product quality, but existing methods struggle with dynamic and limited data. Method: An end-to-end multi-modal data fusion framework that combines video data with process parameters, using an encoder-decoder architecture with a transformer-based decoder. Result: In sugar cookie drying experiments the model's average prediction error is only 15 seconds, outperforming existing methods by 65.69%. Conclusion: The model performs well in accuracy, size, and efficiency and is applicable to a range of industrial multi-modal fusion tasks. Abstract: Food drying is essential for food production, extending shelf life, and reducing transportation costs. Accurate real-time forecasting of drying readiness is crucial for minimizing energy consumption, improving productivity, and ensuring product quality. However, this remains challenging due to the dynamic nature of drying, limited data availability, and the lack of effective predictive analytical methods. To address this gap, we propose an end-to-end multi-modal data fusion framework that integrates in-situ video data with process parameters for real-time food drying readiness forecasting. Our approach leverages a new encoder-decoder architecture with modality-specific encoders and a transformer-based decoder to effectively extract features while preserving the unique structure of each modality. We apply our approach to sugar cookie drying, where time-to-ready is predicted at each timestamp. Experimental results demonstrate that our model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%. Additionally, our model balances prediction accuracy, model size, and computational efficiency, making it well-suited for heterogeneous industrial datasets. The proposed model is extensible to various other industrial modality fusion tasks for online decision-making.

[17] SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking

Yunfeng Li, Bo Wang, Jiahao Wan, Xueyi Wu, Ye Li

Main category: cs.CV

TL;DR: Proposes SonarT165, the first large-scale benchmark for underwater acoustic object tracking, and STFTrack, an efficient framework with multi-view template fusion and optimal trajectory correction modules that significantly improves performance.

Details Motivation: Underwater observation systems rely on sonar when visibility is insufficient, but the lack of a unified evaluation benchmark limits the value of existing methods. Method: Proposes the SonarT165 benchmark and the STFTrack framework, which includes a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM) and adds acoustic image enhancement and a frequency enhancement module. Result: STFTrack achieves state-of-the-art performance on the SonarT165 benchmark. Conclusion: SonarT165 and STFTrack provide effective tools for underwater acoustic object tracking and address limitations of existing methods. Abstract: Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model feature, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark.
The code is available at https://github.com/LiYunfengLYF/SonarT165.

[18] HS-Mamba: Full-Field Interaction Multi-Groups Mamba for Hyperspectral Image Classification

Hongxing Peng,Kang Lin,Huanai Liu

Main category: cs.CV

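TL;DR: The paper proposes HS-Mamba, a Mamba-based framework for hyperspectral image classification that combines local and global features to significantly improve classification accuracy.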

Details Motivation: Hyperspectral image classification is a hot topic in remote sensing, but the high dimensionality and feature inlining of hyperspectral data make it challenging to apply the Mamba architecture. Method: HS-Mamba uses a dual-channel spatial-spectral encoder module and a lightweight global inline attention branch to combine local and global features. Result: HS-Mamba outperforms existing state-of-the-art methods on four benchmark datasets. Conclusion: By fusing local and global features, HS-Mamba achieves high-precision hyperspectral image classification. Abstract: Hyperspectral image (HSI) classification has been one of the hot topics in remote sensing. Recently, the Mamba architecture based on selective state-space models (S6) has demonstrated great advantages in long sequence modeling. However, the unique properties of hyperspectral data, such as high dimensionality and feature inlining, pose challenges to the application of Mamba to HSI classification. To compensate for these shortcomings, we propose a full-field interaction multi-groups Mamba framework (HS-Mamba), which adopts a strategy that differs from pixel-patch-based and whole-image-based approaches but combines the advantages of both. The patches cut from the whole image are sent to multi-groups Mamba, combined with positional information to perceive local inline features in the spatial and spectral domains, and the whole image is sent to a lightweight attention module to enhance the global feature representation ability. Specifically, HS-Mamba consists of a dual-channel spatial-spectral encoder (DCSS-encoder) module and a lightweight global inline attention (LGI-Att) branch. The DCSS-encoder module uses multiple groups of Mamba to decouple and model the local features of dual-channel sequences with non-overlapping patches. The LGI-Att branch uses a lightweight compressed and extended attention module to perceive the global features of the spatial and spectral domains of the unsegmented whole image. By fusing local and global features, high-precision classification of hyperspectral images is achieved. Extensive experiments demonstrate the superiority of the proposed HS-Mamba, which outperforms state-of-the-art methods on four benchmark HSI datasets.

[19] AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization

Jinda Lu,Jinghan Li,Yuan Gao,Junkang Wu,Jiancan Wu,Xiang Wang,Xiangnan He

Main category: cs.CV

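TL;DR: AdaViP uses vision-enhanced preference optimization, combining vision- and language-based preferences, to substantially reduce hallucination in multimodal large language models.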

Details Motivation: Existing methods focus mainly on language preferences and neglect the importance of visual context. Method: Proposes AdaViP, which consists of vision-based preference pair construction and adaptive preference optimization. Result: AdaViP-7B reduces response-level and mentioned-level hallucination on Object HalBench by 93.7% and 96.4%, respectively. Conclusion: AdaViP performs strongly in multimodal alignment and significantly outperforms existing methods. Abstract: Preference alignment through Direct Preference Optimization (DPO) has demonstrated significant effectiveness in aligning multimodal large language models (MLLMs) with human preferences. However, existing methods focus primarily on language preferences while neglecting the critical visual context. In this paper, we propose an Adaptive Vision-enhanced Preference optimization (AdaViP) that addresses these limitations through two key innovations: (1) vision-based preference pair construction, which integrates multiple visual foundation models to strategically remove key visual elements from the image, enhancing MLLMs' sensitivity to visual details; and (2) adaptive preference optimization that dynamically balances vision- and language-based preferences for more accurate alignment. Extensive evaluations across different benchmarks demonstrate the effectiveness of our approach. Notably, our AdaViP-7B achieves 93.7% and 96.4% reductions in response-level and mentioned-level hallucination respectively on the Object HalBench, significantly outperforming current state-of-the-art methods.
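The abstract builds on DPO and says vision- and language-based preferences are balanced dynamically, but it does not give the balancing rule. The sketch below shows only the standard DPO loss for one preference pair, plus a fixed-weight mix of a vision-pair loss and a language-pair loss. The function `combined_loss` and its weight `vision_weight` are hypothetical placeholders for whatever adaptive scheme the paper actually uses.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model for the chosen and rejected samples.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written so that exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def combined_loss(vision_loss, language_loss, vision_weight):
    """Hypothetical fixed-weight mix; the paper's adaptive rule is not stated here."""
    return vision_weight * vision_loss + (1.0 - vision_weight) * language_loss


# Toy log-probabilities (made up): the policy prefers the chosen sample more
# strongly than the reference does, so the loss falls below log(2).
language = dpo_loss(-1.0, -5.0, -2.0, -2.0)
vision = dpo_loss(-1.5, -4.0, -2.0, -2.5)
print(combined_loss(vision, language, vision_weight=0.5))
```

A loss of log(2) corresponds to a policy that matches the reference on both samples; any lower value means the policy already separates the pair in the preferred direction.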

[20] FaceInsight: A Multimodal Large Language Model for Face Perception

Jingzhi Li,Changjiang Luo,Ruoyu Chen,Hua Zhang,Wenqi Ren,Jianhou Gan,Xiaochun Cao

Main category: cs.CV

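TL;DR: FaceInsight is a versatile face perception MLLM that improves performance on face tasks through visual-textual alignment and an auxiliary perceptual modality.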

Details Motivation: General-purpose MLLMs perform poorly on face perception tasks; FaceInsight aims to fill this gap. Method: Combines visual-textual alignment of facial knowledge with face segmentation maps to enhance semantic understanding. Result: Across three face perception tasks, FaceInsight outperforms nine compared MLLMs. Conclusion: FaceInsight's multimodal approach significantly improves face perception. Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in understanding general visual content. However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, a versatile face perception MLLM that provides fine-grained facial information. Our approach introduces visual-textual alignment of facial knowledge to model both uncertain dependencies and deterministic relationships among facial information, mitigating the limitations of language-driven reasoning. Additionally, we incorporate face segmentation maps as an auxiliary perceptual modality, enriching the visual input with localized structural cues to enhance semantic understanding. Comprehensive experiments and analyses across three face perception tasks demonstrate that FaceInsight consistently outperforms nine compared MLLMs under both training-free and fine-tuned settings.

[21] ZeroSlide: Is Zero-Shot Classification Adequate for Lifelong Learning in Whole-Slide Image Analysis in the Era of Pathology Vision-Language Foundation Models?

Doanh C. Bui,Hoai Luan Pham,Vu Trung Duong Le,Tuan Hai Vu,Van Duy Tran,Yasuhiko Nakashima

Main category: cs.CV

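TL;DR: Compares conventional continual-learning methods with vision-language zero-shot classification for lifelong learning on whole-slide images (WSIs).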

Details Motivation: Addresses the practical problem of multi-task lifelong learning on WSIs without training a new model for every new task. Method: Compares conventional continual-learning strategies with vision-language zero-shot classification. Result: The experiments examine whether vision-language models alone suffice or whether continual-learning strategies need further study. Conclusion: This is the first comparison of the two approaches and offers a new perspective on lifelong WSI learning. Abstract: Lifelong learning for whole slide images (WSIs) poses the challenge of training a unified model to perform multiple WSI-related tasks, such as cancer subtyping and tumor classification, in a distributed, continual fashion. This is a practical and applicable problem in clinics and hospitals, as WSIs are large and costly to store, process, and transfer. Training new models whenever new tasks are defined is time-consuming. Recent work has applied regularization- and rehearsal-based methods to this setting. However, the rise of vision-language foundation models that align diagnostic text with pathology images raises the question: are these models alone sufficient for lifelong WSI learning using zero-shot classification, or is further investigation into continual learning strategies needed to improve performance? To our knowledge, this is the first study to compare conventional continual-learning approaches with vision-language zero-shot classification for WSIs. Our source code and experimental results will be available soon.
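For readers unfamiliar with zero-shot classification, a minimal sketch follows. It assumes that a vision-language model has already produced one embedding for the slide and one text embedding per class prompt; the toy vectors and class names below are made up, and no real pathology model or API is used. Because no weights are updated, a new task only adds new class prompts, which is why this approach is a natural baseline against continual-learning methods.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class whose text embedding is most similar to the image embedding."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine(image_embedding, class_text_embeddings[name]),
    )


# Toy embeddings (assumed to come from paired image and text encoders).
classes = {"tumor": [1.0, 0.0, 0.2], "normal": [0.0, 1.0, 0.1]}
slide = [0.9, 0.1, 0.3]
print(zero_shot_classify(slide, classes))

# A later task adds a class without retraining anything.
classes["necrosis"] = [0.1, 0.2, 1.0]
print(zero_shot_classify([0.0, 0.1, 0.9], classes))
```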

[22] AffordanceSAM: Segment Anything Once More in Affordance Grounding

Dengyang Jiang,Mengmeng Wang,Teli Ma,Hengzhuang Li,Yong liu,Guang Dai,Lei Zhang

Main category: cs.CV

TL;DR: AffordanceSAM extends SAM's segmentation ability to affordance grounding, improving generalization to unseen objects and affordance functions.

Details Motivation: Existing affordance grounding models generalize poorly and fall short of real-world requirements. Method: Proposes AffordanceSAM, which combines an affordance-adaption module with a coarse-to-fine training recipe to adapt SAM's segmentation output to affordance grounding. Result: The method performs strongly on the AGD20K benchmark and can handle novel objects and affordance functions. Conclusion: AffordanceSAM significantly improves generalization in affordance grounding and suits complex scenes. Abstract: Improving the generalization ability of an affordance grounding model to recognize regions for unseen objects and affordance functions is crucial for real-world application. However, current models are still far from meeting this standard. To address this problem, we introduce AffordanceSAM, an effective approach that extends SAM's generalization capacity to the domain of affordance grounding. To thoroughly transfer SAM's robust segmentation performance to affordance, we first propose an affordance-adaption module that adapts SAM's segmentation output to the specific functional regions required for affordance grounding. We also design a coarse-to-fine training recipe that first makes SAM coarsely aware of affordance objects and actions and then enables it to generate fine affordance heatmaps. Both quantitative and qualitative experiments show the strong generalization capacity of our AffordanceSAM, which not only surpasses previous methods on the AGD20K benchmark but also shows evidence of handling the task with novel objects and affordance functions.

[23] DiTPainter: Efficient Video Inpainting with Diffusion Transformers

Xian Wu,Chang Liu

Main category: cs.CV

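TL;DR: DiTPainter is a video inpainting model based on the Diffusion Transformer (DiT); its efficient, purpose-built transformer network addresses the blur and inconsistency of existing methods without relying on large pretrained models.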

Details Motivation: Existing video inpainting algorithms propagate pixels via optical flow and produce blur and inconsistencies when the flow is inaccurate or the masks are large, while pretrained DiT models have too many parameters to apply directly to video inpainting. Method: Proposes DiTPainter, an end-to-end video inpainting model based on DiT that uses an efficient transformer network designed for video inpainting and is trained from scratch rather than initialized from a pretrained model. Result: DiTPainter handles videos of arbitrary length, applies to video decaptioning and video completion, and outperforms existing methods in quality and spatial-temporal consistency. Conclusion: Through efficient design and training from scratch, DiTPainter overcomes the limitations of existing methods and offers a better solution for video inpainting. Abstract: Many existing video inpainting algorithms utilize optical flows to construct the corresponding maps and then propagate pixels from adjacent frames to missing areas by mapping. Despite the effectiveness of the propagation mechanism, they might encounter blurriness and inconsistencies when dealing with inaccurate optical flows or large masks. Recently, Diffusion Transformer (DiT) has emerged as a revolutionary technique for video generation tasks. However, pretrained DiT models for video generation all contain a large number of parameters, which makes them very time-consuming to apply to video inpainting tasks. In this paper, we present DiTPainter, an end-to-end video inpainting model based on Diffusion Transformer (DiT). DiTPainter uses an efficient transformer network designed for video inpainting, which is trained from scratch instead of initializing from any large pretrained models. DiTPainter can address videos with arbitrary lengths and can be applied to video decaptioning and video completion tasks with an acceptable time cost. Experiments show that DiTPainter outperforms existing video inpainting algorithms with higher quality and better spatial-temporal consistency.

[24] Motion-Enhanced Nonlocal Similarity Implicit Neural Representation for Infrared Dim and Small Target Detection

Pei Liu,Yisi Luo,Wenzhen Wang,Xiangyong Cao

Main category: cs.CV

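TL;DR: Proposes an implicit neural representation framework based on motion enhancement and nonlocal similarity for infrared dim and small target detection, addressing dynamic backgrounds and weak target signals.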

Details Motivation: Traditional low-rank plus sparse models struggle to capture dynamic backgrounds and global spatial-temporal correlations, which leads to background leakage or target loss. Method: Combines optical-flow motion estimation with multi-frame fusion to enhance motion saliency, uses nonlocal similarity to build low-rank patch tensors, and proposes a tensor decomposition-based implicit neural representation model. Result: Experiments show that the method effectively separates dim and small targets from complex backgrounds and surpasses existing methods in detection accuracy and robustness. Conclusion: The proposed framework performs well in infrared dim and small target detection and addresses the challenges of dynamic backgrounds and weak target signals. Abstract: Infrared dim and small target detection presents a significant challenge due to dynamic multi-frame scenarios and weak target signatures in the infrared modality. Traditional low-rank plus sparse models often fail to capture dynamic backgrounds and global spatial-temporal correlations, which results in background leakage or target loss. In this paper, we propose a novel motion-enhanced nonlocal similarity implicit neural representation (INR) framework to address these challenges. We first integrate motion estimation via optical flow to capture subtle target movements, and propose multi-frame fusion to enhance motion saliency. Second, we leverage nonlocal similarity to construct patch tensors with strong low-rank properties, and propose an innovative tensor decomposition-based INR model to represent the nonlocal patch tensor, effectively encoding both the nonlocal low-rankness and spatial-temporal correlations of background through continuous neural representations. An alternating direction method of multipliers is developed for the nonlocal INR model, which enjoys theoretical fixed-point convergence. Experimental results show that our approach robustly separates dim targets from complex infrared backgrounds, outperforming state-of-the-art methods in detection accuracy and robustness.

[25] DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining

Wei Zhuo,Zhiyue Tang,Wufeng Xue,Hao Ding,Linlin Shen

Main category: cs.CV

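TL;DR: The paper proposes FS-DINO, a few-shot semantic segmentation method that combines knowledge from DINOv2 and SAM, achieving efficient segmentation with a lightweight segmenter and cross-model distillation.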

Details Motivation: Addresses data scarcity in few-shot semantic segmentation and explores how to exploit the complementary capabilities of DINOv2 and SAM in one unified model. Method: Proposes FS-DINO, which uses only DINOv2's encoder and a lightweight segmenter built from a bottleneck adapter, a meta-visual prompt generator, and a decoder, and integrates SAM's knowledge through cross-model distillation. Result: Experiments on COCO-20i, PASCAL-5i, and FSS-1000 confirm the method's effectiveness and superiority. Conclusion: FS-DINO successfully integrates knowledge from both foundation models and provides an efficient solution for few-shot semantic segmentation. Abstract: Few-shot semantic segmentation has gained increasing interest due to its generalization capability, i.e., segmenting pixels of novel classes requiring only a few annotated images. Prior work has focused on meta-learning for support-query matching, with extensive development in both prototype-based and aggregation-based methods. To address data scarcity, recent approaches have turned to foundation models to enhance representation transferability for novel class segmentation. Among them, a hybrid dual-modal framework including both DINOv2 and SAM has garnered attention due to their complementary capabilities. We wonder "can we build a unified model with knowledge from both foundation models?" To this end, we propose FS-DINO, with only DINOv2's encoder and a lightweight segmenter. The segmenter features a bottleneck adapter, a meta-visual prompt generator based on dense similarities and semantic embeddings, and a decoder. Through coarse-to-fine cross-model distillation, we effectively integrate SAM's knowledge into our lightweight segmenter, which can be further enhanced by 4D correlation mining on support-query pairs. Extensive experiments on COCO-20i, PASCAL-5i, and FSS-1000 demonstrate the effectiveness and superiority of our method.
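The abstract mentions 4D correlation mining on support-query pairs without defining it. In few-shot segmentation, a 4D correlation usually means the similarity between every query location and every support location. The sketch below assumes that reading, with cosine similarity clamped at zero (a common choice, not confirmed by the abstract); `correlation_4d` and the toy feature maps are illustrative only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def correlation_4d(query_feats, support_feats):
    """Dense correlation between two feature maps.

    query_feats[i][j] and support_feats[k][l] are feature vectors.
    Returns corr[i][j][k][l] = max(0, cosine similarity of the two vectors).
    """
    q = [[normalize(v) for v in row] for row in query_feats]
    s = [[normalize(v) for v in row] for row in support_feats]
    return [
        [
            [[max(0.0, sum(a * b for a, b in zip(qv, sv))) for sv in s_row] for s_row in s]
            for qv in q_row
        ]
        for q_row in q
    ]


# Toy 1x2 query map and 2x1 support map with 2-dimensional features.
query = [[[1.0, 0.0], [0.0, 1.0]]]
support = [[[1.0, 0.0]], [[0.0, 2.0]]]
corr = correlation_4d(query, support)
print(corr[0][0][0][0], corr[0][0][1][0], corr[0][1][1][0])
```

The result has shape (query height, query width, support height, support width), so a downstream module can look for consistent matching patterns across all four axes.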

[26] Vidi: Large Multimodal Models for Video Understanding and Editing

Vidi Team,Celong Liu,Chia-Wen Kuo,Dawei Du,Fan Chen,Guang Chen,Jiamin Yuan,Lingxi Zhang,Lu Guo,Lusha Li,Longyin Wen,Qingyu Chen,Rachel Deng,Sijie Zhu,Stuart Siew,Tong Jin,Wei Lu,Wen Zhong,Xiaohui Shen,Xin Gu,Xing Mei,Xueqiong Qu

Main category: cs.CV

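TL;DR: The paper introduces Vidi, a family of large multimodal models for video understanding and editing; the first release focuses on temporal retrieval and significantly outperforms existing models on the new VUE-TR benchmark.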

Details Motivation: Video has become a dominant medium on the Internet, but traditional models struggle with multimodal and long video inputs, which high-quality video editing requires. Method: Proposes the Vidi model family; the first release focuses on temporal retrieval, processes hour-long videos, and has strong temporal understanding. Result: Vidi significantly outperforms proprietary models such as GPT-4o and Gemini on temporal retrieval. Conclusion: Vidi performs well in video editing scenarios, and the new VUE-TR benchmark provides a more comprehensive evaluation standard for future research. Abstract: Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieving time ranges for certain queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: significantly longer than existing temporal retrieval datasets; 2) Audio support: includes audio-based queries; 3) Query format: diverse query lengths/formats; 4) Annotation quality: ground-truth time ranges are manually annotated; 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
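The abstract does not define its refined IoU metric. One natural way to score a prediction that consists of several time ranges is to treat prediction and ground truth as unions of intervals and divide the overlapping duration by the combined duration. The sketch below implements that reading; it is an assumption, not the VUE-TR metric itself, and all function names are illustrative.

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) ranges into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(ranges):
    """Total duration covered by a disjoint range list."""
    return sum(end - start for start, end in ranges)


def intersection_length(a, b):
    """Total overlap between two sorted, disjoint range lists."""
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if high > low:
            overlap += high - low
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def multi_range_iou(predicted, truth):
    """IoU between two sets of time ranges, each treated as a union of intervals."""
    predicted, truth = merge_ranges(predicted), merge_ranges(truth)
    inter = intersection_length(predicted, truth)
    union = total_length(predicted) + total_length(truth) - inter
    return inter / union if union > 0 else 0.0


# Times in seconds: 10 s of overlap over 25 s of combined coverage.
print(multi_range_iou([(0, 10), (20, 30)], [(5, 15), (20, 25)]))  # 0.4
```

Merging first means a model is not rewarded or penalized for splitting one ground-truth range into several adjacent predictions.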

[27] You Sense Only Once Beneath: Ultra-Light Real-Time Underwater Object Detection

Jun Dong,Wenli Wu,Jintao Cheng,Xiaoyu Tang

Main category: cs.CV

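TL;DR: Proposes YSOOB, an ultra-light real-time underwater object detection framework that uses multi-spectrum wavelet encoding and dynamic information enhancement to significantly improve performance on low-quality underwater images.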

Details Motivation: Low image quality and limited computational resources in underwater environments challenge the accuracy and efficiency of object detection models. Method: Uses a Multi-Spectrum Wavelet Encoder (MSWE) to reduce the effect of optical color distortion, dynamically selects key information to improve generalization, and makes the model lightweight through channel compression and reconstructed large kernel convolution (RLKC). Result: With only 1.2 million parameters, YSOOB reaches mAP50 of 83.1% and 82.9% on URPC2020 and DUO, respectively, and its inference speed is substantially higher than that of YOLOv12-N. Conclusion: YSOOB balances light weight and high performance and suits real-time underwater object detection. Abstract: Despite the remarkable achievements in object detection, the model's accuracy and efficiency still require further improvement under challenging underwater conditions, such as low image quality and limited computational resources. To address this, we propose an Ultra-Light Real-Time Underwater Object Detection framework, You Sense Only Once Beneath (YSOOB). Specifically, we utilize a Multi-Spectrum Wavelet Encoder (MSWE) to perform frequency-domain encoding on the input image, minimizing the semantic loss caused by underwater optical color distortion. Furthermore, we revisit the unique characteristics of even-sized and transposed convolutions, allowing the model to dynamically select and enhance key information during the resampling process, thereby improving its generalization ability. Finally, we eliminate model redundancy through a simple yet effective channel compression and reconstructed large kernel convolution (RLKC) to make the model lightweight. The result is YSOOB, a high-performance underwater object detector with only 1.2 million parameters. Extensive experimental results demonstrate that, with the fewest parameters, YSOOB achieves mAP50 of 83.1% and 82.9% on the URPC2020 and DUO datasets, respectively, comparable to the current SOTA detectors. The inference speed reaches 781.3 FPS and 57.8 FPS on the T4 GPU (TensorRT FP16) and the edge computing device Jetson Xavier NX (TensorRT FP16), surpassing YOLOv12-N by 28.1% and 22.5%, respectively.
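As background for the wavelet-based frequency-domain encoding mentioned above, here is a minimal one-level 2D Haar transform. The abstract does not say which wavelet or how many levels the MSWE uses, so the choice of Haar and the function `haar2d` are assumptions made only to illustrate how an image splits into a half-resolution approximation and three detail bands.

```python
def haar2d(image):
    """One-level orthonormal 2D Haar transform of an even-sized grayscale image.

    `image` is a list of rows. Returns four half-resolution bands:
    approximation, horizontal difference, vertical difference, diagonal difference.
    """
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(image), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(image[0]), 2):
            top_left, top_right = image[r][c], image[r][c + 1]
            bottom_left, bottom_right = image[r + 1][c], image[r + 1][c + 1]
            row_a.append((top_left + top_right + bottom_left + bottom_right) / 2)
            row_h.append((top_left - top_right + bottom_left - bottom_right) / 2)
            row_v.append((top_left + top_right - bottom_left - bottom_right) / 2)
            row_d.append((top_left - top_right - bottom_left + bottom_right) / 2)
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, horiz, vert, diag


# A flat region produces zero detail; an edge shows up in a detail band.
flat = [[4, 4], [4, 4]]
edge = [[0, 8], [0, 8]]
print(haar2d(flat))
print(haar2d(edge))
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared pixels, so no information is lost when the image is halved in resolution.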

[28] RePOPE: Impact of Annotation Errors on the POPE Benchmark

Yannic Neuhaus,Matthias Hein

Main category: cs.CV

TL;DR: The study assesses how MSCOCO label errors affect the POPE benchmark and finds notable shifts in model rankings after re-annotation.

Details Motivation: Data annotation is costly, so benchmark datasets often reuse labels that may contain errors, which affects evaluation results. Method: Re-annotates the POPE benchmark images, analyzes the distribution of label errors, and evaluates models on the corrected benchmark, RePOPE. Result: Label errors are unevenly distributed across subsets, and model rankings shift notably once labels are corrected. Conclusion: Label quality is critical for benchmark evaluation, and RePOPE offers a more reliable evaluation standard. Abstract: Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE .
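To see how a handful of corrected labels can reorder models, consider the toy sketch below. The yes/no answers, labels, and model names are invented for illustration and are not taken from POPE or RePOPE; only the mechanism (score against old labels, score against revised labels, compare rankings) reflects the abstract.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def rank_models(model_predictions, labels):
    """Model names sorted from highest to lowest accuracy."""
    scores = {name: accuracy(preds, labels) for name, preds in model_predictions.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Invented data: the second label was wrong in the original annotation.
original_labels = ["yes", "yes", "no", "no"]
revised_labels = ["yes", "no", "no", "no"]
predictions = {
    "model_a": ["yes", "yes", "no", "yes"],  # agrees with the erroneous label
    "model_b": ["yes", "no", "yes", "no"],   # was penalized by the erroneous label
}

print(rank_models(predictions, original_labels))
print(rank_models(predictions, revised_labels))
```

In this example a single corrected label is enough to swap the two models, which is the kind of ranking shift the abstract reports at benchmark scale.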

[29] Structure-Preserving Zero-Shot Image Editing via Stage-Wise Latent Injection in Diffusion Models

Dasol Jeong,Donggoo Kang,Jiwon Park,Hyebean Lee,Joonki Paik

Main category: cs.CV

TL;DR: Proposes a diffusion-based zero-shot image editing framework that unifies text-guided and reference-guided approaches without fine-tuning.

Details Motivation: Aims for a unified image editing method that needs no fine-tuning and preserves the structural integrity of the source image. Method: Uses diffusion inversion and timestep-specific null-text embeddings together with a stage-wise latent injection strategy (shape injection in early steps, attribute injection in later steps), and aligns semantics through cross-attention with reference latents. Result: The method performs well on expression transfer, texture transformation, and style infusion, showing scalability and adaptability. Conclusion: The method achieves state-of-the-art performance across diverse image editing scenarios. Abstract: We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. By introducing a stage-wise latent injection strategy (shape injection in early steps and attribute injection in later steps), we enable precise, fine-grained modifications while maintaining global consistency. Cross-attention with reference latents facilitates semantic alignment between the source and reference. Extensive experiments across expression transfer, texture transformation, and style infusion demonstrate state-of-the-art performance, confirming the method's scalability and adaptability to diverse image editing scenarios.

[30] SAGA: Semantic-Aware Gray color Augmentation for Visible-to-Thermal Domain Adaptation across Multi-View Drone and Ground-Based Vision Systems

Manjunath D,Aniruddh Sikdar,Prajwal Gurunath,Sumanth Udupa,Suresh Sundaram

Main category: cs.CV

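TL;DR: The paper proposes SAGA, a new strategy that mitigates color bias in RGB-to-IR domain adaptation, and introduces the multi-sensor IndraEye dataset. Experiments show that SAGA significantly improves performance.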

Details Motivation: Because IR images lack color and texture cues, RGB-trained models perform poorly in the IR domain, producing more false positives and low-quality pseudo-labels. Method: Proposes Semantic-Aware Gray color Augmentation (SAGA), which reduces color bias by extracting object-level features relevant to IR images, and releases the IndraEye dataset of 5,612 RGB-IR images. Result: SAGA performs well in RGB-to-IR domain adaptation, with gains of 0.4% to 7.6% mAP. Conclusion: SAGA and the IndraEye dataset are effective tools for multimodal learning and domain adaptation, particularly for drone imagery. Abstract: Domain-adaptive thermal object detection plays a key role in facilitating visible (RGB)-to-thermal (IR) adaptation by reducing the need for co-registered image pairs and minimizing reliance on large annotated IR datasets. However, inherent limitations of IR images, such as the lack of color and texture cues, pose challenges for RGB-trained models, leading to increased false positives and poor-quality pseudo-labels. To address this, we propose Semantic-Aware Gray color Augmentation (SAGA), a novel strategy for mitigating color bias and bridging the domain gap by extracting object-level features relevant to IR images. Additionally, to validate the proposed SAGA for drone imagery, we introduce IndraEye, a multi-sensor (RGB-IR) dataset designed for diverse applications. The dataset contains 5,612 images with 145,666 instances, captured from diverse angles, altitudes, backgrounds, and times of day, offering valuable opportunities for multimodal learning, domain adaptation for object detection and segmentation, and exploration of sensor-specific strengths and weaknesses. IndraEye aims to enhance the development of more robust and accurate aerial perception systems, especially in challenging environments. Experimental results show that SAGA significantly improves RGB-to-IR adaptation for autonomous driving and the IndraEye dataset, achieving consistent performance gains of +0.4% to +7.6% (mAP) when integrated with state-of-the-art domain adaptation techniques. The dataset and codes are available at https://github.com/airliisc/IndraEye.

[31] GADS: A Super Lightweight Model for Head Pose Estimation

Menan Velayuthan,Asiri Gawesha,Purushoth Velayuthan,Nuwan Kodagoda,Dharshana Kasthurirathna,Pradeepa Samarasinghe

Main category: cs.CV

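TL;DR: Proposes GADS, a new architecture that groups landmarks and uses small Deep Set layers to cut computational complexity, substantially reducing model size and increasing speed.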

Details Motivation: Existing landmark-based methods prioritize precision over model size and simplicity, which limits deployment on edge devices. Method: Uses the Grouped Attention Deep Sets (GADS) architecture, which groups landmarks into regions and applies multi-head attention to extract inter-group information. Result: The model is 7.5x smaller and 25x faster than the current lightest state-of-the-art model, and 4321x smaller than the best-performing model. Conclusion: GADS provides a strong baseline for resource-constrained head pose estimation. Abstract: In human-computer interaction, head pose estimation profoundly influences application functionality. Although utilizing facial landmarks is valuable for this purpose, existing landmark-based methods prioritize precision over simplicity and model size, limiting their deployment on edge devices and in compute-poor environments. To bridge this gap, we propose Grouped Attention Deep Sets (GADS), a novel architecture based on the Deep Set framework. By grouping landmarks into regions and employing small Deep Set layers, we reduce computational complexity. Our multihead attention mechanism extracts and combines inter-group information, resulting in a model that is $7.5\times$ smaller and executes $25\times$ faster than the current lightest state-of-the-art model. Notably, our method achieves an impressive reduction, being $4321\times$ smaller than the best-performing model. We introduce vanilla GADS and Hybrid-GADS (landmarks + RGB) and evaluate our models on three benchmark datasets: AFLW2000, BIWI, and 300W-LP. We envision our architecture as a robust baseline for resource-constrained head pose estimation methods.

[32] DSDNet: Raw Domain Demoiréing via Dual Color-Space Synergy

Qirui Yang,Fangpu Zhang,Yeying Jin,Qihua Cheng,Pengtao Jiang,Huanjing Yue,Jingyu Yang

Main category: cs.CV

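TL;DR: Proposes DSDNet, a single-stage raw-domain demoiréing framework that improves visual quality and efficiency with a dual-stream network and a dynamic modulation module.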

Details Motivation: Moiré artifacts are severe when smartphones photograph screens, and existing methods suffer from information loss or inefficiency. Method: Designs a dual-stream network and a dynamic modulation module that combine raw and YCbCr images to improve demoiréing. Result: DSDNet outperforms existing methods in visual quality and quantitative evaluation and runs inference 2.4x faster than the second-best method. Conclusion: DSDNet performs well on demoiréing, combining efficiency with high quality. Abstract: With the rapid advancement of mobile imaging, capturing screens using smartphones has become a prevalent practice in distance learning and conference recording. However, moiré artifacts, caused by frequency aliasing between display screens and camera sensors, are further amplified by the image signal processing pipeline, leading to severe visual degradation. Existing sRGB domain demoiréing methods struggle with irreversible information loss, while recent two-stage raw domain approaches suffer from information bottlenecks and inference inefficiency. To address these limitations, we propose a single-stage raw domain demoiréing framework, Dual-Stream Demoiréing Network (DSDNet), which leverages the synergy of raw and YCbCr images to remove moiré while preserving luminance and color fidelity. Specifically, to guide luminance correction and moiré removal, we design a raw-to-YCbCr mapping pipeline and introduce the Synergic Attention with Dynamic Modulation (SADM) module. This module enriches the raw-to-sRGB conversion with cross-domain contextual features. Furthermore, to better guide color fidelity, we develop a Luminance-Chrominance Adaptive Transformer (LCAT), which decouples luminance and chrominance representations. Extensive experiments demonstrate that DSDNet outperforms state-of-the-art methods in both visual quality and quantitative evaluation, and achieves an inference speed 2.4x faster than the second-best method, highlighting its practical advantages. We provide an anonymous online demo at https://xxxxxxxxdsdnet.github.io/DSDNet/.

[33] Multi-Scale Tensorial Summation and Dimensional Reduction Guided Neural Network for Edge Detection

Lei Xu,Mehmet Yamac,Mete Ahishali,Moncef Gabbouj

Main category: cs.CV

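TL;DR: The paper proposes MTS-DR-Net, a neural network built on the MTS-DR module for edge detection, which improves performance by removing redundant information and focusing on relevant subspaces.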

Details Motivation: Edge detection matters for downstream computer vision tasks, but conventional convolutions need many consecutive layers to reach a large receptive field, which yields deep networks. The MTS factorization operator addresses this but leaves room for further optimization. Method: Proposes the MTS-DR module as a new backbone that combines MTS layers and MTS-DR blocks to remove redundant information, followed by a U-shaped refinement module. Result: Experiments on BSDS500 and BIPEDv2 verify the effectiveness of MTS-DR-Net. Conclusion: By changing how information is processed, MTS-DR-Net improves edge detection performance. Abstract: Edge detection has attracted considerable attention thanks to its exceptional ability to enhance performance in downstream computer vision tasks. In recent years, various deep learning methods have been explored for edge detection tasks, resulting in a significant performance improvement compared to conventional computer vision algorithms. In neural networks, edge detection tasks require considerably large receptive fields to provide satisfactory performance. In a typical convolutional operation, such a large receptive field can be achieved by utilizing a significant number of consecutive layers, which yields deep network structures. Recently, a Multi-scale Tensorial Summation (MTS) factorization operator was presented, which can achieve very large receptive fields even from the initial layers. In this paper, we propose a novel MTS Dimensional Reduction (MTS-DR) module guided neural network, MTS-DR-Net, for the edge detection task. The MTS-DR-Net uses MTS layers and corresponding MTS-DR blocks as a new backbone to remove redundant information initially. Such a dimensional reduction module enables the neural network to focus specifically on relevant information (i.e., necessary subspaces). Finally, a weight U-shaped refinement module follows MTS-DR blocks in the MTS-DR-Net. We conducted extensive experiments on two benchmark edge detection datasets, BSDS500 and BIPEDv2, to verify the effectiveness of our model. The implementation of the proposed MTS-DR-Net can be found at https://github.com/LeiXuAI/MTS-DR-Net.git.

[34] Pose Optimization for Autonomous Driving Datasets using Neural Rendering Models

Quentin Herau,Nathan Piasco,Moussab Bennehar,Luis Rolado,Dzmitry Tsishkou,Bingbing Liu,Cyrille Migniot,Pascal Vasseur,Cédric Demonceaux

Main category: cs.CV

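TL;DR: Proposes a NeRF-based optimization method that refines sensor poses and calibration parameters in autonomous driving datasets, improving dataset accuracy.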

Details Motivation: Inaccurate sensor calibration and vehicle poses in public datasets can lead to erroneous evaluation of downstream tasks and undermine the reliability of autonomous driving systems. Method: Uses NeRF for robust optimization of sensor poses and calibration parameters, validated through reprojection metrics, novel view synthesis rendering quality, and geometric alignment. Result: The method significantly improves sensor pose accuracy and increases the utility of the datasets. Conclusion: The optimized sensor poses are publicly available, giving the research community a valuable resource and supporting more reliable autonomous driving models. Abstract: Autonomous driving systems rely on accurate perception and localization of the ego car to ensure safety and reliability in challenging real-world driving scenarios. Public datasets play a vital role in benchmarking and guiding advancement in research by providing standardized resources for model development and evaluation. However, potential inaccuracies in sensor calibration and vehicle poses within these datasets can lead to erroneous evaluations of downstream tasks, adversely impacting the reliability and performance of the autonomous systems. To address this challenge, we propose a robust optimization method based on Neural Radiance Fields (NeRF) to refine sensor poses and calibration parameters, enhancing the integrity of dataset benchmarks. To validate the improvement in accuracy of our optimized poses without ground truth, we present a thorough evaluation process, relying on reprojection metrics, Novel View Synthesis rendering quality, and geometric alignment. We demonstrate that our method achieves significant improvements in sensor pose accuracy. By optimizing these critical parameters, our approach not only improves the utility of existing datasets but also paves the way for more reliable autonomous driving models. To foster continued progress in this field, we make the optimized sensor poses publicly available, providing a valuable resource for the research community.
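Reprojection metrics, one of the three checks named above, can be illustrated with a pinhole camera: project known 3D points into the image and measure how far they land from where they were observed. The sketch assumes points already expressed in the camera frame and an undistorted pinhole model; the intrinsics and points are made-up values, and the paper's exact metric may differ.

```python
import math


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in the camera frame (z > 0)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(points_cam, observed_px, fx, fy, cx, cy):
    """Mean pixel distance between projected points and their observations."""
    errors = []
    for point, (u_obs, v_obs) in zip(points_cam, observed_px):
        u, v = project(point, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)


# Made-up intrinsics and data: the second observation is 5 px off.
fx = fy = 100.0
cx, cy = 50.0, 40.0
points = [(0.0, 0.0, 5.0), (1.0, 0.0, 2.0)]
observations = [(50.0, 40.0), (103.0, 44.0)]
print(mean_reprojection_error(points, observations, fx, fy, cx, cy))  # 2.5
```

A pose or calibration refinement that is genuinely more accurate should lower this error across many frames, which is why it is usable without ground-truth poses.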

[35] Towards prediction of morphological heart age from computed tomography angiography

Johan Öfverstedt,Elin Lundström,Håkan Ahlström,Joel Kullberg

Main category: cs.CV

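TL;DR: The study predicts age from CTA images, proposes a morphological heart age biomarker based on image registration and machine learning, and analyzes the relationship between heart morphology and aging.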

Details Motivation: To study the relationship between heart morphology and aging and to develop a new morphological heart age biomarker. Method: Standardizes the images with image registration, extracts robust features of density and local volume, and trains machine learning models to predict age. Result: On the SCAPIS dataset, the mean absolute error is 2.74 years for females and 2.77 years for males, and predictions from different sub-regions are highly consistent with those from the whole heart. Conclusion: Morphological heart age can serve as a reliable biomarker, and saliency analysis improves the interpretability of the models. Abstract: Age prediction from medical images or other health-related non-imaging data is an important approach to data-driven aging research, providing knowledge of how much information a specific tissue or organ carries about the chronological age of the individual. In this work, we studied the prediction of age from computed tomography angiography (CTA) images, which provide detailed representations of the heart morphology, with the goals of (i) studying the relationship between morphology and aging, and (ii) developing a novel morphological heart age biomarker. We applied an image registration-based method that standardizes the images from the whole cohort into a single space. We then extracted supervoxels (using unsupervised segmentation), and corresponding robust features of density and local volume, which provide a detailed representation of the heart morphology while being robust to registration errors. Machine learning models are then trained to fit regression models from these features to the chronological age. We applied the method to a subset of the images from the Swedish CArdioPulmonary bioImage Study (SCAPIS) dataset, consisting of 721 females and 666 males. We observe a mean absolute error of $2.74$ years for females and $2.77$ years for males. The predictions from different sub-regions of interest were observed to be more highly correlated with the predictions from the whole heart, compared to the chronological age, revealing a high consistency in the predictions from morphology. Saliency analysis was also performed on the prediction models to study what regions are associated positively and negatively with the predicted age. This resulted in detailed association maps where the density and volume of known, as well as some novel sub-regions of interest, are determined to be important. The saliency analysis aids in the interpretability of the models and their predictions.

[36] Satellite to GroundScape -- Large-scale Consistent Ground View Generation from Satellite Views

Ningli Xu,Rongjun Qin

Main category: cs.CV

TL;DR: Proposes a novel cross-view synthesis method that uses a fixed latent diffusion model with two conditioning modules to handle the viewing-angle and resolution gaps when generating ground views from satellite imagery, and contributes a large-scale dataset.

Details Motivation: Satellite and ground-level images differ in viewing angle and resolution, so generated ground views are inconsistent across neighboring locations; existing methods mostly generate single views. Method: Built on a fixed latent diffusion model, the approach adds two conditioning modules, satellite-guided denoising and satellite-temporal denoising, which extract scene layout and camera motion respectively. Result: Experiments show the method outperforms existing methods on perceptual and temporal metrics, producing multi-view outputs with high photorealism and consistency. Conclusion: The method effectively addresses consistency in cross-view synthesis and provides a large-scale dataset to support future research. Abstract: Generating consistent ground-view images from satellite imagery is challenging, primarily due to the large discrepancies in viewing angles and resolution between satellite and ground-level domains. Previous efforts mainly concentrated on single-view generation, often resulting in inconsistencies across neighboring ground views. In this work, we propose a novel cross-view synthesis approach designed to overcome these challenges by ensuring consistency across ground-view images generated from satellite views. Our method, based on a fixed latent diffusion model, introduces two conditioning modules: satellite-guided denoising, which extracts high-level scene layout to guide the denoising process, and satellite-temporal denoising, which captures camera motion to maintain consistency across multiple generated views. We further contribute a large-scale satellite-ground dataset containing over 100,000 perspective pairs to facilitate extensive ground scene or video generation. Experimental results demonstrate that our approach outperforms existing methods on perceptual and temporal metrics, achieving high photorealism and consistency in multi-view outputs.

[37] Development and evaluation of a deep learning algorithm for German word recognition from lip movements

Dinh Nam Pham,Torsten Rahne

Main category: cs.CV

TL;DR: Proposes a neural-network-based lip-reading algorithm for German that combines 3D CNN and GRU models, markedly improving recognition accuracy, with the best results when the video is cropped to the lip region.

Details Motivation: Existing lip-reading algorithms are mostly designed for English, comparable work for German is lacking, and lip reading without algorithmic support is very error prone. Method: Using 1806 German video clips, 3D CNN, GRU, and combined (GRUConv) models are trained, and different image sections and color spaces are compared. Result: GRUConv reaches 87% accuracy with known speakers and 63% with unknown speakers; cropping to the lips markedly improves performance. Conclusion: A lip-reading network developed for German achieves high accuracy, comparable to English-language algorithms, and can be generalized to more word classes. Abstract: When reading lips, many people benefit from additional visual information from the lip movements of the speaker, which is, however, very error prone. Algorithms for lip reading with artificial intelligence based on artificial neural networks significantly improve word recognition but are not available for the German language. A total of 1806 video clips with only one German-speaking person each were selected, split into word segments, and assigned to word classes using speech-recognition software. In 38,391 video segments with 32 speakers, 18 polysyllabic, visually distinguishable words were used to train and validate a neural network. The 3D Convolutional Neural Network and Gated Recurrent Units models and a combination of both models (GRUConv) were compared, as were different image sections and color spaces of the videos. The accuracy was determined in 5000 training epochs. Comparison of the color spaces did not reveal any relevant differences in correct classification rates, which ranged from 69% to 72%. With a cut to the lips, a significantly higher accuracy of 70% was achieved than when cut to the entire speaker's face (34%). With the GRUConv model, the maximum accuracies were 87% with known speakers and 63% in the validation with unknown speakers. The neural network for lip reading, which was first developed for the German language, shows a very high level of accuracy, comparable to English-language algorithms. It works with unknown speakers as well and can be generalized with more word classes.

[38] Locating and Mitigating Gradient Conflicts in Point Cloud Domain Adaptation via Saliency Map Skewness

Jiaqi Tang,Yinsong Xu,Qingchao Chen

Main category: cs.CV

TL;DR: Proposes a Saliency Map-based Data Sampling Block (SM-DSB) to address gradient conflicts from self-supervision tasks in point cloud unsupervised domain adaptation (UDA).

Details Motivation: Existing methods combine self-supervision tasks within a multi-task learning (MTL) framework, but some of their gradients can harm classification performance. Method: Designs a scoring mechanism based on the skewness of 3D saliency maps and dynamically selects beneficial samples to reduce gradient conflicts. Result: The method outperforms state-of-the-art approaches in experiments with modest computational overhead. Conclusion: SM-DSB offers a new perspective on the UDA problem, supported by a back-propagation analysis. Abstract: Object classification models utilizing point cloud data are fundamental for 3D media understanding, yet they often struggle with unseen or out-of-distribution (OOD) scenarios. Existing point cloud unsupervised domain adaptation (UDA) methods typically employ a multi-task learning (MTL) framework that combines primary classification tasks with auxiliary self-supervision tasks to bridge the gap between cross-domain feature distributions. However, our further experiments demonstrate that not all gradients from self-supervision tasks are beneficial and some may negatively impact the classification performance. In this paper, we propose a novel solution, termed Saliency Map-based Data Sampling Block (SM-DSB), to mitigate these gradient conflicts. Specifically, our method designs a new scoring mechanism based on the skewness of 3D saliency maps to estimate gradient conflicts without requiring target labels. Leveraging this, we develop a sample selection strategy that dynamically filters out samples whose self-supervision gradients are not beneficial for the classification. Our approach is scalable, introducing modest computational overhead, and can be integrated into all the point cloud UDA MTL frameworks. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches. In addition, we provide a new perspective on understanding the UDA problem through back-propagation analysis.
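
The abstract above does not give the exact scoring rule, so the following is only a minimal sketch of the general idea of scoring samples by the skewness of their saliency values and filtering on that score. The function names, the population-form skewness, the threshold, and the direction of the filter (keeping samples at or above the threshold) are all assumptions made for illustration, not the authors' SM-DSB implementation.

```python
def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        return 0.0  # a constant saliency map has no asymmetry
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5


def select_samples(saliency_maps, threshold):
    """Return indices of samples whose saliency skewness is >= threshold.

    `saliency_maps` is a list of per-point saliency scores, one list per
    point cloud. Both the threshold and the keep-if-high rule are
    hypothetical choices for this sketch.
    """
    return [i for i, s in enumerate(saliency_maps) if skewness(s) >= threshold]


if __name__ == "__main__":
    maps = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 10.0]]
    print([round(skewness(m), 4) for m in maps])
    print(select_samples(maps, threshold=0.5))
```

A symmetric saliency distribution scores zero, while a map dominated by a few highly salient points scores strongly positive, which is what makes skewness usable as a label-free per-sample score.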

[39] Human-Imperceptible Physical Adversarial Attack for NIR Face Recognition Models

Songyan Xie,Jinghang Wen,Encheng Su,Qiucheng Yu

Main category: cs.CV

TL;DR: Proposes a novel, stealthy, and practical adversarial patch for attacking near-infrared (NIR) face recognition systems in a black-box setting; digitally optimized patch shapes and positions, combined with a light reflection model, markedly raise the attack success rate.

Details Motivation: NIR face recognition systems work well in low light or with makeup but are vulnerable to physical adversarial attacks; this work aims to expose the resulting risks in real-world use. Method: Patches with digitally optimized shapes and positions are produced with human-imperceptible infrared-absorbing ink, and a light reflection model simulates NIR reflection on skin to reduce the gap between digital and real-world imaging. Result: The attack success rate exceeds prior work in both the digital and physical domains, averaging 82.46% in the physical domain versus 64.18% for existing methods. Conclusion: The method improves adversarial patch attacks on NIR face recognition systems, stays effective across face postures, and demonstrates a concrete threat in practical deployments. Abstract: Near-infrared (NIR) face recognition systems, which can operate effectively in low-light conditions or in the presence of makeup, exhibit vulnerabilities when subjected to physical adversarial attacks. To further demonstrate the potential risks in real-world applications, we design a novel, stealthy, and practical adversarial patch to attack NIR face recognition systems in a black-box setting. We achieved this by utilizing human-imperceptible infrared-absorbing ink to generate multiple patches with digitally optimized shapes and positions for infrared images. To address the optimization mismatch between digital and real-world NIR imaging, we develop a light reflection model for human skin to minimize pixel-level discrepancies by simulating NIR light reflection. Compared to state-of-the-art (SOTA) physical attacks on NIR face recognition systems, the experimental results show that our method improves the attack success rate in both digital and physical domains, particularly maintaining effectiveness across various face postures. Notably, the proposed approach outperforms SOTA methods, achieving an average attack success rate of 82.46% in the physical domain across different models, compared to 64.18% for existing methods. The artifact is available at https://anonymous.4open.science/r/Human-imperceptible-adversarial-patch-0703/.

[40] Text-based Animatable 3D Avatars with Morphable Model Alignment

Yiqian Wu,Malte Prinzler,Xiaogang Jin,Siyu Tang

Main category: cs.CV

TL;DR: Proposes AnimPortrait3D, a framework that combines a pretrained text-to-3D model with a ControlNet to resolve appearance, geometry, and alignment problems in text-driven 3D avatar generation and to improve animation quality.

Details Motivation: Existing text-to-3D avatar methods fall short on realistic detail and animation alignment, mainly because of ambiguities in the 2D diffusion predictions and insufficient semantic alignment. Method: (1) Initialize the avatar with a pretrained text-to-3D model; (2) refine dynamic expressions with a ControlNet conditioned on semantic and normal maps of the morphable model. Result: Experiments show the method outperforms existing approaches in synthesis quality, alignment, and animation fidelity. Conclusion: AnimPortrait3D advances text-based generation of animatable 3D head avatars. Abstract: The generation of high-quality, animatable 3D head avatars from text has enormous potential in content creation applications such as games, movies, and embodied virtual assistants. Current text-to-3D generation methods typically combine parametric head models with 2D diffusion models using score distillation sampling to produce 3D-consistent results. However, they struggle to synthesize realistic details and suffer from misalignments between the appearance and the driving parametric model, resulting in unnatural animation results. We discovered that these limitations stem from ambiguities in the 2D diffusion predictions during 3D avatar distillation, specifically: i) the avatar's appearance and geometry are underconstrained by the text input, and ii) the semantic alignment between the predictions and the parametric head model is insufficient because the diffusion model alone cannot incorporate information from the parametric model. In this work, we propose a novel framework, AnimPortrait3D, for text-based realistic animatable 3DGS avatar generation with morphable model alignment, and introduce two key strategies to address these challenges. First, we tackle appearance and geometry ambiguities by utilizing prior information from a pretrained text-to-3D model to initialize a 3D avatar with robust appearance, geometry, and rigging relationships to the morphable model. Second, we refine the initial 3D avatar for dynamic expressions using a ControlNet that is conditioned on semantic and normal maps of the morphable model to ensure accurate alignment. As a result, our method outperforms existing approaches in terms of synthesis quality, alignment, and animation fidelity. Our experiments show that the proposed method advances the state of the art in text-based, animatable 3D head avatar generation.

[41] DERD-Net: Learning Depth from Event-based Ray Densities

Diego de Oliveira Hitzges,Suman Ghosh,Guillermo Gallego

Main category: cs.CV

TL;DR: Proposes an event-camera depth estimation framework that processes asynchronous event data and predicts depth efficiently in both monocular and stereo setups.

Details Motivation: Conventional deep learning frameworks struggle with the asynchronous data stream of event cameras, so a method suited to the nature of event data is needed. Method: Events are back-projected into space using known camera poses to build disparity space images (DSIs), and local subregions are processed with 3D convolutions and a recurrent structure to predict depth. Result: On standard benchmarks, monocular results are comparable to existing stereo methods; with stereo data, the mean absolute error drops by at least 42% and depth completeness increases more than 3-fold. Conclusion: The framework shows promise for event-based depth estimation and SLAM and may become a standard approach. Abstract: Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: https://github.com/tub-rip/DERD-Net

Lotfi Abdelkrim Mecharbat,Ibrahim Elmakky,Martin Takac,Mohammed Yaqub

Main category: cs.CV

TL;DR: MedNNS is a neural network search framework for medical imaging that jointly optimizes architecture selection and weight initialization and markedly improves model performance.

Details Motivation: For medical imaging tasks, architecture selection and weight initialization are key challenges for deep learning models, and existing approaches such as transfer learning from ImageNet have limited effectiveness. Method: MedNNS builds a meta-space with a Supernetwork-based approach and introduces rank loss and FID loss to better align models and datasets. Result: Experiments show an average accuracy improvement of 1.7% across multiple datasets, with faster convergence. Conclusion: MedNNS offers an efficient neural network search solution for medical imaging tasks. Abstract: Deep learning (DL) has achieved remarkable progress in the field of medical imaging. However, adapting DL models to medical tasks remains a significant challenge, primarily due to two key factors: (1) architecture selection, as different tasks necessitate specialized model designs, and (2) weight initialization, which directly impacts the convergence speed and final performance of the models. Although transfer learning from ImageNet is a widely adopted strategy, its effectiveness is constrained by the substantial differences between natural and medical images. To address these challenges, we introduce Medical Neural Network Search (MedNNS), the first Neural Network Search framework for medical imaging applications. MedNNS jointly optimizes architecture selection and weight initialization by constructing a meta-space that encodes datasets and models based on how well they perform together. We build this space using a Supernetwork-based approach, expanding the model zoo size by 51x over previous state-of-the-art (SOTA) methods. Moreover, we introduce rank loss and Fréchet Inception Distance (FID) loss into the construction of the space to capture inter-model and inter-dataset relationships, thereby achieving more accurate alignment in the meta-space. Experimental results across multiple datasets demonstrate that MedNNS significantly outperforms both ImageNet pre-trained DL models and SOTA Neural Architecture Search (NAS) methods, achieving an average accuracy improvement of 1.7% across datasets while converging substantially faster. The code and the processed meta-space are available at https://github.com/BioMedIA-MBZUAI/MedNNS.

[43] Integrating Non-Linear Radon Transformation for Diabetic Retinopathy Grading

Farida Mohsen,Samir Belhaouari,Zubair Shah

Main category: cs.CV

TL;DR: RadFuse is a multi-representation deep learning framework that combines sinogram images from the non-linear RadEx transform with conventional fundus images, markedly improving the detection and grading of diabetic retinopathy.

Details Motivation: Early detection and accurate grading of diabetic retinopathy are essential to prevent vision loss, but existing methods struggle to capture the complex, irregular patterns of lesions. Method: RadFuse generates sinogram representations with the RadEx transform, combines spatial and transformed-domain information, and is evaluated with three CNN architectures (ResNeXt-50, MobileNetV2, VGG19). Result: On the APTOS-2019 and DDR datasets, RadFuse reaches a quadratic weighted kappa of 93.24% for five-stage severity grading and an accuracy of 99.09% for binary classification, outperforming existing methods. Conclusion: RadFuse effectively captures complex non-linear features, advancing diabetic retinopathy classification and the use of mathematical transforms in medical image analysis. Abstract: Diabetic retinopathy is a serious ocular complication that poses a significant threat to patients' vision and overall health. Early detection and accurate grading are essential to prevent vision loss. Current automatic grading methods rely heavily on deep learning applied to retinal fundus images, but the complex, irregular patterns of lesions in these images, which vary in shape and distribution, make it difficult to capture subtle changes. This study introduces RadFuse, a multi-representation deep learning framework that integrates non-linear RadEx-transformed sinogram images with traditional fundus images to enhance diabetic retinopathy detection and grading. Our RadEx transformation, an optimized non-linear extension of the Radon transform, generates sinogram representations to capture complex retinal lesion patterns. By leveraging both spatial and transformed domain information, RadFuse enriches the feature set available to deep learning models, improving the differentiation of severity levels. We conducted extensive experiments on two benchmark datasets, APTOS-2019 and DDR, using three convolutional neural networks (CNNs): ResNeXt-50, MobileNetV2, and VGG19. RadFuse showed significant improvements over fundus-image-only models across all three CNN architectures and outperformed state-of-the-art methods on both datasets. For severity grading across five stages, RadFuse achieved a quadratic weighted kappa of 93.24%, an accuracy of 87.07%, and an F1-score of 87.17%. In binary classification between healthy and diabetic retinopathy cases, the method reached an accuracy of 99.09%, precision of 98.58%, and recall of 99.6%, surpassing previously established models. These results demonstrate RadFuse's capacity to capture complex non-linear features, advancing diabetic retinopathy classification and promoting the integration of advanced mathematical transforms in medical image analysis.
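
For readers unfamiliar with the quadratic weighted kappa reported above for five-stage grading, the following minimal sketch computes the standard metric from two lists of integer grades. The function name and the example labels are illustrative assumptions; this is the textbook definition, not code from the RadFuse authors, and the abstract's 93.24% presumably corresponds to a kappa of about 0.93 expressed as a percentage.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with quadratic disagreement weights.

    Labels are integers in [0, num_classes). Returns 1.0 for perfect
    agreement and about 0.0 for chance-level agreement.
    """
    n = len(y_true)
    k = num_classes
    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0
    denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        return 1.0  # every label falls in one class, so there is no disagreement
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    grades_true = [0, 1, 2, 3, 4, 2, 1]
    grades_pred = [0, 1, 2, 4, 4, 1, 1]
    print(quadratic_weighted_kappa(grades_true, grades_pred, num_classes=5))
```

The quadratic weights penalize a prediction two grades off four times as heavily as one grade off, which is why the metric suits ordinal severity scales.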

[44] MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction

Zhiqiang Wei,Lianqing Zheng,Jianan Liu,Tao Huang,Qing-Long Han,Wenwen Zhang,Fengdeng Zhang

Main category: cs.CV

TL;DR: MS-Occ is a multi-stage LiDAR-camera fusion framework that combines geometric and semantic information through middle-stage and late-stage fusion, markedly improving 3D semantic occupancy perception.

Details Motivation: To overcome the geometric inaccuracy of vision-centric methods and the limited semantic information of LiDAR-based methods, improving autonomous driving perception in complex environments. Method: A multi-stage fusion framework with the Gaussian-Geo and Semantic-Aware modules in the middle stage and the Adaptive Fusion and HCCVF modules in the late stage. Result: On the nuScenes-OpenOccupancy benchmark, the method reaches 32.1% IoU and 25.3% mIoU, surpassing existing methods. Conclusion: Through its modular design and fusion strategy, MS-Occ markedly improves 3D semantic occupancy perception, with particularly strong gains on small objects. Abstract: Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, MS-Occ, a novel multi-stage LiDAR-camera fusion framework which includes middle-stage fusion and late-stage fusion, is proposed, integrating LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on the nuScenes-OpenOccupancy benchmark show that MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state-of-the-art by +0.7% IoU and +2.4% mIoU. Ablation studies further validate the contribution of each module, with substantial improvements in small-object perception, demonstrating the practical value of MS-Occ for safety-critical autonomous driving scenarios.
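
The two metrics quoted above can be sketched as follows for flattened voxel label arrays. This is a generic illustration under stated assumptions: label 0 is taken to mean an empty voxel, IoU is computed on occupied versus empty voxels, and mIoU averages per-class IoU over classes that appear in either array. The benchmark's official evaluation script may differ in details such as ignored labels or visibility masks.

```python
def occupancy_iou(pred, target, empty_label=0):
    """Geometric IoU: treats every non-empty voxel as 'occupied'."""
    inter = sum(1 for p, t in zip(pred, target)
                if p != empty_label and t != empty_label)
    union = sum(1 for p, t in zip(pred, target)
                if p != empty_label or t != empty_label)
    return inter / union if union else 1.0


def mean_iou(pred, target, class_labels):
    """Mean of per-class IoU over classes present in pred or target."""
    ious = []
    for c in class_labels:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both arrays
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Hypothetical flattened voxel grids: 0 = empty, 1 and 2 = semantic classes.
    pred = [0, 1, 1, 2, 2, 0]
    target = [0, 1, 2, 2, 2, 1]
    print(occupancy_iou(pred, target))        # 0.8
    print(mean_iou(pred, target, [1, 2]))     # 0.5
```

In this toy example the geometry is mostly right (IoU 0.8) while the class assignment is weaker (mIoU 0.5), which mirrors why occupancy benchmarks report both numbers.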

[45] Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions

Chang Zong,Bin Li,Shoujun Zhou,Jian Wan,Lei Zhang

Main category: cs.CV

TL;DR: Introduces a new task, In-VAL, which simulates the multiple interactions between a human and a video needed to obtain visual answers, and proposes the Ask2Loc framework to address the resulting semantic gaps.

Details Motivation: Users usually need several interactions to obtain video segments that match their expectations, and existing methods do not adequately model this process. Method: Ask2Loc comprises a chatting module, a rewriting module, and a searching module, which address ambiguous intent, incomplete language, and fragmented content respectively. Result: On three reconstructed In-VAL datasets, Ask2Loc improves over traditional methods by up to 14.91 (mIoU). Conclusion: By simulating the interaction process, Ask2Loc effectively addresses the semantic gaps in the In-VAL task and clearly outperforms existing methods. Abstract: Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.

[46] ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting

Jian Hu,Dimitrios Korkinof,Shaogang Gong,Mariano Beguerisse-Diaz

Main category: cs.CV

TL;DR: ViSMaP is an unsupervised video summarisation system that uses meta-prompting to produce summaries of long videos without costly manual annotation.

Details Motivation: To summarise long videos in which relevant events are sparse and not pre-segmented, without relying on expensive supervised training. Method: LLMs generate optimised pseudo-summaries, which a meta-prompting strategy iteratively produces and refines. Result: Strong results on multiple datasets, with performance close to fully supervised state-of-the-art models and generalisation across domains. Conclusion: ViSMaP offers an efficient, low-cost approach to long-video summarisation that does not depend on large amounts of annotated data. Abstract: We introduce ViSMaP: Unsupervised Video Summarisation by Meta Prompting, a system to summarise hour-long videos with no supervision. Most existing video understanding models work well on short videos of pre-segmented events, yet they struggle to summarise longer videos where relevant events are sparsely distributed and not pre-segmented. Moreover, long-form video understanding often relies on supervised hierarchical training that needs extensive annotations which are costly, slow and prone to inconsistency. With ViSMaP we bridge the gap between short videos (where annotated data is plentiful) and long ones (where it's not). We rely on LLMs to create optimised pseudo-summaries of long videos using segment descriptions from short ones. These pseudo-summaries are used as training data for a model that generates long-form video summaries, bypassing the need for expensive annotations of long videos. Specifically, we adopt a meta-prompting strategy to iteratively generate and refine pseudo-summaries of long videos. The strategy leverages short clip descriptions obtained from a supervised short video model to guide the summary. Each iteration uses three LLMs working in sequence: one to generate the pseudo-summary from clip descriptions, another to evaluate it, and a third to optimise the prompt of the generator. This iteration is necessary because the quality of the pseudo-summaries is highly dependent on the generator prompt, and varies widely among videos. We evaluate our summaries extensively on multiple datasets; our results show that ViSMaP achieves performance comparable to fully supervised state-of-the-art models while generalising across domains without sacrificing performance. Code will be released upon publication.
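
The generate, evaluate, and optimise cycle described in the abstract can be sketched as a small control loop. Everything concrete here is an assumption for illustration: the three roles are passed in as plain callables (in practice they would wrap LLM calls), the evaluator is assumed to return a numeric score, and keeping the best-scoring summary is one plausible stopping policy that the abstract does not specify.

```python
from typing import Callable, List, Tuple


def meta_prompt_loop(
    clip_descriptions: List[str],
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, List[str]], float],
    optimise_prompt: Callable[[str, str, float], str],
    initial_prompt: str,
    iterations: int = 3,
) -> Tuple[str, float]:
    """Iteratively refine a generator prompt and keep the best pseudo-summary."""
    prompt = initial_prompt
    best_summary, best_score = "", float("-inf")
    for _ in range(iterations):
        summary = generate(prompt, clip_descriptions)     # role 1: generator
        score = evaluate(summary, clip_descriptions)      # role 2: evaluator
        if score > best_score:
            best_summary, best_score = summary, score
        prompt = optimise_prompt(prompt, summary, score)  # role 3: prompt optimiser
    return best_summary, best_score


if __name__ == "__main__":
    # Deterministic stand-ins for the three LLM roles (hypothetical).
    clips = ["a cat jumps", "a dog runs"]
    gen = lambda prompt, c: prompt + ": " + "; ".join(c)
    ev = lambda summary, c: float(len(summary))
    opt = lambda prompt, summary, score: prompt + " more detail"
    print(meta_prompt_loop(clips, gen, ev, opt, "Summarise"))
```

Passing the roles as callables keeps the loop testable with stubs and makes it easy to swap in different models for each role.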

[47] A Clinician-Friendly Platform for Ophthalmic Image Analysis Without Technical Barriers

Meng Wang,Tian Lin,Qingshan Hou,Aidi Lin,Jingcheng Wang,Qingsheng Peng,Truong X. Nguyen,Danqi Fang,Ke Zou,Ting Xu,Cancan Xue,Ten Cheer Quek,Qinkai Yu,Minxin Liu,Hui Zhou,Zixuan Xiao,Guiqin He,Huiyu Liang,Tingkun Shi,Man Chen,Linna Liu,Yuanyuan Peng,Lianyu Wang,Qiuming Hu,Junhong Chen,Zhenhua Zhang,Cheng Chen,Yitian Zhao,Dianbo Liu,Jianhua Wu,Xinjian Chen,Changqing Zhang,Triet Thanh Nguyen,Yanda Meng,Yalin Zheng,Yih Chung Tham,Carol Y. Cheung,Huazhu Fu,Haoyu Chen,Ching-Yu Cheng

Main category: cs.CV

TL;DR: GlobeReady is an AI platform for ocular disease diagnosis that can be used across clinical centers without retraining and shows high accuracy and clinical utility.

Details Motivation: To remove the need to retrain AI models when deploying them at different clinical centers, enabling wider use of AI in medical imaging diagnostics. Method: Training-free local feature augmentation handles domain shifts across centers and populations, and a built-in approach provides quantifiable diagnostic confidence. Result: Strong performance across imaging modalities and clinical centers in several countries: fundus accuracy of 93.9-98.5%, rising to 94.9-99.4% with the confidence-quantifiable approach, and an average clinician rating of 4.6 out of 5. Conclusion: GlobeReady demonstrates the potential for scalable ophthalmic diagnosis without technical barriers. Abstract: Artificial intelligence (AI) shows remarkable potential in medical imaging diagnostics, but current models typically require retraining when deployed across different clinical centers, limiting their widespread adoption. We introduce GlobeReady, a clinician-friendly AI platform that enables ocular disease diagnosis without retraining/fine-tuning or technical expertise. GlobeReady achieves high accuracy across imaging modalities: 93.9-98.5% for an 11-category fundus photo dataset and 87.2-92.7% for a 15-category OCT dataset. Through training-free local feature augmentation, it addresses domain shifts across centers and populations, reaching an average accuracy of 88.9% across five centers in China, 86.3% in Vietnam, and 90.2% in the UK. The built-in confidence-quantifiable diagnostic approach further boosted accuracy to 94.9-99.4% (fundus) and 88.2-96.2% (OCT), while identifying out-of-distribution cases at 86.3% (49 CFP categories) and 90.6% (13 OCT categories). Clinicians from multiple countries rated GlobeReady highly (average 4.6 out of 5) for its usability and clinical relevance. These results demonstrate GlobeReady's robust, scalable diagnostic capability and potential to support ophthalmic care without technical barriers.

[48] Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models

Saban Ozturk,Melih B. Yilmaz,Muti Kara,M. Talat Yavuz,Aykut Koç,Tolga Çukur

Main category: cs.CV

TL;DR: MedTrim is a new method for medical vision-language models that improves image-text alignment through multimodal triplet learning, sharpening the separation of fine-grained pathology attributes and outperforming existing methods on chest X-ray evaluation.

Details Motivation: Growing volumes of medical imaging data increase pressure on experts, and existing alignment methods struggle to distinguish fine-grained pathology attributes, which limits model performance. Method: MedTrim combines disease classes with pathology descriptors and improves alignment through triplet learning guided by structured meta-entity information. Result: MedTrim outperforms existing methods on downstream retrieval and classification tasks. Conclusion: Fine-grained alignment with MedTrim improves medical vision-language models and shows potential for clinical use. Abstract: Diagnostic imaging relies on interpreting both images and radiology reports, but the growing data volumes place significant pressure on medical experts, yielding increased errors and workflow backlogs. Medical vision-language models (med-VLMs) have emerged as a powerful framework to efficiently process multimodal imaging data, particularly in chest X-ray (CXR) evaluations, although their performance hinges on how well image and text representations are aligned. Existing alignment methods, predominantly based on contrastive learning, prioritize separation between disease classes over segregation of fine-grained pathology attributes like location, size or severity, leading to suboptimal representations. Here, we propose MedTrim (Meta-entity-driven Triplet mining), a novel method that enhances image-text alignment through multimodal triplet learning synergistically guided by disease class as well as adjectival and directional pathology descriptors. Unlike common alignment methods that separate broad disease classes, MedTrim leverages structured meta-entity information to preserve subtle but clinically significant intra-class variations. For this purpose, we first introduce an ontology-based entity recognition module that extracts pathology-specific meta-entities from CXR reports, as annotations on pathology attributes are rare in public datasets. For refined sample selection in triplet mining, we then introduce a novel score function that captures an aggregate measure of inter-sample similarity based on disease classes and adjectival/directional descriptors. Lastly, we introduce a multimodal triplet alignment objective for explicit within- and cross-modal alignment between samples sharing detailed pathology characteristics. Our demonstrations indicate that MedTrim improves performance in downstream retrieval and classification tasks compared to state-of-the-art alignment methods.
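
As background for the triplet learning mentioned above, the following minimal sketch shows the standard triplet margin loss on plain embedding vectors. It is not MedTrim's objective: the abstract does not specify the score function used for triplet mining or the exact form of the multimodal alignment loss, so the Euclidean distance, the margin value, and the toy embeddings are assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: e.g. an image anchor, a report that shares
    # its pathology attributes (positive), and one that does not (negative).
    anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
    print(triplet_margin_loss(anchor, positive, negative))  # 0.0
    print(triplet_margin_loss(anchor, negative, positive))  # 5.0
```

In a cross-modal setting the anchor and the positive or negative would come from different encoders (image and text), while the loss itself keeps the same form.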

[49] Benchmarking the Reproducibility of Brain MRI Segmentation Across Scanners and Time

Ekaterina Kondrateva,Sandzhi Barg,Mikhail Vasiliev

Main category: cs.CV

TL;DR: This study benchmarks two modern brain segmentation pipelines, FastSurfer and SynthSeg, on longitudinal and multi-site datasets, finds substantial volume variation in small subcortical structures, and proposes ways to improve segmentation reliability.

Details Motivation: Accurate and reproducible morphometry from structural MRI is essential for monitoring neuroanatomical change, but scanner-induced variability and limited reproducibility constrain its use. Method: Segmentation variability is quantified on two datasets (SIMON and SRPBS) using the Dice coefficient, Surface Dice, Hausdorff Distance (HD95), and MAPE. Result: Small subcortical structures such as the amygdala show up to 7-8% volume variation even under controlled test-retest conditions, which calls into question the ability to detect subtle changes of 5-10%. Conclusion: The study proposes surface-based quality filtering to improve segmentation reliability and stresses the need for harmonization strategies in real-world neuroimaging studies. Abstract: Accurate and reproducible brain morphometry from structural MRI is critical for monitoring neuroanatomical changes across time and across imaging domains. Although deep learning has accelerated segmentation workflows, scanner-induced variability and reproducibility limitations remain, especially in longitudinal and multi-site settings. In this study, we benchmark two modern segmentation pipelines, FastSurfer and SynthSeg, both integrated into FreeSurfer, one of the most widely adopted tools in neuroimaging. Using two complementary datasets, a 17-year longitudinal cohort (SIMON) and a 9-site test-retest cohort (SRPBS), we quantify inter-scan segmentation variability using Dice coefficient, Surface Dice, Hausdorff Distance (HD95), and Mean Absolute Percentage Error (MAPE). Our results reveal up to 7-8% volume variation in small subcortical structures such as the amygdala and ventral diencephalon, even under controlled test-retest conditions. This raises a key question: is it feasible to detect subtle longitudinal changes on the order of 5-10% in pea-sized brain regions, given the magnitude of domain-induced morphometric noise? We further analyze the effects of registration templates and interpolation modes, and propose surface-based quality filtering to improve segmentation reliability. This study provides a reproducible benchmark for morphometric reproducibility and emphasizes the need for harmonization strategies in real-world neuroimaging studies. Code and figures: https://github.com/kondratevakate/brain-mri-segmentation
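
Two of the variability metrics named above can be sketched in a few lines. This is a generic illustration, not the authors' benchmark code (which is linked in the abstract): masks are represented as sets of voxel coordinates, MAPE is computed against a reference volume per structure, and the example volumes are made-up numbers chosen only to show the arithmetic.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two segmentations given as sets of voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def mape(reference, measured):
    """Mean absolute percentage error between paired measurements, in percent."""
    errors = [abs(m - r) / abs(r) for r, m in zip(reference, measured)]
    return 100.0 * sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical test-retest masks for one small structure.
    scan_1 = {(0, 0, 0), (0, 0, 1)}
    scan_2 = {(0, 0, 1), (0, 1, 1)}
    print(dice_coefficient(scan_1, scan_2))  # 0.5

    # Hypothetical volumes (mm^3) of two structures on scan and rescan.
    print(mape([1000.0, 2000.0], [1070.0, 1900.0]))  # about 6.0
```

A volume error of 7% on the first structure in this example is on the same order as the 7-8% test-retest variation the abstract reports, which illustrates why a true 5-10% longitudinal change would be hard to separate from measurement noise.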

[50] Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

Wang Lin,Liyu Jia,Wentao Hu,Kaihang Pan,Zhongqi Yue,Wei Zhao,Jingyuan Chen,Fei Wu,Hanwang Zhang

Main category: cs.CV

TL;DR: Proposes Phys-AR, a method that combines symbolic reasoning with reinforcement learning to enforce physical consistency in video generation, using the Diffusion Timestep Tokenizer (DDT) and a two-stage training framework to produce videos that follow physical laws.

Details Motivation: Existing diffusion-based video generation methods struggle with unseen physical conditions (e.g., velocity) because they rely on data-driven approximations. Method: The Phys-AR framework (1) uses DDT to learn discrete, recursive visual tokens that support symbolic reasoning, and (2) applies two-stage training (supervised fine-tuning followed by reinforcement learning) to optimize physical consistency. Result: Experiments show that Phys-AR can generate physically consistent videos. Conclusion: By combining symbolic reasoning and reinforcement learning, Phys-AR addresses physical consistency in video generation. Abstract: Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. The recursive visual tokens enable symbolic reasoning by a large language model. Based on it, we propose the Phys-AR framework, which consists of two stages: The first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.

[51] FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

Zebin Yao,Lei Ren,Huixing Jiang,Chen Wei,Xiaojie Wang,Ruifan Li,Fangxiang Feng

Main category: cs.CV

TL;DR: FreeGraftor is a training-free image generation framework that uses cross-image feature grafting for efficient, high-fidelity subject-driven image generation.

Details Motivation: Existing methods trade off subject consistency against efficiency: tuning-based methods are slow and resource-intensive, while zero-shot methods fail to preserve subject consistency. Method: Semantic matching and position-constrained attention fusion, combined with a noise initialization strategy, transfer visual details across images. Result: The method clearly outperforms existing zero-shot and training-free approaches in subject fidelity and text alignment, and extends to multi-subject generation. Conclusion: FreeGraftor requires no fine-tuning or additional training and offers an efficient, high-fidelity solution for practical use. Abstract: Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance, yet existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image. Additionally, our framework incorporates a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.

[52] Efficient Adaptation of Deep Neural Networks for Semantic Segmentation in Space Applications

Leonardo Olivi,Edoardo Santero Mormile,Enzo Tartaglione

Main category: cs.CV

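TL;DR: The paper examines the feasibility of using adapters for efficient transfer learning for rock segmentation on lunar and Martian terrain, and proposes two memory-saving strategies.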

Details Motivation: Address the scarcity of labeled data in extraterrestrial environments while reducing bandwidth and memory requirements on the target device. Method: Integrates adapters into a pre-trained backbone model and applies two strategies: layer fusion and adapter ranking. Result: Adapters reduce bandwidth and memory requirements; the trade-offs among task performance, memory, and computation are evaluated on embedded devices. Conclusion: The study offers a new approach to efficient transfer learning in extraterrestrial environments and points to directions for future research. Abstract: In recent years, the application of Deep Learning techniques has shown remarkable success in various computer vision tasks, paving the way for their deployment in extraterrestrial exploration. Transfer learning has emerged as a powerful strategy for addressing the scarcity of labeled data in these novel environments. This paper represents one of the first efforts in evaluating the feasibility of employing adapters toward efficient transfer learning for rock segmentation in extraterrestrial landscapes, mainly focusing on lunar and Martian terrains. Our work suggests that the use of adapters, strategically integrated into a pre-trained backbone model, can be successful in reducing both bandwidth and memory requirements for the target extraterrestrial device. In this study, we considered two memory-saving strategies: layer fusion (to reduce the inference overhead to zero) and an "adapter ranking" (to also reduce the transmission cost). Finally, we evaluate these results in terms of task performance, memory, and computation on embedded devices, evidencing trade-offs that open the road to more research in the field.
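The layer-fusion idea in [52] can be illustrated with a minimal sketch. It assumes, purely for illustration, that the adapter is a linear map applied to the output of a frozen linear layer; the paper's actual adapter design and fusion procedure are not given in the abstract, and all names and shapes below are hypothetical. Under that assumption the adapter folds into the layer's weights once, offline, so inference costs the same as the original layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen backbone layer: y = W x + b (shapes are illustrative assumptions).
W = rng.normal(size=(8, 16))
b = rng.normal(size=8)

# Hypothetical linear adapter applied to the layer output: z = A y + c.
A = np.eye(8) + 0.01 * rng.normal(size=(8, 8))
c = 0.01 * rng.normal(size=8)

def forward_with_adapter(x):
    """Two-step inference: backbone layer followed by the adapter."""
    return A @ (W @ x + b) + c

# Fuse once, offline: A (W x + b) + c = (A W) x + (A b + c).
W_fused = A @ W
b_fused = A @ b + c

def forward_fused(x):
    """Single-layer inference with the adapter folded into the weights."""
    return W_fused @ x + b_fused

x = rng.normal(size=16)
max_diff = np.max(np.abs(forward_with_adapter(x) - forward_fused(x)))
```

The fused weights have the same shape as the original layer, which is why such a fusion adds no inference overhead. Adapters with a nonlinearity between their projections cannot be folded this way.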

[53] MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

Yachun Mi,Yu Li,Weicheng Meng,Chaofeng Chen,Chen Hui,Shaohui Liu

Main category: cs.CV

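TL;DR: MVQA combines the Mamba model with the USDS sampling method for efficient video quality assessment, performing close to SOTA while running 2x as fast and using only 1/5 of the GPU memory.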

Details Motivation: The rapid growth of long-duration, high-definition video makes efficient video quality assessment (VQA) a key challenge, and existing methods struggle to balance efficiency and performance. Method: Proposes MVQA, a Mamba-based model, together with the USDS sampling method, which fuses semantic and distortion information through pre-defined masks so that the dual inputs add no computation. Result: MVQA performs close to SOTA methods while being 2x as fast and needing only 1/5 of the GPU memory. Conclusion: MVQA and USDS offer a new approach to efficient VQA that balances performance and efficiency. Abstract: The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, light-weight Convolutional Neural Networks (CNNs) and Transformers often struggle to balance efficiency with high performance due to the requirement of long-range modeling capabilities. Recently, the state-space model, particularly Mamba, has emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA heavily depends on resampling long sequences to minimize computational costs, yet current resampling methods are often weak in preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos and distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To prevent computation increase from dual inputs, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieves comparable performance to state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$ of the GPU memory.

[54] Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework

Xinyuan Song,Yangfan He,Sida Li,Jianhui Wang,Hongyang He,Xinhang Yuan,Ruoyu Wang,Jiaqi Chen,Keqin Li,Kuan Lu,Menghao Huo,Binxu Li,Pei Liu

Main category: cs.CV

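TL;DR: This paper presents a general theoretical framework for adapters that maintain frame consistency in DDIM-based models, and analyzes its mathematical properties and stability.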

Details Motivation: Video editing requires frame-to-frame consistency; existing adapter methods are effective but lack theoretical support, a gap this paper aims to fill. Method: Inserts learnable modules into a pretrained diffusion model, combines prompt learning with shared and frame-specific tokens, and carries out a theoretical analysis under a temporal consistency loss. Result: Proves that the temporal consistency objective is differentiable and establishes a Lipschitz bound on its gradient, shows that gradient descent converges, and analyzes the stability of the modules in DDIM inversion. Conclusion: The theoretical findings strengthen the reliability of adapter-based diffusion video editing methods and provide theoretical insight for video generation tasks. Abstract: Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-specific tokens are particularly effective in preserving continuity across frames at low training cost. In this work, we provide a general theoretical framework for adapters that maintain frame consistency in DDIM-based models under a temporal consistency loss. First, we prove that the temporal consistency objective is differentiable under bounded feature norms, and we establish a Lipschitz bound on its gradient. Second, we show that gradient descent on this objective decreases the loss monotonically and converges to a local minimum if the learning rate is within an appropriate range. Finally, we analyze the stability of modules in the DDIM inversion procedure, showing that the associated error remains controlled. These theoretical findings reinforce the reliability of diffusion-based video editing methods that rely on adapter strategies and provide theoretical insights for video generation tasks.
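The monotone-decrease claim in [54] can be illustrated on a toy objective. The sketch below assumes a simple quadratic temporal consistency loss, half the sum of squared differences between consecutive frame features; the paper's actual loss is not specified in the abstract, and the function names are hypothetical. For this toy loss the Hessian is a path-graph Laplacian whose eigenvalues are at most 4, so the gradient is Lipschitz with constant L = 4 and a step size of 1/L yields a non-increasing loss.

```python
import numpy as np

def temporal_loss(feats):
    """0.5 * sum_t ||f[t+1] - f[t]||^2 for features of shape (T, d)."""
    diffs = feats[1:] - feats[:-1]
    return 0.5 * np.sum(diffs ** 2)

def temporal_grad(feats):
    """Gradient of temporal_loss with respect to every frame feature."""
    diffs = feats[1:] - feats[:-1]
    grad = np.zeros_like(feats)
    grad[:-1] -= diffs   # contribution of f[t] as the earlier frame of a pair
    grad[1:] += diffs    # contribution of f[t] as the later frame of a pair
    return grad

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))   # 6 frames, 4-dim features (illustrative)
lr = 0.25                         # 1 / L with L = 4 for this toy loss

losses = [temporal_loss(feats)]
for _ in range(50):
    feats = feats - lr * temporal_grad(feats)
    losses.append(temporal_loss(feats))
```

Plotting `losses` shows a steadily shrinking curve. With a step size above 2/L (here 0.5) the same loop can diverge, which is the sense in which the learning rate must lie "within an appropriate range".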

[55] Survey of Video Diffusion Models: Foundations, Implementations, and Applications

Yimu Wang,Xuye Liu,Wei Pang,Li Ma,Shuai Yuan,Paul Debevec,Ning Yu

Main category: cs.CV

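TL;DR: This survey reviews diffusion-based video generation, covering its advantages, challenges, technical foundations, applications, and future directions.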

Details Motivation: Diffusion models show greater potential than traditional methods in video generation but face challenges in motion consistency, computational efficiency, and ethics, which calls for a comprehensive review. Method: Systematically categorizes existing methods, analyzes architectural innovations and optimization strategies, and studies applications in low-level vision tasks. Result: Provides a broader and more up-to-date perspective covering evaluation metrics, industry solutions, and training engineering techniques. Conclusion: The survey gives researchers and practitioners a theoretical framework and practical guidance to advance the field. Abstract: Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial networks-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c) which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more updated, and more fine-grained perspective on diffusion-based approaches with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field.
A structured list of related works involved in this survey is also available on https://github.com/Eyeline-Research/Survey-Video-Diffusion.

[56] PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning

Song Wang,Xiaolu Liu,Lingdong Kong,Jianyun Xu,Chunyong Hu,Gongfan Fang,Wentong Li,Jianke Zhu,Xinchao Wang

Main category: cs.CV

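TL;DR: PointLoRA combines low-rank adaptation (LoRA) with multi-scale token selection to fine-tune point cloud models efficiently, substantially reducing the number of tunable parameters while maintaining performance.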

Details Motivation: As pre-trained models grow more complex, full fine-tuning demands substantial computational and storage resources, while existing parameter-efficient fine-tuning methods rely on complex mechanisms that increase the number of tunable parameters. Method: Embeds LoRA layers in point cloud transformers to reduce tunable parameters, and uses multi-scale token selection to extract key local information as prompts for downstream fine-tuning. Result: Experiments show that PointLoRA achieves competitive performance across multiple pre-trained models and datasets with only 3.43% of the trainable parameters. Conclusion: PointLoRA is a simple, efficient method suited to resource-constrained applications. Abstract: Self-supervised representation learning for point clouds has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: https://github.com/songw-zju/PointLoRA.
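A generic LoRA layer, the building block that [56] embeds in point cloud transformers, can be sketched as follows. This is a plain NumPy illustration of the standard LoRA formulation (a frozen weight plus a trainable low-rank product), not the PointLoRA implementation; the class name, layer width, rank, and initialization scale are assumptions, and the multi-scale token selection is not modeled.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (illustrative)."""

    def __init__(self, weight, rank, rng):
        self.weight = weight                      # frozen, shape (out, in)
        out_dim, in_dim = weight.shape
        # Standard LoRA initialization: A small random, B zero, so the
        # layer starts out identical to the frozen layer.
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))

    def __call__(self, x):
        # Apply the low-rank path as two thin matmuls instead of
        # materializing the full (out, in) update.
        return self.weight @ x + self.B @ (self.A @ x)

    def trainable_fraction(self):
        """Trainable (LoRA) parameters as a fraction of all parameters."""
        lora = self.A.size + self.B.size
        return lora / (self.weight.size + lora)

rng = np.random.default_rng(0)
layer = LoRALinear(rng.normal(size=(384, 384)), rank=8, rng=rng)
x = rng.normal(size=384)
y = layer(x)
fraction = layer.trainable_fraction()
```

For this hypothetical 384-by-384 layer at rank 8, `fraction` is about 4%. The 3.43% reported in the abstract depends on the paper's own architecture and rank choices and is not reproduced by this sketch.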

[57] LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Joya Chen,Ziyun Zeng,Yiqi Lin,Wei Li,Zejun Ma,Mike Zheng Shou

Main category: cs.CV

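TL;DR: This paper proposes training video large language models (Video LLMs) at scale on automatic speech recognition (ASR) transcripts, using streaming training and temporal alignment to substantially lower training cost and improve performance.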

Details Motivation: Current Video LLMs depend on costly human annotation or proprietary model APIs (e.g., GPT-4o), which limits training at scale. This paper explores efficient training with low-cost ASR transcripts. Method: Proposes a streaming training approach that densely interleaves ASR words and video frames by timestamp, and builds the Live-CC-5M and Live-WhisperX-526K datasets to support training. Result: The ASR-pre-trained LiveCC-7B-Base model performs well on video QA and real-time commentary, and the final LiveCC-7B-Instruct model surpasses larger models on multiple benchmarks. Conclusion: The approach demonstrates the potential of ASR transcripts for large-scale Video LLM training and offers a new route to low-cost, efficient training. Abstract: Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally-aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach.
All resources of this paper have been released at https://showlab.github.io/livecc.

[58] Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis

Frank Li,Hari Trivedi,Bardia Khosravi,Theo Dapamede,Mohammadreza Chavoshi,Abdulhameed Dere,Rohan Satya Isaac,Aawez Mansuri,Janice Newsome,Saptarshi Purkayastha,Judy Gichoya

Main category: cs.CV

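TL;DR: The study evaluates three vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on radiology tasks and finds that the pre-training method significantly affects downstream task performance.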

Details Motivation: Explore the potential of foundation models for medical imaging tasks, especially their ability to capture fine-grained features. Method: Evaluates the three models on classification, segmentation, and regression tasks on chest radiographs, and designs a custom segmentation model that combines global and local features. Result: RAD-DINO performs best on segmentation, CheXagent excels at classification, and BiomedCLIP is inconsistent. The custom model substantially improves performance for all three models. Conclusion: The pre-training method strongly influences task performance: models trained without text supervision suit segmentation, while text-supervised models have the edge in classification and interpretability. Abstract: Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.

[59] Vision language models are unreliable at trivial spatial cognition

Sangeet Khemlani,Tyler Tran,Nathaniel Gyory,Anthony M. Harrison,Wallace E. Lawson,Ravenna Thielstrom,Hunter Thompson,Taaren Singh,J. Gregory Trafton

Main category: cs.CV

TL;DR: The reliability of VLMs on spatial cognition tasks is sensitive to minor prompt variations, and their performance is unstable.

Details Motivation: Test how reliable VLMs are at simple spatial cognition tasks (such as judging whether one object is left or right of another) to assess their potential for real-world use. Method: Develops the TableTest benchmark dataset and evaluates mainstream VLMs on 3D scenes. Result: Minor, logically equivalent changes to the prompt can noticeably degrade model performance, revealing limits in spatial relation reasoning. Conclusion: The spatial reasoning of VLMs needs improvement, and the findings point to opportunities for strengthening training data. Abstract: Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability to process relational information. To achieve widespread applicability, VLMs must perform reliably, yielding comparable competence across a wide variety of related tasks. We sought to test how reliable these architectures are at engaging in trivial spatial cognition, e.g., recognizing whether one object is left of another in an uncluttered scene. We developed a benchmark dataset, TableTest, whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs. Results show that performance could be degraded by minor variations of prompts that use logically equivalent descriptions. These analyses suggest limitations in how VLMs may reason about spatial relations in real-world applications. They also reveal novel opportunities for bolstering image caption corpora for more efficient training and testing.

[60] Boosting Generative Image Modeling via Joint Image-Feature Synthesis

Theodoros Kouzelis,Efstathios Karypidis,Ioannis Kakogeorgiou,Spyros Gidaris,Nikos Komodakis

Main category: cs.CV

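TL;DR: Proposes a new generative image modeling framework that combines low-level image latents with high-level semantic features, significantly improving generation quality and training efficiency.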

Details Motivation: Address the difficulty of combining representation learning with generative modeling in latent diffusion models (LDMs) for high-quality image generation. Method: Uses a diffusion model to jointly model low-level image latents from a variational autoencoder and high-level semantic features from a pretrained self-supervised encoder (such as DINO). Result: Substantially improves image quality and training convergence speed in both conditional and unconditional settings. Conclusion: Opens a new direction for representation-aware generative modeling, simplifying training and unlocking a new inference strategy (Representation Guidance). Abstract: Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.

[61] Describe Anything: Detailed Localized Image and Video Captioning

Long Lian,Yifan Ding,Yunhao Ge,Sifei Liu,Hanzi Mao,Boyi Li,Marco Pavone,Ming-Yu Liu,Trevor Darrell,Adam Yala,Yin Cui

Main category: cs.CV

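TL;DR: DAM produces high-resolution region descriptions using a focal prompt and a localized vision backbone, addresses data scarcity with a semi-supervised data pipeline, and achieves the best results on multiple benchmarks.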

Details Motivation: Address the challenge vision-language models face in generating detailed descriptions of specific regions in images and videos. Method: Proposes the DAM model, which combines a focal prompt with a localized vision backbone, and designs DLC-SDP, a semi-supervised data pipeline. Result: Sets a new state of the art on 7 benchmarks. Conclusion: DAM performs strongly on localized captioning and addresses both data scarcity and description accuracy. Abstract: Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets a new state of the art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.

[62] From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Le Zhuo,Liangbing Zhao,Sayak Paul,Yue Liao,Renrui Zhang,Yi Xin,Peng Gao,Mohamed Elhoseiny,Hongsheng Li

Main category: cs.CV

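TL;DR: ReflectionFlow is an inference-time framework that scales along the noise level, prompt level, and reflection level to improve how diffusion models render complex scenes and fine details.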

Details Motivation: Existing text-to-image diffusion models still fall short on complex scenes and fine details; the proposed improvement is inspired by the self-reflection ability of large language models. Method: Proposes the ReflectionFlow framework with noise-level, prompt-level, and reflection-level scaling, and performs reflection tuning on the GenRef dataset. Result: ReflectionFlow significantly outperforms naive noise-level scaling and offers a scalable, compute-efficient route to high-quality image synthesis. Conclusion: Through multi-level scaling and reflection tuning, ReflectionFlow markedly improves the image generation quality of diffusion models. Abstract: Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on a state-of-the-art diffusion transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.

[63] MR. Video: "MapReduce" is the Principle for Long Video Understanding

Ziqi Pang,Yu-Xiong Wang

Main category: cs.CV

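TL;DR: MR. Video is a long video understanding framework based on the MapReduce principle, improving performance by perceiving short clips independently (Map) and aggregating information jointly (Reduce).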

Details Motivation: Overcome the context-length limits of existing sequence-to-sequence vision-language models (VLMs) and the reliance of video agents on key segment selection when processing long videos. Method: Uses two MapReduce stages: (1) a Captioning stage that generates captions for short clips and standardizes them; (2) an Analysis stage that analyzes the user question and integrates a final answer. Result: Improves accuracy on LVBench by more than 10% over existing VLMs and video agents. Conclusion: The MapReduce principle is simple and effective for long video understanding and applies to both VLMs and video agents. Abstract: We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence-to-sequence vision-language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents that typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence parallel perception of short video segments. Its Reduce step allows for more comprehensive context aggregation and reasoning, surpassing explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing repeated characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing relevant information from individual short videos (map), and integrating them into a final answer (reduce). MR. Video achieves over 10% accuracy improvement on the challenging LVBench compared to state-of-the-art VLMs and video agents. Code is available at: https://github.com/ziqipang/MR-Video
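The Map and Reduce steps described in [63] can be sketched in a few lines. This is only a toy illustration of the control flow, not the MR. Video code: `caption_clip` is a hypothetical stand-in for a per-clip VLM call, the clip dictionaries are invented, and the reduce step is a keyword filter rather than LLM reasoning.

```python
from concurrent.futures import ThreadPoolExecutor

def caption_clip(clip):
    """Hypothetical stand-in for a VLM call that describes one short clip."""
    return {"start": clip["start"],
            "caption": f"{clip['subject']} {clip['action']}"}

def map_stage(clips):
    """Map: perceive every short clip independently, in parallel."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the input clips.
        return list(pool.map(caption_clip, clips))

def reduce_stage(captions, question_keyword):
    """Reduce: aggregate over all clips to answer one question, here
    'at which timestamps does the keyword appear?'."""
    relevant = [c for c in captions if question_keyword in c["caption"]]
    relevant.sort(key=lambda c: c["start"])
    return [c["start"] for c in relevant]

clips = [
    {"start": 0, "subject": "chef", "action": "chops onions"},
    {"start": 10, "subject": "dog", "action": "barks"},
    {"start": 20, "subject": "chef", "action": "plates dish"},
]
answer = reduce_stage(map_stage(clips), "chef")
```

Because each clip is captioned independently, the map stage scales with the number of workers rather than with video length, and only the reduce stage needs to see information from the whole video.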

[64] MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Yucheng Li,Huiqiang Jiang,Chengruidong Zhang,Qianhui Wu,Xufang Luo,Surin Ahn,Amir H. Abdi,Dongsheng Li,Jianfeng Gao,Yuqing Yang,Lili Qiu

Main category: cs.CV

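TL;DR: MMInference is a dynamic sparse attention method that accelerates the pre-filling stage for long-context multimodal inputs, improving the efficiency of vision language models.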

Details Motivation: Combining long-context capability with visual understanding holds great potential for vision language models (VLMs), but the quadratic attention complexity of the pre-filling stage hinders real-world deployment. Method: Analysis of the temporal and spatial locality of video input reveals a distinctive Grid sparse pattern; a permutation-based method exploits it and handles modality boundaries. Sparse distributions are constructed dynamically, and optimized GPU kernels are provided. Result: Across multiple multimodal benchmarks, MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Conclusion: MMInference integrates into existing VLM pipelines without model modifications and markedly improves efficiency on long-context multimodal inputs. Abstract: The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle modality boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks (including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH) with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
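Why a permutation helps with a grid-like sparse pattern, as in [64], can be shown on a toy mask. The sketch assumes a simplified pattern in which token i attends to token j only when they are a multiple of a fixed stride apart (for instance, the same spatial position in different frames, if every frame contributes the same number of tokens). This is an assumption made for illustration; the paper's exact pattern definition and GPU kernels are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def grid_mask(n, stride):
    """Boolean attention mask where token i attends to token j iff they are a
    multiple of `stride` apart (a simplified stand-in for a grid pattern)."""
    idx = np.arange(n)
    return (idx[:, None] - idx[None, :]) % stride == 0

def grid_permutation(n, stride):
    """Order tokens by residue modulo `stride`, keeping positions sorted
    within each residue group."""
    return np.argsort(np.arange(n) % stride, kind="stable")

n, stride = 12, 4
mask = grid_mask(n, stride)          # scattered, strided non-zeros
perm = grid_permutation(n, stride)
permuted = mask[np.ix_(perm, perm)]  # same entries, now block-diagonal
```

After the permutation the non-zero entries form dense diagonal blocks, a layout that block-sparse kernels generally handle far better than scattered strided entries. The attention output can be mapped back to the original token order with the inverse permutation.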

cs.GR [Back]

[65] Vision6D: 3D-to-2D Interactive Visualization and Annotation Tool for 6D Pose Estimation

Yike Zhang,Eduardo Davalos,Jack Noble

Main category: cs.GR

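TL;DR: This paper introduces Vision6D, an interactive 3D-to-2D visualization and annotation tool for 6D pose estimation research that offers an intuitive 3D interface and visual cues to help users annotate object poses accurately.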

Details Motivation: Demand for 6D pose estimation in robotics-assisted tasks keeps growing, yet existing tools lack interactivity and intuitiveness; Vision6D aims to fill this gap. Method: Develops an interactive tool for visualizing and annotating 3D objects in 2D scenes, using visual cues and spatial relationships to determine object pose without a known camera-to-object transformation matrix. Result: A user study and comparisons against open-source datasets show that Vision6D produces accurate pose annotations with an intuitive interface. Conclusion: Vision6D is an efficient tool for 6D pose estimation research and is released as open source to support the community. Abstract: Accurate 6D pose estimation has gained more attention over the years for robotics-assisted tasks that require precise interaction with physical objects. This paper presents an interactive 3D-to-2D visualization and annotation tool to support the 6D pose estimation research community. To the best of our knowledge, the proposed work is the first tool that allows users to visualize and manipulate 3D objects interactively on a 2D real-world scene, along with a comprehensive user study. This system supports robust 6D camera pose annotation by providing both visual cues and spatial relationships to determine object position and orientation in various environments. The annotation feature in Vision6D is particularly helpful in scenarios where the transformation matrix between the camera and world objects is unknown, as it enables accurate annotation of these objects' poses using only the camera intrinsic matrix. This capability serves as a foundational step in developing and training advanced pose estimation models across various domains. We evaluate Vision6D's effectiveness on the widely used open-source pose estimation datasets Linemod and HANDAL by comparing the default ground-truth camera poses with manual annotations. A user study was performed to show that Vision6D generates accurate pose annotations via visual cues in an intuitive 3D user interface. This approach aims to bridge the gap between 2D scene projections and 3D scenes, offering an effective way for researchers and developers to solve 6D pose annotation related problems. The software is open-source and publicly available at https://github.com/InteractiveGL/vision6D.

[66] Iris: A Next Generation Digital Pathology Rendering Engine

Ryan Erik Landvater,Ulysses Balis

Main category: cs.GR

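TL;DR: Iris is a high-performance digital pathology rendering system that uses optimized C++ and Vulkan for fast image rendering, markedly improving the efficiency and visual quality of digital pathology.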

Details Motivation: Digital pathology is increasingly important within pathology, but shortcomings in ease of use and image rendering speed limit its wider adoption. Method: Develops the Iris Core rendering engine, written in C++ with Vulkan and combined with rapid tile buffering algorithms, for efficient image processing and rendering. Result: Iris Core buffers an entire new field of view, without overlapping pixels, in 10 ms and adds enhanced detail within 30 ms, far exceeding previously published specifications. Conclusion: Iris markedly improves rendering speed and visual quality in digital pathology and has broad application potential. Abstract: Digital pathology is a tool of rapidly evolving importance within the discipline of pathology. Whole slide imaging promises numerous advantages; however, adoption is limited by challenges in ease of use and speed of high-quality image rendering relative to the simplicity and visual quality of glass slides. We introduce Iris, a new high-performance digital pathology rendering system. Specifically, we outline and detail the performance metrics of Iris Core, the core rendering engine technology. Iris Core comprises machine code modules written from the ground up in C++ and using Vulkan, a low-level and low-overhead cross-platform graphical processing unit application program interface, and our novel rapid tile buffering algorithms. We provide a detailed explanation of Iris Core's system architecture, including the stateless isolation of core processes, interprocess communication paradigms, and explicit synchronization paradigms that provide powerful control over the graphical processing unit. Iris Core achieves slide rendering at the sustained maximum frame rate on all tested platforms and buffers an entire new slide field of view, without overlapping pixels, in 10 ms with enhanced detail in 30 ms. It is able to buffer and compute high-fidelity reduction-enhancements for viewing low-power cytology with increased visual quality at a rate of 100-160 μs per slide tile, and with a cumulative median buffering rate of 1.36 GB of decompressed image data per second. This buffering rate allows for an entirely new field of view to be fully buffered and rendered in less than a single monitor refresh on a standard display, and high detail features within 2-3 monitor refresh frames.
These metrics far exceed previously published specifications, beyond an order of magnitude in some contexts. The system shows no slowing with high use loads, but rather increases performance due to cache mechanisms.

[67] Neural Kinematic Bases for Fluids

Yibo Liu,Paul Kry,Kenny Erleben,Noam Aigerman,Sune Darkner,Teseo Schneider

Main category: cs.GR

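TL;DR: Proposes an MLP-based mesh-free fluid simulation method whose loss functions ensure that the neural bases satisfy physical properties such as orthogonality, divergence-freeness, boundary alignment, and smoothness.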

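Details Motivation: Traditional fluid simulation methods are computationally expensive and struggle to meet real-time animation needs; this work aims to simplify simulation with neural bases while preserving physical properties. Method: Designs loss functions to train neural bases generated by an MLP so that they satisfy physical properties such as orthogonality and divergence-freeness, and uses them to fit an input sketch of a flow. Result: The neural bases adapt to different domains, extend naturally to three dimensions, and support real-time animation. Conclusion: The method uses neural bases to achieve efficient, physically faithful fluid simulation suited to real-time animation. Abstract: We propose mesh-free fluid simulations that exploit a kinematic neural basis for velocity fields represented by an MLP. We design a set of losses that ensures that these neural bases satisfy fundamental physical properties such as orthogonality, divergence-freeness, boundary alignment, and smoothness. Our neural bases can then be used to fit an input sketch of a flow, which will inherit the same fundamental properties from the bases. We then can animate such flow in real-time using standard time integrators. Our neural bases can accommodate different domains and naturally extend to three dimensions.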
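One of the properties that [67] enforces through its losses, divergence-freeness, can be illustrated with a small penalty function. The sketch below is an assumption-laden stand-in: it measures the divergence of a 2D velocity field with central finite differences at sample points and uses analytic fields in place of an MLP. The paper's actual loss formulation is not given in the abstract, and all names are hypothetical.

```python
import numpy as np

def divergence(velocity, points, h=1e-4):
    """Central-difference estimate of du/dx + dv/dy at each 2D point."""
    ex = np.array([h, 0.0])
    ey = np.array([0.0, h])
    du_dx = (velocity(points + ex)[:, 0] - velocity(points - ex)[:, 0]) / (2 * h)
    dv_dy = (velocity(points + ey)[:, 1] - velocity(points - ey)[:, 1]) / (2 * h)
    return du_dx + dv_dy

def divergence_loss(velocity, points):
    """Mean squared divergence; zero for an incompressible field."""
    return float(np.mean(divergence(velocity, points) ** 2))

def swirl(p):
    """Field derived from the stream function psi = sin(x) sin(y):
    u = dpsi/dy, v = -dpsi/dx, which is divergence-free by construction."""
    x, y = p[:, 0], p[:, 1]
    return np.stack([np.sin(x) * np.cos(y), -np.cos(x) * np.sin(y)], axis=1)

def source(p):
    """Radial outflow u = (x, y), whose divergence is 2 everywhere."""
    return p.copy()

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(256, 2))
loss_swirl = divergence_loss(swirl, pts)
loss_source = divergence_loss(source, pts)
```

Here `loss_swirl` is numerically zero while `loss_source` is close to 4, so minimizing such a penalty pushes a learned field toward incompressibility. With an MLP, automatic differentiation would normally replace the finite differences.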

[68] Low-Rank Adaptation of Neural Fields

Anh Truong,Ahmed H. Mahmoud,Mina Konaković Luković,Justin Solomon

Main category: cs.GR

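TL;DR: Proposes a parameter-efficient strategy based on low-rank adaptation (LoRA) for updating neural fields (NFs), applicable to tasks such as image filtering, video compression, and geometry editing.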

Details Motivation: Established graphics techniques (such as normal mapping and video compression) exploit redundancy to encode small changes efficiently, but encoding small changes to neural fields has received little study. Method: Brings LoRA from the parameter-efficient LLM fine-tuning community to neural fields, avoiding large pre-trained models and suiting low-compute hardware. Result: Experiments in image filtering, video compression, and geometry editing validate the method's effectiveness and versatility. Conclusion: The LoRA strategy offers an efficient, general solution for encoding small changes to neural fields. Abstract: Processing visual data often involves small adjustments or sequences of changes, such as in image filtering, surface smoothing, and video storage. While established graphics techniques like normal mapping and video compression exploit redundancy to encode such small changes efficiently, the problem of encoding small changes to neural fields (NF), which are neural network parameterizations of visual or physical functions, has received less attention. We propose a parameter-efficient strategy for updating neural fields using low-rank adaptations (LoRA). LoRA, a method from the parameter-efficient fine-tuning LLM community, encodes small updates to pre-trained models with minimal computational overhead. We adapt LoRA to instance-specific neural fields, avoiding the need for large pre-trained models, yielding a pipeline suitable for low-compute hardware. We validate our approach with experiments in image filtering, video compression, and geometry editing, demonstrating its effectiveness and versatility for representing neural field updates.

[69] SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow

Kenan Tang,Yanhong Li,Yao Qin

Main category: cs.GR

TL;DR: SPICE is a training-free workflow that combines a base diffusion model with a Canny edge ControlNet model to carry out free-form image editing and maintain image quality across more than 100 editing steps.

Details Motivation: Existing prompt-based image editing models perform poorly at local edits, detailed prompt following, and multi-step editing; SPICE aims to address these problems. Method: Combines a base diffusion model with a Canny edge ControlNet model, supports arbitrary resolutions and aspect ratios, and enables free-form editing. Result: On semantic, stylistic, and structural editing tasks, SPICE outperforms existing methods in both quantitative evaluation and user preference. Conclusion: SPICE offers an efficient, flexible solution for image editing and supports further research and artistic exploration. Abstract: Recent prompt-based image editing models have demonstrated impressive prompt-following capability at structural editing tasks. However, existing models still fail to perform local edits, follow detailed editing prompts, or maintain global image quality beyond a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and improves image quality consistently during more than 100 editing steps. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. SPICE outperforms state-of-the-art baselines on a challenging realistic image-editing dataset consisting of semantic editing (object addition, removal, replacement, and background change), stylistic editing (texture changes), and structural editing (action change) tasks. Not only does SPICE achieve the highest quantitative performance according to standard evaluation metrics, but it is also consistently preferred by users over existing image-editing methods. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.

cs.CL

[70] Exploring Compositional Generalization (in ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

William Bruns

Main category: cs.CL

TL;DR: Using the RASP language, the paper proves by construction that a Transformer can reach 100% semantic exact match on the ReCOGS test set, showing that the ReCOGS_pos task does not require a hierarchical solution.

Details Motivation: Studies how Transformer models perform on compositional generalization tasks, especially the structural generalizations on which they perform poorly in the COGS benchmark. Method: Builds a Transformer-equivalent model in the RASP programming language that handles the ReCOGS_pos task with 19 attention-head-compatible flat pattern-matching rules. Result: The model attains 100% semantic exact match on the ReCOGS test set and 100% on all generalization splits except obj_pp_to_subj_pp. Conclusion: The ReCOGS_pos task can be solved with flat rules, without recursive or tree-structured rules, demonstrating the potential of Transformers. Abstract: Humans understand new combinations of words encountered if they are combinations of words recognized from different contexts, an ability called Compositional Generalization. The COGS benchmark (Kim and Linzen, 2020) arXiv:2010.05465 reports 0% accuracy for Transformer models on some structural generalizations. We use (Weiss et al., 2021) arXiv:2106.06981's Restricted Access Sequence Processing (RASP), a Transformer-equivalent programming language, to prove by construction that a Transformer encoder-decoder can perform the semantically equivalent ReCOGS_pos (Wu et al., 2024) arXiv:2303.13716 variant of COGS systematically and compositionally: Our RASP model attains 100% semantic exact match on the ReCOGS test set and 100% SEM on all generalization splits except obj_pp_to_subj_pp which gets 92%. Furthermore, our RASP model shows the ReCOGS_pos task does not require a hierarchical or tree-structured solution: we use word-level tokens with an "embedding" layer that tags with possible parts of speech, applying just once per encoder pass 19 attention-head compatible flat pattern-matching rules, shown using grammar coverage (Zeller et al., 2023) to be learnable from the training data, plus general prepositional phrase (pp) handling and sentential complement (cp) handling logic, and output the next logical form (LF) token (repeating until the LF is complete). The model does not apply recursive, tree-structured rules like 'np_det pp np -> np_pp -> np', but scores 100% semantic and string exact match on pp recursion, cp recursion using the decoder loop.

[71] Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection

Myrthe Reuver,Indira Sen,Matteo Melis,Gabriella Lapesa

Main category: cs.CL

TL;DR: Investigates collaboration between sexism researchers and large language models (LLMs), using a four-component pipeline to assess how LLMs perform in sexism research.

Details Motivation: Explores the potential of LLMs in sexism research and the effect of expert-AI collaboration. Method: (1) Experts answer questions; (2) they take part in two interactive experiments (assessing the LLM's knowledge and defining sexism); (3) zero-shot classification experiments follow. Result: LLM-generated definitions are longer and more complex, and expert-written definitions perform worse on average, but some experts improve classification performance with co-created definitions. Conclusion: LLMs show potential for sexism research, and expert collaboration can improve results, including for experts inexperienced with LLMs. Abstract: This paper investigates hybrid intelligence and collaboration between researchers of sexism and Large Language Models (LLMs), with a four-component pipeline. First, nine sexism researchers answer questions about their knowledge of sexism and of LLMs. They then participate in two interactive experiments involving an LLM (GPT3.5). The first experiment has experts assessing the model's knowledge about sexism and suitability for use in research. The second experiment tasks them with creating three different definitions of sexism: an expert-written definition, an LLM-written one, and a co-created definition. Lastly, zero-shot classification experiments use the three definitions from each expert in a prompt template for sexism detection, evaluating GPT4o on 2,500 texts sampled from five sexism benchmarks. We then analyze the resulting 67,500 classification decisions. The LLM interactions lead to longer and more complex definitions of sexism. Expert-written definitions on average perform poorly compared to LLM-generated definitions. However, some experts do improve classification performance with their co-created definitions of sexism, including experts who are inexperienced in using LLMs.

[72] Trillion 7B Technical Report

Sungjun Han,Juyoung Suk,Suyeong An,Hyungguk Kim,Kyuseok Kim,Wonsuk Yang,Seungtaek Choi,Jamin Shin

Main category: cs.CL

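TL;DR: Trillion-7B is a token-efficient, Korean-centric multilingual LLM that achieves competitive performance through its XLDA mechanism and optimized data mixtures, dedicating only 10% of its training tokens to multilingual data and requiring comparatively little compute.

Details Motivation: Addresses the high resource cost and inefficient knowledge transfer in training multilingual LLMs. Method: Uses a Cross-lingual Document Attention (XLDA) mechanism, optimized data mixtures, language-specific filtering, and a tailored tokenizer. Result: Performs strongly across 27 benchmarks, with notable cross-lingual consistency. Conclusion: Trillion-7B demonstrates an efficient approach to multilingual model training, with strong performance and low resource consumption. Abstract: We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.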
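The budget figures quoted in the abstract can be unpacked with a few lines of arithmetic. The derived quantities below (multilingual token count and implied cost per GPU hour) are computed from the abstract's numbers and are not stated by the authors; the variable names are illustrative.

```python
# Figures quoted in the abstract.
total_tokens = 2e12          # 2T training tokens
multilingual_share = 0.10    # 10% of tokens are multilingual data
gpu_hours = 59_400           # 59.4K H100 GPU hours
total_cost_usd = 148_000     # $148K

# Derived quantities (not stated in the abstract).
multilingual_tokens = total_tokens * multilingual_share
cost_per_gpu_hour = total_cost_usd / gpu_hours

print(f"multilingual tokens: {multilingual_tokens:.2e}")
print(f"implied cost per H100 GPU hour: ${cost_per_gpu_hour:.2f}")
```

Under these figures, roughly 200B tokens are multilingual, and the quoted cost corresponds to a rate of about $2.49 per GPU hour.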

[73] Feeding LLM Annotations to BERT Classifiers at Your Own Risk

Yucheng Lu,Kazimier Smith

Main category: cs.CL

TL;DR: Fine-tuning small encoder models on LLM-generated labels is popular for text classification, but the study finds that training on such synthetic labels causes performance degradation, instability, and premature performance plateaus, so it should be used with caution in high-stakes tasks.

Details Motivation: Examines the reliability of fine-tuning small models on LLM-generated labels for text classification, particularly in high-stakes applications. Method: Empirically compares models trained on synthetic labels with models trained on gold labels, and analyzes the results through error propagation. Result: Training on synthetic labels lowers accuracy and F1 score, makes training unstable, and causes performance to plateau early. Conclusion: Fine-tuning on LLM-generated labels calls for caution; entropy-based filtering and ensemble techniques mitigate the problems but cannot fully eliminate the risk. Abstract: Using LLM-generated labels to fine-tune smaller encoder-only models for text classification has gained popularity in various settings. While this approach may be justified in simple and low-stakes applications, we conduct empirical analysis to demonstrate how the perennial curse of training on synthetic data manifests itself in this specific setup. Compared to models trained on gold labels, we observe not only the expected performance degradation in accuracy and F1 score, but also increased instability across training runs and premature performance plateaus. These findings cast doubts on the reliability of such approaches in real-world applications. We contextualize the observed phenomena through the lens of error propagation and offer several practical mitigation strategies, including entropy-based filtering and ensemble techniques. Although these heuristics offer partial relief, they do not fully resolve the inherent risks of propagating non-random errors from LLM annotations to smaller classifiers, underscoring the need for caution when applying this workflow in high-stakes text classification tasks.
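The abstract mentions entropy-based filtering without giving details, so the following is only one plausible reading: drop LLM-annotated examples whose label distribution is too uncertain. The per-example probabilities, the threshold, and the function names are all hypothetical.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability labels contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_by_entropy(examples, max_entropy):
    # Keep only examples whose label distribution is confident enough.
    return [ex for ex in examples if entropy(ex["probs"]) <= max_entropy]

# Hypothetical LLM annotations with a probability per class, for instance
# estimated from repeated sampling of the annotator model.
examples = [
    {"text": "a", "probs": [0.98, 0.02]},
    {"text": "b", "probs": [0.55, 0.45]},
    {"text": "c", "probs": [0.90, 0.10]},
]

kept = filter_by_entropy(examples, max_entropy=0.4)
print([ex["text"] for ex in kept])
```

A filter like this removes uncertain labels but cannot catch confident, systematic errors, which is consistent with the abstract's point that such heuristics offer only partial relief.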

[74] Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

Tyler A. Chang,Benjamin K. Bergen

Main category: cs.CL

TL;DR: Studies bigram subnetworks, minimal subnetworks for next-token prediction in Transformer language models, and finds that they exist in fully trained models and are critical to performance even though they make up less than 0.2% of the parameters.

Details Motivation: Seeks the minimal transformation from current-token embeddings to next-token predictions in language models, to understand how the models fundamentally work. Method: Identifies and analyzes bigram subnetworks, which predict the next token based only on the current token. Result: Bigram subnetworks are found in models up to 1B parameters, are concentrated in the first MLP layer, overlap with optimally pruned subnetworks, and are critical to model performance. Conclusion: Bigram subnetworks form a minimal subset of parameters needed for basic next-token prediction and offer a new way to study language model circuits. Abstract: In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying language model circuits by building up from a minimal circuit rather than the traditional approach of ablating circuits from a full model.
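To make "bigram prediction" concrete, here is a count-based bigram predictor that chooses the next token from the current token alone. It illustrates the target behavior only; it is not the paper's procedure for finding subnetworks inside a Transformer, and the toy corpus is made up.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # counts[current][next] = how often `next` follows `current`.
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return counts

def predict_next(counts, token):
    # Most frequent follower of `token`, or None if it was never followed.
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

The prediction ignores everything before the current token, which is what makes bigram prediction a useful minimal baseline for the current-to-next-token transformation.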

[75] Speculative Sampling via Exponential Races

Szymon Kobus,Deniz Gündüz

Main category: cs.CL

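TL;DR: By connecting speculative decoding to channel simulation, the paper provides an information-theoretic analysis, derives a relation between generation speed-up and the number of tokens $k$ generated by the draft model, and proposes a new speculative decoding method, ERSD.

Details Motivation: Explores the theoretical connection between speculative decoding and channel simulation in order to analyze the speed-up potential of speculative decoding on an information-theoretic basis. Method: Establishes the link between speculative decoding and channel simulation, derives the relation between speed-up and the token count $k$, and proposes the new method ERSD. Result: Derives an explicit relation between generation speed-up and $k$, and shows that ERSD matches state-of-the-art performance. Conclusion: Provides theoretical grounding for speculative decoding and proposes an efficient new method, ERSD. Abstract: Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims at simulating a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that can be achieved by speculative decoding. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens $k$ generated by the draft model for large $k$, which serves as an upper bound for all $k$. We also propose a novel speculative decoding method via exponential races, ERSD, that matches state-of-the-art performance.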
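The basic sampling primitive behind an exponential race can be sketched briefly: give each outcome i an arrival time E_i / p_i with E_i drawn from Exp(1), and return the earliest arrival, which is distributed according to p. The sketch shows only this primitive; how ERSD couples the draft and target models is not described in the abstract and is not reproduced here.

```python
import random

def exponential_race_sample(probs, rng):
    # Each outcome i "arrives" at time E_i / p_i, E_i ~ Exp(1), i.e. an
    # exponential with rate p_i. The winner of the race is a sample from probs.
    best_index, best_time = None, float("inf")
    for i, p in enumerate(probs):
        if p <= 0:
            continue  # zero-probability outcomes never arrive
        t = rng.expovariate(1.0) / p
        if t < best_time:
            best_index, best_time = i, t
    return best_index

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
n = 20000
counts = [0] * len(probs)
for _ in range(n):
    counts[exponential_race_sample(probs, rng)] += 1
freqs = [c / n for c in counts]
print(freqs)
```

Because the same exponential draws can be reused against two different distributions, races of this kind are a natural tool for producing coupled samples, which is presumably why they are useful in the draft-and-verify setting.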

[76] SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation

Keqi Deng,Wenxi Chen,Xie Chen,Philip C. Woodland

Main category: cs.CL

TL;DR: Proposes SimulS2S-LLM, which trains speech LLMs offline and applies a test-time policy to achieve simultaneous speech-to-speech translation with a better quality-latency trade-off.

Details Motivation: Addresses the training-inference mismatch that arises in streaming speech translation because speech is prepended as a prompt. Method: Trains speech LLMs offline, extracts boundary-aware speech prompts, and designs an incremental beam search for predicting speech tokens. Result: On CVSS speech data, SimulS2S-LLM improves ASR-BLEU by 3 points over existing methods at similar latency. Conclusion: SimulS2S-LLM effectively improves the quality-latency trade-off of simultaneous speech translation. Abstract: Simultaneous speech translation (SST) outputs translations in parallel with streaming speech input, balancing translation quality and latency. While large language models (LLMs) have been extended to handle the speech modality, streaming remains challenging as speech is prepended as a prompt for the entire generation process. To unlock LLM streaming capability, this paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM alleviates the mismatch between training and inference by extracting boundary-aware speech prompts that allow it to be better matched with text input data. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder. An incremental beam search is designed to expand the search space of speech token prediction without increasing latency. Experiments on the CVSS speech data show that SimulS2S-LLM offers a better translation quality-latency trade-off than existing methods that use the same training data, such as improving ASR-BLEU scores by 3 points at similar latency.

[77] The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

Minghao Wu,Weixuan Wang,Sinuo Liu,Huifeng Yin,Xintong Wang,Yu Zhao,Chenyang Lyu,Longyue Wang,Weihua Luo,Kaifu Zhang

Main category: cs.CL

TL;DR: Analyzes more than 2,000 multilingual benchmarks and finds that English remains overrepresented and that translated and localized benchmarks differ markedly in how well they correlate with human judgments.

Details Motivation: Promotes equitable multilingual evaluation and exposes the limitations of current multilingual benchmarks. Method: Examines more than 2,000 multilingual benchmarks from 148 countries and compares benchmark performance with human judgments. Result: English is overrepresented, localized benchmarks align better with human judgments than translated ones, and STEM-related tasks correlate more strongly with human evaluations than traditional NLP tasks. Conclusion: Proposes guiding principles and research directions for multilingual benchmarking and calls for a global collaborative effort to develop benchmarks closer to real-world use. Abstract: As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose the guiding principles accordingly for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field.
Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.

[78] IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property

Qiyao Wang,Guhong Chen,Hongbo Wang,Huaren Liu,Minghui Zhu,Zhifei Qin,Linwei Li,Yilin Yue,Shiqiang Wang,Jiayan Li,Yihang Wu,Ziqiang Liu,Longze Chen,Run Luo,Liyang Fan,Jiaming Li,Lei Zhang,Kan Xu,Hongfei Lin,Hamid Alinejad-Rokny,Shiwen Ni,Yuan Lin,Min Yang

Main category: cs.CL

TL;DR: Introduces the first comprehensive IP task taxonomy and IPBench, a large-scale bilingual benchmark for evaluating LLMs in real-world intellectual property applications, and finds substantial room for improvement in existing models.

Details Motivation: The intellectual property domain is complex and knowledge-intensive, and existing datasets and benchmarks do not fully cover real-world scenarios, so a more comprehensive evaluation tool is needed. Method: Introduces the IPBench benchmark, covering 8 IP mechanisms and 20 tasks, and evaluates 16 LLMs. Result: The best model reaches only 75.8% accuracy; open-source IP and law-oriented models lag behind closed-source general-purpose models. Conclusion: IPBench fills a gap in evaluation tools for the IP domain and will be updated to better reflect real-world challenges. Abstract: Intellectual Property (IP) is a unique domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. As large language models (LLMs) continue to advance, they show great potential for processing IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks either focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce the first comprehensive IP task taxonomy and a large, diverse bilingual benchmark, IPBench, covering 8 IP mechanisms and 20 tasks. This benchmark is designed to evaluate LLMs in real-world intellectual property applications, encompassing both understanding and generation. We benchmark 16 LLMs, ranging from general-purpose to domain-specific models, and find that even the best-performing model achieves only 75.8% accuracy, revealing substantial room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. We publicly release all data and code of IPBench and will continue to update it with additional IP-related tasks to better reflect real-world challenges in the intellectual property domain.

[79] Compass-V2 Technical Report

Sophia Maria

Main category: cs.CL

TL;DR: Compass-v2 is a lightweight Mixture-of-Experts model designed for Southeast Asian languages and e-commerce applications, achieving strong performance at low cost through high-quality datasets and a hybrid reasoning model.

Details Motivation: Addresses the shortcomings of LLMs dominated by high-resource languages with respect to low-resource Southeast Asian languages and the e-commerce domain. Method: Designs an MoE model with 30B total and 5B active parameters that combines fine-grained and shared expert modules, builds high-quality datasets, and develops a hybrid reasoning model. Result: Achieves state-of-the-art Southeast Asian multilingual and e-commerce performance among sub-30B models while significantly lowering inference cost. Conclusion: Compass-v2 fills a gap for low-resource languages and the e-commerce domain with an efficient, economical solution. Abstract: Predominant LLMs focus on high-resource languages while leaving low-resource languages, particularly those in Southeast Asia (SEA), underrepresented. In addition, those models are general-purpose and pay limited attention to the e-commerce domain. To overcome these limitations, we introduce Compass-v2, a lightweight Mixture-of-Experts (MoE) model specifically designed for Southeast Asian languages and e-commerce applications. To balance model performance and inference cost, the model is designed with 30B total parameters and 5B active parameters, incorporating both fine-grained and shared expert modules. To enhance multilingual performance, we curated and constructed a high-quality, industry-leading SEA dataset, to the best of our knowledge. To boost performance in the e-commerce domain, we built a dataset comprising hundreds of billions of tokens, sourced through external data mining and internal platform collection. Besides, we pioneered a hybrid reasoning model that supports both fast thinking and deep thinking within a unified framework to enhance the reasoning capabilities, diverging from the conventional industry practice of deploying two separate models. Through extensive experimental evaluations, our model demonstrates state-of-the-art SEA multilingual and e-commerce performance among sub-30B models, while maintaining significantly lower inference cost.

[80] llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length

Issa Sugiura,Kouta Nakayama,Yusuke Oda

Main category: cs.CL

TL;DR: Presents llm-jp-modernbert, a ModernBERT model pretrained on a Japanese corpus with a long context length (8192 tokens); it performs well on fill-mask tests but does not surpass existing baselines on downstream tasks.

Details Motivation: Explores pretraining encoder models on large-scale corpora with long contexts, an area less studied than decoder models. Method: Pretrains a ModernBERT model on a publicly available large-scale Japanese corpus, extends the context length to 8192 tokens, and analyzes the effects with fill-mask tests and pseudo-perplexity experiments. Result: The model performs well on fill-mask tests but does not surpass baselines on downstream tasks; experiments examine the effect of context length expansion and the evolution of sentence embeddings. Conclusion: llm-jp-modernbert supports the development of long-context BERT, and the model and code are released for reproducibility. Abstract: Encoder-only transformer models like BERT are widely adopted as a pre-trained backbone for tasks like sentence classification and retrieval. However, pretraining of encoder models with large-scale corpora and long contexts has been relatively underexplored compared to decoder-only transformers. In this work, we present llm-jp-modernbert, a ModernBERT model trained on a publicly available, massive Japanese corpus with a context length of 8192 tokens. While our model does not surpass existing baselines on downstream tasks, it achieves good results on fill-mask test evaluations. We also analyze the effect of context length expansion through pseudo-perplexity experiments. Furthermore, we investigate sentence embeddings in detail, analyzing their transitions during training and comparing them with those from other existing models, confirming similar trends with models sharing the same architecture. To support reproducibility and foster the development of long-context BERT, we release our model, along with the training and evaluation code.
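Pseudo-perplexity for a masked language model is commonly defined as the exponential of the negative mean log-probability of each token when that token alone is masked. The sketch below assumes this common definition (the abstract does not spell out the paper's exact setup) and uses made-up per-token log-probabilities in place of real model outputs.

```python
import math

def pseudo_perplexity(masked_token_log_probs):
    # masked_token_log_probs[i] is log P(token_i | all other tokens), obtained
    # by masking position i alone and scoring the original token there.
    n = len(masked_token_log_probs)
    return math.exp(-sum(masked_token_log_probs) / n)

# Hypothetical log-probabilities for a four-token sentence.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = pseudo_perplexity(log_probs)
print(f"pseudo-perplexity: {ppl:.4f}")
```

Lower values mean the model recovers masked tokens more confidently, so comparing the score across context lengths is one way to probe whether longer context helps.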

[81] LLM-based Semantic Augmentation for Harmful Content Detection

Elyas Meguellati,Assaad Zeghina,Shazia Sadiq,Gianluca Demartini

Main category: cs.CL

TL;DR: Proposes using large language models (LLMs) for text preprocessing and semantic augmentation to improve performance on complex social media tasks such as propaganda detection and hateful content classification, approaching the results of human-annotated data at lower cost.

Details Motivation: LLMs perform well on simple text classification but poorly on complex social media tasks, and existing work focuses on generating synthetic data while overlooking the potential of LLMs for text preprocessing and semantic augmentation. Method: Prompts LLMs to clean noisy text and provide context-rich explanations, improving training-set quality without substantially increasing data volume; evaluated systematically on multiple datasets. Result: Zero-shot LLM classification underperforms on high-context tasks, but with LLM-based semantic augmentation performance approaches that of methods relying on human-annotated data, at lower cost. Conclusion: Strategically integrating LLMs into machine learning pipelines matters for social media classification and offers new directions for combating harmful content online. Abstract: Recent advances in large language models (LLMs) have demonstrated strong performance on simple text classification tasks, frequently under zero-shot settings. However, their efficacy declines when tackling complex social media challenges such as propaganda detection, hateful meme classification, and toxicity identification. Much of the existing work has focused on using LLMs to generate synthetic training data, overlooking the potential of LLM-based text preprocessing and semantic augmentation. In this paper, we introduce an approach that prompts LLMs to clean noisy text and provide context-rich explanations, thereby enhancing training sets without substantial increases in data volume. We systematically evaluate on the SemEval 2024 multi-label Persuasive Meme dataset and further validate on the Google Jigsaw toxic comments and Facebook hateful memes datasets to assess generalizability. Our results reveal that zero-shot LLM classification underperforms on these high-context tasks compared to supervised models. In contrast, integrating LLM-based semantic augmentation yields performance on par with approaches that rely on human-annotated data, at a fraction of the cost. These findings underscore the importance of strategically incorporating LLMs into machine learning (ML) pipelines for social media classification tasks, offering broad implications for combating harmful content online.

[82] Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

Yuxin Jiang,Yufei Wang,Chuhan Wu,Xinyi Dai,Yan Xu,Weinan Gan,Yasheng Wang,Xin Jiang,Lifeng Shang,Ruiming Tang,Wei Wang

Main category: cs.CL

TL;DR: WebR is an automated framework that synthesizes high-quality instruction-tuning data directly from raw web documents without relying on seed data quality or strong assumptions, significantly improving LLMs' instruction-following ability.

Details Motivation: Existing automatic data synthesis methods depend on seed data quality or strong assumptions, which limits the generation of high-quality instruction-response pairs. Method: Proposes the WebR framework, which synthesizes data from raw web documents through a dual-perspective paradigm (Web as Instruction and Web as Response). Result: Data generated by WebR outperforms existing methods by up to 16.65% across four instruction-following benchmarks and shows better compatibility, data efficiency, and scalability. Conclusion: WebR offers an efficient, scalable solution for synthesizing high-quality instruction-tuning data. Abstract: The improvement of LLMs' instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm--Web as Instruction and Web as Response--where each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at https://github.com/YJiangcm/WebR.

[83] Exploring Next Token Prediction in Theory of Mind (ToM) Tasks: Comparative Experiments with GPT-2 and LLaMA-2 AI Models

Pavan Yadav,Nikhil Khandalkar,Krishna Shinde,Lokesh B. Ramegowda,Rajarshi Das

Main category: cs.CL

TL;DR: Compares GPT-2 and Llama-2-7b-chat-hf on Theory of Mind tasks, finding that added contextual complexity slightly reduces prediction accuracy and that Llama-2 performs better, especially at low temperatures.

Details Motivation: Evaluates how language models perform on Theory of Mind tasks and how contextual complexity and temperature settings affect their predictions. Method: Builds an augmented dataset from Theory of Mind stories and tests the models across temperatures and reasoning levels. Result: Llama-2 outperforms GPT-2; added contextual complexity lowers accuracy; model predictions vary more on higher-order reasoning tasks. Conclusion: Model architecture, temperature, and contextual complexity influence next-token prediction, revealing strengths and limitations of current language models. Abstract: Language models have made significant progress in generating coherent text and predicting next tokens based on input prompts. This study compares the next-token prediction performance of two well-known models: OpenAI's GPT-2 and Meta's Llama-2-7b-chat-hf on Theory of Mind (ToM) tasks. To evaluate their capabilities, we built a dataset from 10 short stories sourced from the Explore ToM Dataset. We enhanced these stories by programmatically inserting additional sentences (infills) using GPT-4, creating variations that introduce different levels of contextual complexity. This setup enables analysis of how increasing context affects model performance. We tested both models under four temperature settings (0.01, 0.5, 1.0, 2.0) and evaluated their ability to predict the next token across three reasoning levels. Zero-order reasoning involves tracking the state, either current (ground truth) or past (memory). First-order reasoning concerns understanding another's mental state (e.g., "Does Anne know the apple is salted?"). Second-order reasoning adds recursion (e.g., "Does Anne think that Charles knows the apple is salted?"). Our results show that adding more infill sentences slightly reduces prediction accuracy, as added context increases complexity and ambiguity. Llama-2 consistently outperforms GPT-2 in prediction accuracy, especially at lower temperatures, demonstrating greater confidence in selecting the most probable token. As reasoning complexity rises, model responses diverge more. Notably, GPT-2 and Llama-2 display greater variability in predictions during first- and second-order reasoning tasks.
These findings illustrate how model architecture, temperature, and contextual complexity influence next-token prediction, contributing to a better understanding of the strengths and limitations of current language models.
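The effect of the four temperature settings can be seen in a minimal sketch of temperature-scaled softmax, the standard way logits are turned into a sampling distribution. The logits are invented for illustration, and it is an assumption that the study's decoding pipeline applies temperature in exactly this form.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
temperatures = (0.01, 0.5, 1.0, 2.0)  # the settings listed in the abstract
top_probs = [softmax_with_temperature(logits, t)[0] for t in temperatures]
for t, p in zip(temperatures, top_probs):
    print(f"T={t}: P(top token)={p:.4f}")
```

At a temperature of 0.01 nearly all probability mass sits on the top token, so sampling is close to greedy decoding; at 2.0 the distribution flattens and sampled tokens vary more.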

[84] Exploiting Contextual Knowledge in LLMs through V-usable Information based Layer Enhancement

Xiaowei Yuan,Zhao Yang,Ziyang Huang,Yequan Wang,Siqi Fan,Yiming Ju,Jun Zhao,Kang Liu

Main category: cs.CL

TL;DR: Proposes CaLE, a new method that improves context-faithful generation by enhancing the use of contextual knowledge within LLMs' internal representations.

Details Motivation: Existing methods overlook how contextual information is processed within LLMs' internal states, leaving LLMs unable to fully exploit contextual knowledge. Method: Proposes Context-aware Layer Enhancement (CaLE), which uses V-usable information analysis to amplify contextual information at an optimal layer, enriching the final-layer representations. Result: Experiments show that CaLE improves context-faithful generation in question-answering tasks, particularly with unknown or conflicting contextual knowledge. Conclusion: By optimizing internal representations, CaLE markedly improves LLMs' context-faithful generation. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet they often struggle with context-faithful generation that properly reflects contextual knowledge. While existing approaches focus on enhancing the decoding strategies, they ignore the fundamental mechanism of how contextual information is processed within LLMs' internal states. As a result, LLMs remain limited in their ability to fully leverage contextual knowledge. In this paper, we propose Context-aware Layer Enhancement (CaLE), a novel intervention method that enhances the utilization of contextual knowledge within LLMs' internal representations. By employing V-usable information analysis, CaLE strategically amplifies the growth of contextual information at an optimal layer, thereby enriching representations in the final layer. Our experiments demonstrate that CaLE effectively improves context-faithful generation in Question-Answering tasks, particularly in scenarios involving unknown or conflicting contextual knowledge.

[85] Cost-Effective Text Clustering with Large Language Models

Hongtao Wang,Taiyan Zhang,Renchi Yang,Jianliang Xu

Main category: cs.CL

TL;DR: TECL is a cost-effective framework that uses LLM feedback to achieve accurate text clustering within a limited query budget.

Details Motivation: Addresses the computational and financial overhead that large numbers of LLM queries impose on text clustering. Method: Uses EdgeLLM or TriangleLLM to construct constraints on text pairs and generates clusters with weighted constrained clustering. Result: On multiple benchmark datasets, TECL considerably outperforms existing methods at the same query cost. Conclusion: TECL is an efficient and economical solution for text clustering. Abstract: Text clustering aims to automatically partition a collection of text documents into distinct clusters based on linguistic features. In the literature, this task is usually framed as metric clustering based on text embeddings from pre-trained encoders or a graph clustering problem upon pairwise similarities from an oracle, e.g., a large ML model. Recently, large language models (LLMs) bring significant advancement in this field by offering contextualized text embeddings and highly accurate similarity scores, but meanwhile, present grand challenges to cope with substantial computational and/or financial overhead caused by numerous API-based queries or inference calls to the models. In response, this paper proposes TECL, a cost-effective framework that taps into the feedback from LLMs for accurate text clustering within a limited budget of queries to LLMs. Under the hood, TECL adopts our EdgeLLM or TriangleLLM to construct must-link/cannot-link constraints for text pairs, and further leverages such constraints as supervision signals input to our weighted constrained clustering approach to generate clusters. Particularly, EdgeLLM (resp. TriangleLLM) enables the identification of informative text pairs (resp. triplets) for querying LLMs via well-thought-out greedy algorithms and accurate extraction of pairwise constraints through carefully-crafted prompting techniques. Our experiments on multiple benchmark datasets exhibit that TECL consistently and considerably outperforms existing solutions in unsupervised text clustering under the same query cost for LLMs.
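To show what must-link and cannot-link constraints look like as supervision, the sketch below merges must-linked documents with a union-find structure and reports any cannot-link pair that ends up in the same group. It treats the constraints as hard rules, unlike TECL's weighted constrained clustering, and the constraint pairs are invented rather than produced by an LLM.

```python
def find(parent, x):
    # Find the representative of x's group, with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_with_constraints(n, must_link, cannot_link):
    # Merge every must-link pair, then list cannot-link pairs that were
    # nonetheless placed in the same group.
    parent = list(range(n))
    for a, b in must_link:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    labels = [find(parent, i) for i in range(n)]
    violations = [(a, b) for a, b in cannot_link if labels[a] == labels[b]]
    return labels, violations

# Hypothetical constraints over five documents (indices 0..4).
must_link = [(0, 1), (1, 2), (3, 4)]
cannot_link = [(0, 3), (2, 4)]
labels, violations = cluster_with_constraints(5, must_link, cannot_link)
print(labels, violations)
```

Because LLM answers can be noisy, hard merging like this is brittle; weighting the constraints, as the abstract describes, lets a clustering objective trade off conflicting evidence.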

[86] Computational Typology

Gerhard Jäger

Main category: cs.CL

TL;DR: Discusses the use and benefits of computational statistical modeling in linguistic typology.

Details Motivation: Linguistic typology classifies languages by structural features to reveal what languages share and how they differ; computational methods have recently provided new tools for analyzing large-scale linguistic data. Method: Applies computational statistical modeling to large-scale linguistic data to test hypotheses about language structure and evolution. Result: The article illustrates the benefits of computational statistical modeling for typological research. Conclusion: Computational methods open new avenues for typological research. Abstract: Typology is a subfield of linguistics that focuses on the study and classification of languages based on their structural features. Unlike genealogical classification, which examines the historical relationships between languages, typology seeks to understand the diversity of human languages by identifying common properties and patterns, known as universals. In recent years, computational methods have played an increasingly important role in typological research, enabling the analysis of large-scale linguistic data and the testing of hypotheses about language structure and evolution. This article provides an illustration of the benefits of computational statistical modeling in typology.

[87] FinTextSim: Enhancing Financial Text Analysis with BERTopic

Simon Jehnen,Joaquín Ordieres-Meré,Javier Villalba-Díez

Main category: cs.CL

TL;DR: Examines the effectiveness of BERTopic combined with FinTextSim for financial text analysis, showing markedly clearer topic clusters.

Details Motivation: Advances in information availability and computing power call for more efficient tools for financial text analysis. Method: Uses BERTopic and FinTextSim to analyze 10-K filings from S&P 500 companies and compares performance. Result: FinTextSim substantially raises intratopic similarity and lowers intertopic similarity; BERTopic forms clear topic clusters only when paired with FinTextSim. Conclusion: FinTextSim is pivotal for financial text analysis and can improve research quality and decision-making. Abstract: Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract valuable insights from this wealth of textual data, automated review processes, such as topic modeling, are crucial. This study examines the effectiveness of BERTopic, a state-of-the-art topic model relying on contextual embeddings, for analyzing Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016-2022). Moreover, we introduce FinTextSim, a finetuned sentence-transformer model optimized for clustering and semantic search in financial contexts. Compared to all-MiniLM-L6-v2, the most widely used sentence-transformer, FinTextSim increases intratopic similarity by 81% and reduces intertopic similarity by 100%, significantly enhancing organizational clarity. We assess BERTopic's performance using embeddings from both FinTextSim and all-MiniLM-L6-v2. Our findings reveal that BERTopic only forms clear and distinct economic topic clusters when paired with FinTextSim's embeddings. Without FinTextSim, BERTopic struggles with misclassification and overlapping topics. Thus, FinTextSim is pivotal for advancing financial text analysis. FinTextSim's enhanced contextual embeddings, tailored for the financial domain, elevate the quality of future research and financial information. This improved quality of financial information will enable stakeholders to gain a competitive advantage, streamlining resource allocation and decision-making processes. Moreover, the improved insights have the potential to leverage business valuation and stock price prediction models.
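The abstract reports intratopic and intertopic similarity without defining them. One plausible reading, sketched below, is the mean pairwise cosine similarity between embeddings within the same topic versus across different topics. The two-dimensional embeddings and topic names are invented, and the paper's exact metric may differ.

```python
import math
from itertools import combinations

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(values):
    return sum(values) / len(values)

def intratopic_similarity(topics):
    # Mean cosine similarity over document pairs that share a topic.
    return mean([cosine(u, v) for docs in topics.values()
                 for u, v in combinations(docs, 2)])

def intertopic_similarity(topics):
    # Mean cosine similarity over document pairs from different topics.
    sims = []
    for (_, docs_a), (_, docs_b) in combinations(topics.items(), 2):
        sims.extend(cosine(u, v) for u in docs_a for v in docs_b)
    return mean(sims)

# Hypothetical sentence embeddings grouped by topic.
topics = {
    "liquidity": [[1.0, 0.1], [0.9, 0.2]],
    "risk": [[0.1, 1.0], [0.2, 0.9]],
}
intra = intratopic_similarity(topics)
inter = intertopic_similarity(topics)
print(f"intratopic: {intra:.3f}, intertopic: {inter:.3f}")
```

Well-separated topic clusters show high intratopic and low intertopic similarity, which is the direction of change the abstract attributes to FinTextSim's embeddings.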

[88] Subject islands do not reduce to construction-specific discourse function

Mandy Cartner,Matthew Kogan,Nikolas Webster,Matthew Wagers,Ivy Sichel

Main category: cs.CL

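TL;DR: The paper examines island effects in linguistics, specifically subjects as islands, and shows experimentally that the subject island effect holds across different syntactic constructions, supporting the autonomy of syntax.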

Details Motivation: To test whether the subject island effect is tied to a specific information structure (as in wh-questions) or holds across syntactic constructions, and thereby whether syntax is independent of communicative function. Method: Three large-scale acceptability studies with a super-additive design test the subject island effect in wh-questions, relative clauses, and topicalization. Result: A subject island effect appears in every construction tested, not only in wh-questions. Conclusion: The subject island effect is tied to abstract syntactic representations and is independent of communicative function, in line with the generative tradition's view of syntax as autonomous. Abstract: The term islands in linguistics refers to phrases from which extracting an element results in ungrammaticality (Ross, 1967). Grammatical subjects are considered islands because extracting a sub-part of a subject results in an ill-formed sentence, despite having a clear intended meaning (e.g., "Which topic did the article about inspire you?"). The generative tradition, which views syntax as autonomous of meaning and function, attributes this ungrammaticality to the abstract movement dependency between the wh-phrase and the subject-internal position with which it is associated for interpretation. However, research on language that emphasizes its communicative function suggests instead that syntactic constraints, including islands, can be explained based on the way different constructions package information. Accordingly, Abeillé et al. (2020) suggest that the islandhood of subjects is specific to the information structure of wh-questions, and propose that subjects are not islands for movement, but for focusing, due to their discourse-backgroundedness. This predicts that other constructions that differ in their information structure from wh-questions, but still involve movement, should not create a subject island effect. We test this prediction in three large-scale acceptability studies, using a super-additive design that singles out subject island violations, in three different constructions: wh-questions, relative clauses, and topicalization. We report evidence for a subject island effect in each construction type, despite only wh-questions introducing what Abeillé et al. (2020) call "a clash in information structure." We argue that this motivates an account of islands in terms of abstract, syntactic representations, independent of the communicative function associated with the constructions.

[89] Tina: Tiny Reasoning Models via LoRA

Shangshang Wang,Julian Asilis,Ömer Faruk Akgül,Enes Burak Bilgin,Ollie Liu,Willie Neiswanger

Main category: cs.CL

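TL;DR: Tina is a family of tiny reasoning models that reach strong reasoning ability at very low cost by using LoRA for parameter-efficient updates, matching and sometimes surpassing SOTA models.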

Details Motivation: To explore how cost-effectively strong reasoning ability can be achieved in language models. Method: Low-rank adaptation (LoRA) is applied during reinforcement learning (RL) on a 1.5B-parameter base model. Result: Tina models are competitive with, and sometimes surpass, SOTA RL reasoning models built on the same base model; the best one achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24 at a post-training and evaluation cost of only $9. Conclusion: LoRA is surprisingly effective for efficient RL reasoning: it rapidly adapts the model to the reasoning format while largely preserving the base model's knowledge. Code, training logs, and model weights are open-sourced. Abstract: How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights & checkpoints.
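
To make the parameter-efficiency argument concrete, here is a minimal sketch of a LoRA-style forward pass in plain Python. It is not taken from the Tina code base: the function names, the `alpha / rank` scaling convention, and the layer sizes used in `count_params` are illustrative assumptions. The frozen weight `w_frozen` is left untouched and only the two small factors `lora_a` and `lora_b` would be trained.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x: Matrix, w_frozen: Matrix, lora_a: Matrix, lora_b: Matrix,
                 alpha: float, rank: int) -> Matrix:
    """Compute y = x W + (alpha / rank) * (x A) B.

    Shapes (hypothetical): x is n x d_in, W is d_in x d_out,
    A is d_in x rank, B is rank x d_out. W stays frozen.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def count_params(d_in: int, d_out: int, rank: int) -> Dict[str, int]:
    """Trainable parameters for a full update versus a rank-r LoRA update."""
    return {"full": d_in * d_out, "lora": rank * (d_in + d_out)}
```

With `lora_b` initialised to zeros the adapted layer reproduces the frozen layer exactly, so training starts from the base model's behaviour. For a hypothetical 1536 x 1536 projection and rank 16, `count_params` gives 2,359,296 parameters for a full update against 49,152 for the LoRA factors.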

[90] Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach

Ruizhe Li,Chiwei Zhu,Benfeng Xu,Xiaorui Wang,Zhendong Mao

Main category: cs.CL

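TL;DR: Proposes an automated, TTCW-based method for evaluating the creativity of LLM-generated text that significantly improves agreement with human assessments.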

Details Motivation: Evaluating the creativity of machine-generated text is hard; existing methods are either costly or poorly aligned with human assessments. Method: A reference-based, Likert-style approach scores generated texts relative to high-quality reference texts. Result: The method significantly improves alignment with human assessments, reaching a pairwise accuracy of 0.75 (+15%). Conclusion: The method offers an effective way to automate creativity evaluation for LLMs. Abstract: Creative writing is a key capability of Large Language Models (LLMs), with potential applications in literature, storytelling, and various creative domains. However, evaluating the creativity of machine-generated texts remains a significant challenge, as existing methods either rely on costly manual annotations or fail to align closely with human assessments. In this paper, we propose an effective automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as product. Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts across various tests. Experimental results demonstrate that our method significantly improves the alignment between LLM evaluations and human assessments, achieving a pairwise accuracy of 0.75 (+15%).

[91] A closer look at how large language models trust humans: patterns and biases

Valeria Lerman,Yaniv Dover

Main category: cs.CL

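TL;DR: Studies how large language models (LLMs) form trust in humans along three dimensions (competence, benevolence, and integrity); finds trust patterns similar to those of humans, along with demographic biases.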

Details Motivation: As LLMs increasingly interact with humans in decision-making, understanding how AI trusts humans becomes important. Prior work focuses on human trust in AI, while the mechanisms of AI trust in humans remain unclear. Method: Drawing on behavioral theories, the study tests whether LLM trust depends on competence, benevolence, and integrity, and how demographic variables affect it, across 43,200 simulated experiments with five popular models in five scenarios. Result: LLM trust formation resembles human trust formation, but in some cases it is biased by age, religion, and gender, especially in financial scenarios; models differ in how they estimate trust. Conclusion: AI-to-human trust dynamics need to be better understood, and biases and trust patterns monitored, to avoid potential harm in trust-sensitive applications. Abstract: As large language models (LLMs) and LLM-based agents increasingly interact with humans in decision-making contexts, understanding the trust dynamics between humans and AI agents becomes a central concern. While considerable literature studies how humans trust AI agents, it is much less understood how LLM-based agents develop effective trust in humans. LLM-based agents likely rely on some sort of implicit effective trust in trust-related contexts (e.g., evaluating individual loan applications) to assist and affect decision making. Using established behavioral theories, we develop an approach that studies whether LLM trust depends on the three major trustworthiness dimensions: competence, benevolence and integrity of the human subject. We also study how demographic variables affect effective trust. Across 43,200 simulated experiments, for five popular language models and five different scenarios, we find that LLM trust development shows an overall similarity to human trust development. We find that in most, but not all cases, LLM trust is strongly predicted by trustworthiness, and in some cases also biased by age, religion and gender, especially in financial scenarios. This is particularly true for scenarios common in the literature and for newer models. While the overall patterns align with human-like mechanisms of effective trust formation, different models exhibit variation in how they estimate trust; in some cases, trustworthiness and demographic factors are weak predictors of effective trust. These findings call for a better understanding of AI-to-human trust dynamics and monitoring of biases and trust development patterns to prevent unintended and potentially harmful outcomes in trust-sensitive applications of AI.

[92] What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns

Michael A. Hedderich,Anyi Wang,Raoyuan Zhao,Florian Eichin,Barbara Plank

Main category: cs.CL

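TL;DR: Proposes Spotlight, an approach combining automation and human analysis that uses data mining to separate random variation from systematic differences in language model outputs, helping users analyze the effects of prompt and model changes efficiently.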

Details Motivation: Existing evaluation methods (automated metrics or human evaluation) are limited: they offer little insight or are labor-intensive. Method: Data mining techniques automatically extract token patterns that describe systematic differences and guide the user's manual analysis of prompt and model changes. Result: Three benchmarks test the reliability of token pattern extraction methods, and the approach yields new insights into established prompt data. Conclusion: From a human-centric perspective, Spotlight helps users understand systematic differences in language model outputs, supporting prompt engineering and human-centric model behavior research. Abstract: Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompt and model changes efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs, and we are able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.

[93] Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Junshu Pan,Wei Shen,Shulin Huang,Qiji Zhou,Yue Zhang

Main category: cs.CL

TL;DR: Pre-DPO is a preference optimization paradigm that uses a guiding reference model to improve the performance of DPO and SimPO.

Details Motivation: DPO uses data inefficiently when the policy and reference models are initialized identically, and SimPO, which lacks a reference model, is less robust in training. Method: Pre-DPO uses a guiding reference model to adaptively reweight training samples during optimization. Result: Pre-DPO consistently improves on both DPO and SimPO on the AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks. Conclusion: Pre-DPO improves preference optimization without additional data or external models. Abstract: Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
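
For readers who want to see why the reference model acts as a data weight adjuster, the standard DPO objective and its gradient are reproduced below. This is the usual DPO formulation, not a formula specific to Pre-DPO, and the notation (chosen response $y_w$, rejected response $y_l$, temperature $\beta$, dataset $\mathcal{D}$) is the conventional one rather than something stated in the abstract.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
  = -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \sigma\!\big( \hat r_\theta(x, y_l) - \hat r_\theta(x, y_w) \big)
      \big( \nabla_\theta \log \pi_\theta(y_w \mid x)
          - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \right],
\qquad
\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

The per-sample factor $\sigma(\hat r_\theta(x, y_l) - \hat r_\theta(x, y_w))$ depends on $\pi_{\mathrm{ref}}$, so changing the reference model changes how strongly each preference pair contributes to the update.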

[94] Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis

Luwei Xiao,Rui Mao,Shuai Zhao,Qika Lin,Yanhao Jia,Liang He,Erik Cambria

Main category: cs.CL

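TL;DR: Proposes Chimera, a multimodal sentiment classification framework that combines visual and textual features with affective-cognitive resonance to better predict sentiment polarity toward specific aspect targets.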

Details Motivation: With multimodal content growing on social platforms, existing methods fall short in understanding fine-grained visual content and the cognitive rationales behind sentiment, so a more comprehensive framework is needed. Method: Chimera incorporates visual patch features, extracts coarse- and fine-grained visual features and translates them into textual descriptions, and uses sentiment causes and impressions generated by a large language model. Result: The model is effective on standard MASC datasets and is more flexible than LLMs such as GPT-4o. Conclusion: By combining semantic cues with affective-cognitive resonance, Chimera improves multimodal aspect-based sentiment classification; the implementation and dataset are publicly released. Abstract: Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms, aimed at predicting sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs). Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of emotions evoked by image content). In this study, we present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects and infer the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, this framework first incorporates visual patch features for patch-word alignment. Meanwhile, it extracts coarse-grained visual features (e.g., overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model's awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also exhibits greater flexibility to MASC compared to LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at https://github.com/Xillv/Chimera

[95] Dynamic Early Exit in Reasoning Models

Chenxu Yang,Qingyi Si,Yongjie Duan,Zheliang Zhu,Chenyu Zhu,Zheng Lin,Li Cao,Weiping Wang

Main category: cs.CL

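TL;DR: Proposes a method that lets models self-truncate chain-of-thought (CoT) via dynamic early exit, cutting redundant reasoning steps while improving efficiency and accuracy.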

Details Motivation: Long CoT can hurt both efficiency and accuracy, which calls for dynamic truncation. Method: Model behavior is monitored at reasoning transition points and generation is terminated dynamically; no additional training is required. Result: Across multiple benchmarks, CoT length drops by 31% to 43% on average while accuracy improves by 1.7% to 5.7%. Conclusion: The method is simple, effective, and applicable to existing reasoning models. Abstract: Recent advances in large reasoning language models (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down problem solving, but also risks accuracy loss due to extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points (e.g., "Wait" tokens) and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on multiple reasoning benchmarks MATH-500, AMC 2023, GPQA Diamond and AIME 2024 show that the proposed method is consistently effective on deepseek-series reasoning LLMs, reducing the length of CoT sequences by an average of 31% to 43% while improving accuracy by 1.7% to 5.7%.
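
The control flow described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: `probe_confidence` is a hypothetical callable standing in for "ask the model for a trial answer and measure its confidence", the reasoning is assumed to arrive as text chunks, and the 0.9 threshold is arbitrary.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def generate_with_early_exit(
    chunks: Iterable[str],
    probe_confidence: Callable[[List[str]], Tuple[str, float]],
    threshold: float = 0.9,
    transition_marker: str = "Wait",
) -> Tuple[List[str], Optional[str]]:
    """Stop a reasoning chain at a transition point once a trial answer is confident.

    chunks: reasoning segments in generation order (assumed input format).
    probe_confidence: hypothetical hook returning (trial_answer, confidence)
        for the reasoning kept so far.
    Returns the kept reasoning and the early answer, or None if no exit fired.
    """
    kept: List[str] = []
    for chunk in chunks:
        # A chunk opening with the marker signals a potential new reasoning chain.
        if kept and chunk.startswith(transition_marker):
            answer, confidence = probe_confidence(kept)
            if confidence >= threshold:
                return kept, answer  # skip the next chain entirely
        kept.append(chunk)
    return kept, None
```

A stub probe is enough to exercise it: with chunks `["Let x be the unknown.", "So x = 42.", "Wait, let me double-check."]` and a probe that reports high confidence once "x = 42" has appeared, the function returns the first two chunks and the trial answer, and the re-checking chunk is never consumed.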

[96] SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning

Cheng Wen,Tingwei Guo,Shuaijiang Zhao,Wei Zou,Xiangang Li

Main category: cs.CL

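TL;DR: Uses reinforcement learning (RL) to improve the reasoning of large audio-language models (LALMs) and presents SARI, a structured audio reasoning model that performs strongly in benchmark evaluations.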

Details Motivation: To explore whether and how RL-driven reasoning gains transfer to audio-language models, a largely unexplored question. Method: Two-stage training: supervised fine-tuning (SFT) followed by curriculum-guided GRPO, with a comparison of structured and unstructured reasoning. Result: SARI improves average accuracy by 16.35% over the base model, and the Qwen2.5-Omni variant reaches a state-of-the-art 67.08% on MMAU test-mini. Conclusion: Structured reasoning and curriculum learning substantially enhance audio-language understanding. Abstract: Recent work shows that reinforcement learning (RL) can markedly sharpen the reasoning ability of large language models (LLMs) by prompting them to "think before answering." Yet whether and how these gains transfer to audio-language reasoning remains largely unexplored. We extend the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to a Large Audio-Language Model (LALM), and construct a 32k-sample multiple-choice corpus. Using a two-stage regimen (supervised fine-tuning on structured and unstructured chains-of-thought, followed by curriculum-guided GRPO), we systematically compare implicit vs. explicit, and structured vs. free-form reasoning under identical architectures. Our structured audio reasoning model, SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning), achieves a 16.35% improvement in average accuracy over the base model Qwen2-Audio-7B-Instruct. Furthermore, the variant built upon Qwen2.5-Omni reaches state-of-the-art performance of 67.08% on the MMAU test-mini benchmark. Ablation experiments show that on the base model we use: (i) SFT warm-up is important for stable RL training, (ii) structured chains yield more robust generalization than unstructured ones, and (iii) easy-to-hard curricula accelerate convergence and improve final performance. These findings demonstrate that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.

[97] FairTranslate: An English-French Dataset for Gender Bias Evaluation in Machine Translation by Overcoming Gender Binarity

Fanny Jourdan,Yannick Chevalier,Cécile Favre

Main category: cs.CL

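TL;DR: FairTranslate is a new dataset for evaluating how LLMs handle non-binary gender bias in English-to-French translation; results show substantial bias in current models.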

Details Motivation: LLMs handle inclusive language (such as the singular 'they' pronoun) poorly in translation, so their gender bias needs systematic evaluation. Method: The FairTranslate dataset (2418 sentence pairs) is built and four LLMs are evaluated on it under different prompting procedures. Result: LLMs show substantial biases in gender representation and need improvement to translate fairly. Conclusion: Focused strategies are needed to ensure inclusive LLM translation; the dataset and code are publicly available. Abstract: Large Language Models (LLMs) are increasingly leveraged for translation tasks but often fall short when translating inclusive language -- such as texts containing the singular 'they' pronoun or otherwise reflecting fair linguistic protocols. Because these challenges span both computational and societal domains, it is imperative to critically evaluate how well LLMs handle inclusive translation with a well-founded framework. This paper presents FairTranslate, a novel, fully human-annotated dataset designed to evaluate non-binary gender biases in machine translation systems from English to French. FairTranslate includes 2418 English-French sentence pairs related to occupations, annotated with rich metadata such as the stereotypical alignment of the occupation, grammatical gender indicator ambiguity, and the ground-truth gender label (male, female, or inclusive). We evaluate four leading LLMs (Gemma2-2B, Mistral-7B, Llama3.1-8B, Llama3.3-70B) on this dataset under different prompting procedures. Our results reveal substantial biases in gender representation across LLMs, highlighting persistent challenges in achieving equitable outcomes in machine translation. These findings underscore the need for focused strategies and interventions aimed at ensuring fair and inclusive language usage in LLM-based translation systems. We make the FairTranslate dataset publicly available on Hugging Face, and disclose the code for all experiments on GitHub.

[98] W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models

Shang Wang

Main category: cs.CL

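TL;DR: Proposes W-PCA, a zero-shot neural architecture search method for lightweight language models that uses parameter count and principal component analysis as gradient-free proxies, significantly reducing training time and improving test scores.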

Details Motivation: Existing zero-shot NAS methods suffer from biased evaluation metrics and computational inefficiency, so a more efficient way to design and evaluate lightweight language models is needed. Method: Weight-weighted PCA (W-PCA) uses the parameter count and a principal component analysis of the FFN layer as evaluation proxies, avoiding gradient computation. Result: On GLUE and SQuAD, the method significantly reduces training time and scores higher than prior methods; its ranking evaluation on the FlexiBERT search space is also better. Conclusion: W-PCA is efficient and effective for designing lightweight language models. Abstract: The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding $\eta$ in the feed-forward neural (FFN) layer. Additionally, by eliminating the need for gradient computations, we optimize the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation.
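
The second proxy, the number of principal components whose cumulative contribution exceeds $\eta$, can be illustrated with a few lines of Python. This sketch covers only that counting step and assumes the per-component variances (PCA eigenvalues of the FFN activations) have already been computed elsewhere; the function name is hypothetical, and how the paper combines this count with the parameter count is not specified in the abstract, so it is not shown.

```python
from typing import Sequence


def num_components_exceeding(variances: Sequence[float], eta: float) -> int:
    """Smallest k such that the top-k components explain more than eta of the variance.

    variances: non-negative per-component variances (assumed precomputed).
    eta: cumulative-contribution threshold in (0, 1).
    """
    total = sum(variances)
    if total <= 0:
        return 0
    cumulative = 0.0
    for k, value in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += value
        if cumulative / total > eta:
            return k
    # Threshold never strictly exceeded (e.g. eta = 1.0): all components are needed.
    return len(variances)
```

For variances `[1, 4, 2, 3]` the cumulative contributions are 0.4, 0.7, 0.9 and 1.0, so a threshold of 0.8 is first exceeded at three components.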

[99] Few-shot Hate Speech Detection Based on the MindSpore Framework

Zhenkai Qin,Dongze Wu,Yuxin Liu,Guifang Yang

Main category: cs.CL

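TL;DR: Proposes MS-FSLHate, a few-shot hate speech detection framework that combines learnable prompt embeddings, a CNN-BiLSTM architecture, and adversarial data augmentation to improve performance.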

Details Motivation: Hate speech is widespread on social media, and existing deep learning models degrade in few-shot or low-resource settings. Method: Learnable prompt embeddings and a CNN-BiLSTM backbone with attention pooling are combined with synonym-based adversarial data augmentation. Result: The approach outperforms baselines on HateXplain and HSOL in precision, recall, and F1-score. Conclusion: Combining prompt-based learning with adversarial augmentation suits resource-constrained environments and offers an effective approach to few-shot hate speech detection. Abstract: The proliferation of hate speech on social media poses a significant threat to online communities, requiring effective detection systems. While deep learning models have shown promise, their performance often deteriorates in few-shot or low-resource settings due to reliance on large annotated corpora. To address this, we propose MS-FSLHate, a prompt-enhanced neural framework for few-shot hate speech detection implemented on the MindSpore deep learning platform. The model integrates learnable prompt embeddings, a CNN-BiLSTM backbone with attention pooling, and synonym-based adversarial data augmentation to improve generalization. Experimental results on two benchmark datasets, HateXplain and HSOL, demonstrate that our approach outperforms competitive baselines in precision, recall, and F1-score. Additionally, the framework shows high efficiency and scalability, suggesting its suitability for deployment in resource-constrained environments. These findings highlight the potential of combining prompt-based learning with adversarial augmentation for robust and adaptable hate speech detection in few-shot scenarios.

[100] CAPO: Cost-Aware Prompt Optimization

Tom Zehle,Moritz Schlager,Timo Heiß,Matthias Feurer

Main category: cs.CL

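TL;DR: CAPO is a cost-aware prompt optimization algorithm built on AutoML techniques; it uses an evolutionary approach and multi-objective optimization to cut LLM calls and prompt length while outperforming existing methods.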

Details Motivation: Current automated prompt optimization methods need many LLM calls and input tokens, which makes them expensive; CAPO aims to address this with a more efficient algorithm. Method: CAPO combines an evolutionary approach (with LLMs as operators), racing to save evaluations, and multi-objective optimization (balancing performance against prompt length), and jointly optimizes instructions and few-shot examples. Result: CAPO outperforms existing methods in 11/15 cases, by up to 21 percentage points, performs better at smaller budgets, and yields shorter prompts. Conclusion: By improving cost-efficiency, CAPO makes prompt optimization more powerful and accessible. Abstract: Large language models (LLMs) have revolutionized natural language processing by solving a wide range of tasks simply guided by a prompt. Yet their performance is highly sensitive to prompt formulation. While automated prompt optimization addresses this challenge by finding optimal prompts, current methods require a substantial number of LLM calls and input tokens, making prompt optimization expensive. We introduce CAPO (Cost-Aware Prompt Optimization), an algorithm that enhances prompt optimization efficiency by integrating AutoML techniques. CAPO is an evolutionary approach with LLMs as operators, incorporating racing to save evaluations and multi-objective optimization to balance performance with prompt length. It jointly optimizes instructions and few-shot examples while leveraging task descriptions for improved robustness. Our extensive experiments across diverse datasets and LLMs demonstrate that CAPO outperforms state-of-the-art discrete prompt optimization methods in 11/15 cases with improvements up to 21%p. Our algorithm achieves better performance even with smaller budgets, saves evaluations through racing, and decreases average prompt length via a length penalty, making it both cost-efficient and cost-aware. Even without few-shot examples, CAPO outperforms its competitors and generally remains robust to initial prompts. CAPO represents an important step toward making prompt optimization more powerful and accessible by improving cost-efficiency.

[101] Methods for Recognizing Nested Terms

Igor Rozhkov,Natalia Loukachevitch

Main category: cs.CL

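TL;DR: Describes applying the Binder model to nested term extraction in the RuTermEval competition, where it achieved the best results, and studies the new task of recognizing nested terms from flat (non-nested) annotations.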

Details Motivation: To explore how nested terms can be extracted effectively, especially when nested annotations are unavailable. Method: The Binder model, previously applied successfully to nested named entity recognition, is used. Result: The approach obtained the best term recognition results in all three tracks of the RuTermEval competition. Conclusion: Several of the proposed approaches can retrieve nested terms effectively without nested labeling. Abstract: In this paper, we describe our participation in the RuTermEval competition devoted to extracting nested terms. We apply the Binder model, which was previously successfully applied to the recognition of nested named entities, to extract nested terms. We obtained the best results of term recognition in all three tracks of the RuTermEval competition. In addition, we study the new task of recognition of nested terms from flat training data annotated with terms without nestedness. We can conclude that several approaches we proposed in this work are viable enough to retrieve nested terms effectively without nested labeling of them.

Jingyu Zhang,Jiacan Yu,Marc Marone,Benjamin Van Durme,Daniel Khashabi

Main category: cs.CL

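TL;DR: Proposes BloomScrub, a lightweight inference-time method addressing the infringement risk that arises from LLMs' exposure to copyrighted material during pre-training; it combines quote detection with rewriting to provide certified copyright takedown.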

Details Motivation: Exposure to copyrighted material during pre-training can lead to infringement after deployment, and existing methods do little against worst-case risks such as long verbatim quotes. Method: BloomScrub interleaves quote detection with rewriting, uses Bloom filters for efficient copyright screening, and abstains from responding when necessary. Result: BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency. Conclusion: Lightweight inference-time methods can be surprisingly effective for copyright prevention. Abstract: The exposure of large language models (LLMs) to copyrighted material during pre-training raises concerns about unintentional copyright infringement post deployment. This has driven the development of "copyright takedown" methods, post-training approaches aimed at preventing models from generating content substantially similar to copyrighted ones. While current mitigation approaches are somewhat effective for average-case risks, we demonstrate that they overlook worst-case copyright risks exhibited by the existence of long, verbatim quotes from copyrighted sources. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our method repeatedly interleaves quote detection with rewriting techniques to transform potentially infringing segments. By leveraging efficient data sketches (Bloom filters), our approach enables scalable copyright screening even for large-scale real-world corpora. When quotes beyond a length threshold cannot be removed, the system can abstain from responding, offering certified risk reduction. Experimental results show that BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency with adaptive abstention. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
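
The quote-detection half of this idea can be sketched with a small Bloom filter over word n-grams. This is an illustrative sketch, not the BloomScrub implementation: the class, the helper names, whitespace tokenisation, the n-gram size, and the filter parameters are all assumptions, and the rewriting step is omitted. A Bloom filter can report false positives but never false negatives, which is why a "no long quote found" result can serve as a guarantee.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Fixed-size bit array probed by several salted SHA-256 hashes."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(tokens: List[str], n: int) -> List[str]:
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_corpus(texts: Iterable[str], n: int) -> BloomFilter:
    """Insert every word n-gram of the protected texts into a Bloom filter."""
    bloom = BloomFilter()
    for text in texts:
        for gram in ngrams(text.lower().split(), n):
            bloom.add(gram)
    return bloom


def longest_quoted_span(text: str, bloom: BloomFilter, n: int) -> int:
    """Length in tokens of the longest run covered by consecutive n-gram hits."""
    best = run = 0
    for gram in ngrams(text.lower().split(), n):
        if gram in bloom:
            run += 1
            best = max(best, run + n - 1)
        else:
            run = 0
    return best
```

A caller could compare `longest_quoted_span(output, bloom, n)` with a length threshold and either trigger a rewrite or abstain when the threshold is exceeded, mirroring the detect, rewrite, abstain loop the abstract describes.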

[103] LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement

Zhifan Ye,Kejing Xia,Yonggan Fu,Xin Dong,Jihoon Hong,Xiangchi Yuan,Shizhe Diao,Jan Kautz,Pavlo Molchanov,Yingyan Celine Lin

Main category: cs.CL

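TL;DR: LongMamba is a training-free technique that significantly improves the long-context capabilities of Mamba models by identifying critical tokens and filtering out the rest.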

Details Motivation: SSMs such as Mamba handle long contexts efficiently but still underperform Transformers on long-context understanding; LongMamba aims to close this gap. Method: Mamba's hidden channels are divided into local and global channels, and unimportant tokens are filtered out of the global channels to prevent memory decay. Result: LongMamba significantly improves Mamba's long-context performance without additional training. Conclusion: LongMamba sets a new standard for Mamba's long-context performance and extends its operational range. Abstract: State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths largely exceed the training sequence length, global channels exhibit limitations in adaptively extending their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training. Our code is available at https://github.com/GATECH-EIC/LongMamba.

[104] Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability

Daniel Hendriks,Philipp Spitzer,Niklas Kühl,Gerhard Satzger

Main category: cs.CL

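TL;DR: Studies knowledge distillation for large language models (LLMs), proposes new distillation methods, and compares them in terms of performance and explainability.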

Details Motivation: To address the difficulty of deploying LLMs in resource-constrained environments and to study how knowledge distillation methods affect model performance and explainability. Method: Critique-revision prompting is applied to generate training data, existing methods are synthesized for training the student model, and the methods are compared systematically on the CQA dataset. Result: New distillation methods are contributed and compared experimentally in terms of performance and explainability. Conclusion: The new methods advance the distillation of small language models and support broader applicability and faster diffusion of LLM technology. Abstract: Artificial Intelligence (AI) has increasingly influenced modern society, recently in particular through significant advancements in Large Language Models (LLMs). However, high computational and storage demands of LLMs still limit their deployment in resource-constrained environments. Knowledge distillation addresses this challenge by training a small student model from a larger teacher model. Previous research has introduced several distillation methods for both generating training data and for training the student model. Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated and compared. In this work, we enlarge the set of available methods by applying critique-revision prompting to distillation for data generation and by synthesizing existing methods for training. For these methods, we provide a systematic comparison based on the widely used Commonsense Question-Answering (CQA) dataset. While we measure performance via student model accuracy, we employ a human-grounded study to evaluate explainability. We contribute new distillation methods and their comparison in terms of both performance and explainability. This should further advance the distillation of small language models and, thus, contribute to broader applicability and faster diffusion of LLM technology.

[105] Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Ziqiao Ma,Jing Ding,Xuejun Zhang,Dezhi Luo,Jiahe Ding,Sihan Xu,Yuchen Huang,Run Peng,Joyce Chai

Main category: cs.CL

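TL;DR: The paper revisits referring expression generation (REG) from a pragmatic perspective, introduces the RefOI dataset, and identifies three key failures of pragmatic competence in current vision-language models (VLMs).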

Details Motivation: Current evaluations of vision-language models often overlook the pragmatic dimension, treating REG as a region-based captioning task and neglecting Gricean maxims. Method: Introduces RefOI, a dataset of 1.5k images, and systematically evaluates the pragmatic competence of state-of-the-art VLMs. Result: VLMs show three pragmatic failures: they fail to uniquely identify the referent, include excessive or irrelevant information, and are misaligned with human pragmatic preferences. Conclusion: The paper calls for pragmatically informed models and evaluation frameworks that align with real human communication. Abstract: Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

[106] A Python Tool for Reconstructing Full News Text from GDELT

A. Fronzetti Colladon,R. Vestrelli

Main category: cs.CL

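TL;DR: The paper presents a near-zero-cost method for obtaining full-text news articles from GDELT data, addressing the access restrictions of existing news datasets.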

Details Motivation: News data are essential to research across many disciplines, but existing datasets are costly or incomplete, which limits research. Method: Uses the GDELT Web News NGrams 3.0 dataset and Python code to reconstruct full-text articles. Result: Provides low-cost access to structured, large-scale news data for text analysis. Conclusion: The method improves the accessibility of news data and supports research in economic forecasting, computational social science, and natural language processing. Abstract: News data have become an essential resource across various disciplines, including economics, finance, management, social sciences, and computer science. Researchers leverage newspaper articles to study economic trends, market dynamics, corporate strategies, public perception, political discourse, and the evolution of public opinion. Additionally, news datasets have been instrumental in training large-scale language models, with applications in sentiment analysis, fake news detection, and automated news summarization. Despite their significance, access to comprehensive news corpora remains a key challenge. Many full-text news providers, such as Factiva and LexisNexis, require costly subscriptions, while free alternatives often suffer from incomplete data and transparency issues. This paper presents a novel approach to obtaining full-text newspaper articles at near-zero cost by leveraging data from the Global Database of Events, Language, and Tone (GDELT). Specifically, we focus on the GDELT Web News NGrams 3.0 dataset, which provides high-frequency updates of n-grams extracted from global online news sources. We provide Python code to reconstruct full-text articles from these n-grams by identifying overlapping textual fragments and intelligently merging them. Our method enables researchers to access structured, large-scale newspaper data for text analysis while overcoming the limitations of existing proprietary datasets. The proposed approach enhances the accessibility of news data for empirical research, facilitating applications in economic forecasting, computational social science, and natural language processing.
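The abstract describes reconstruction as identifying overlapping fragments and merging them. The following is a minimal sketch of that general idea, not the authors' tool: it assumes the input is already a plain list of text fragments belonging to one article (the real GDELT record layout is not modeled here), and the function names `overlap_len` and `reconstruct` as well as the `min_overlap` parameter are hypothetical.

```python
from typing import List


def overlap_len(left: List[str], right: List[str], min_overlap: int) -> int:
    """Return the length of the longest suffix of `left` equal to a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def reconstruct(fragments: List[str], min_overlap: int = 2) -> str:
    """Greedily chain fragments onto the first one wherever word sequences overlap.

    Assumption: fragments come from a single article and `min_overlap` >= 1.
    Fragments that never overlap the growing text are left out.
    """
    pending = [f.split() for f in fragments]
    text = pending.pop(0)
    progress = True
    while pending and progress:
        progress = False
        for i, frag in enumerate(pending):
            k = overlap_len(text, frag, min_overlap)
            if k:  # fragment extends the text on the right
                text = text + frag[k:]
                pending.pop(i)
                progress = True
                break
            k = overlap_len(frag, text, min_overlap)
            if k:  # fragment extends the text on the left
                text = frag + text[k:]
                pending.pop(i)
                progress = True
                break
    return " ".join(text)


frags = ["brown fox jumps over", "the quick brown fox", "jumps over the lazy dog"]
print(reconstruct(frags))  # the quick brown fox jumps over the lazy dog
```

A greedy merge like this can go wrong when the same word sequence appears in several places in an article, so a larger `min_overlap` or positional metadata would likely be needed in practice.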

[107] Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation

Zhiyuan Hu,Shiyun Xiong,Yifan Zhang,See-Kiong Ng,Anh Tuan Luu,Bo An,Shuicheng Yan,Bryan Hooi

Main category: cs.CL

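TL;DR: Proposes guiding vision-language model (VLM) agents with a reward model at inference time, which significantly improves performance on GUI navigation tasks.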

Details Motivation: Current VLMs have limited ability to generate correct actions in complex GUI environments, and existing evaluation and refinement techniques suffer from delayed feedback and local optimization. Method: Applies process supervision from a reward model to the VLM agent at inference time, optimizing the action at each step, and combines this with trajectory reflection and retry mechanisms. Result: Single-step action accuracy improves by 3.4% in static environments, and task success rate improves by about 33% in a dynamic environment. Conclusion: The method effectively improves VLM performance on GUI tasks, particularly in dynamic environments. Abstract: Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short due to delayed feedback and local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in three GUI navigation tasks, achieving a 3.4% improvement in single-step action accuracy for static environments, along with a roughly 33% increase in task success rate in one dynamic environment. With further integration of trajectory reflection and retry mechanisms, we also demonstrate even greater enhancement in task success.

[108] PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu,Shaoyang Guo,Zhuo-Yang Song,Yunbo Sun,Zeyu Cai,Jiashen Wei,Tianyu Luo,Yixuan Yin,Haoxu Zhang,Yi Hu,Chenyang Wang,Chencheng Tang,Haoling Chang,Qi Liu,Ziheng Zhou,Tianyu Zhang,Jingtian Zhang,Zhangyi Liu,Minghao Li,Yuku Zhang,Boxuan Jing,Xianqi Yin,Yutong Ren,Zizhuo Fu,Weike Wang,Xudong Tian,Anqi Lv,Laifu Man,Jianxiang Li,Feiyu Tao,Qihua Sun,Zhou Liang,Yushu Mu,Zhongxuan Li,Jing-Jun Zhang,Shutao Zhang,Xiaotian Li,Xingqi Xia,Jiawei Lin,Zheyu Shen,Jiahang Chen,Qiuhao Xiong,Binran Wang,Fengyuan Wang,Ziyang Ni,Bohan Zhang,Fan Cui,Changkun Shao,Qing-Hong Cao,Ming-xing Luo,Muhan Zhang,Hua Xing Zhu

Main category: cs.CL

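TL;DR: PHYBench is a high-quality benchmark for evaluating the reasoning of large language models in physical contexts. It contains 500 carefully curated physics problems and introduces a new evaluation metric, the EED Score. Results show that current models still lag significantly behind human experts in complex physical reasoning.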

Details Motivation: To evaluate how well large language models reason about real-world physical scenarios, expose their limitations, and drive improvement. Method: Builds the PHYBench benchmark of 500 physics problems and proposes the EED Score, a metric based on the edit distance between mathematical expressions. Result: Even state-of-the-art reasoning models lag significantly behind human experts in complex physical reasoning. Conclusion: PHYBench and the EED Score provide effective tools for evaluating and improving the physical reasoning of large language models. Abstract: We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.
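The abstract says only that the EED Score is based on the edit distance between mathematical expressions. The sketch below is not the paper's metric: it swaps in a plain token-level Levenshtein distance and a hypothetical normalization, purely to illustrate how an edit distance gives graded rather than binary credit. The names `tokenize`, `levenshtein`, and `similarity` are assumptions.

```python
import re
from typing import List


def tokenize(expr: str) -> List[str]:
    """Split an expression into identifiers, numbers, and single-character symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", expr)


def levenshtein(a: List[str], b: List[str]) -> int:
    """Classic dynamic-programming edit distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(pred: str, ref: str) -> float:
    """Hypothetical graded score in [0, 1]; 1.0 means identical token sequences."""
    a, b = tokenize(pred), tokenize(ref)
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


print(similarity("m*g*h", "m*g*h"))    # 1.0
print(similarity("m*g*h", "2*m*g*h"))  # partial credit: a missing factor costs 2 token edits
```

A token-level distance is sensitive to superficial rewrites such as `g*m*h` versus `m*g*h`, which is one reason a metric defined over parsed expressions, as the abstract describes, would be preferable to this stand-in.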

[109] TTRL: Test-Time Reinforcement Learning

Yuxin Zuo,Kaiyan Zhang,Shang Qu,Li Sheng,Xuekai Zhu,Biqing Qi,Youbang Sun,Ganqu Cui,Ning Ding,Bowen Zhou

Main category: cs.CL

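TL;DR: The paper proposes TTRL, a method that trains large language models with reinforcement learning on unlabeled data and significantly improves performance.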

Details Motivation: To study how reinforcement learning can improve large language models on reasoning tasks when no explicit labels are available. Method: Proposes Test-Time Reinforcement Learning (TTRL), which exploits the priors of the pre-trained model and derives reward signals from test-time scaling practices such as majority voting. Result: TTRL significantly improves performance; for example, it raises the pass@1 of Qwen-2.5-Math-7B on AIME 2024 by approximately 159%. Conclusion: TTRL is broadly effective across tasks and shows the potential of training on unlabeled data. Abstract: This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, it consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
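The abstract notes that majority voting over sampled outputs can provide rewards without ground-truth labels. A minimal sketch of that idea follows, assuming final answers have already been extracted from N sampled responses as strings; the function name `majority_vote_rewards` and the 0/1 reward values are assumptions rather than the authors' exact formulation.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Use the most frequent answer as a pseudo-label and reward agreement with it.

    Ties are broken in favor of the answer that appears first in `answers`.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards


sampled = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(sampled)
print(label)    # 42
print(rewards)  # [1.0, 0.0, 1.0, 0.0, 1.0]
```

These per-sample rewards would then feed a standard RL update; that step depends on the training framework and is not sketched here.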

cs.AI

[110] Learning Adaptive Parallel Reasoning with Language Models

Jiayi Pan,Xiuyu Li,Long Lian,Charlie Snell,Yifei Zhou,Adam Yala,Trevor Darrell,Kurt Keutzer,Alane Suhr

Main category: cs.AI

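TL;DR: APR is a new reasoning framework that uses adaptive parallel reasoning to optimize how language models allocate computation, significantly improving performance and scalability.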

Details Motivation: Existing reasoning methods produce overly long outputs or coordinate poorly, which limits the reasoning ability and efficiency of language models. Method: Proposes the APR framework, which combines serialized and parallel computation through spawn() and join() operations and optimizes the inference threads with reinforcement learning. Result: On the Countdown task, APR performs better at the same context window, the same compute budget, and the same latency. Conclusion: APR offers a new direction for language models to adaptively optimize their own reasoning process. Abstract: Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.

[111] TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

Daocheng Fu,Zijun Chen,Renqiu Xia,Qi Liu,Yuan Feng,Hongbin Zhou,Renrui Zhang,Shiyang Feng,Peng Gao,Junchi Yan,Botian Shi,Bo Zhang,Yu Qiao

Main category: cs.AI

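TL;DR: The paper proposes TrustGeoGen, a data engine that generates geometry problems and uses formal verification to provide a benchmark, addressing the noise and self-contradictions in existing synthetic geometry benchmarks.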

Details Motivation: Geometric problem solving requires integrating multimodal information while maintaining logical consistency, and existing benchmarks fall short in both methodology and verification. Method: Proposes the TrustGeoGen engine, which generates a high-quality geometry dataset through multimodal-aligned generation, formal verification, a bootstrapping mechanism, and the GeoExplore algorithms. Result: Produces the GeoTrust-200K dataset and a test set; state-of-the-art models reach only 49.17% accuracy on the test set, while models trained on GeoTrust generalize out of distribution. Conclusion: TrustGeoGen provides a reliable benchmark and method for geometric problem solving and significantly reduces logical inconsistencies. Abstract: Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the fast development of large language models in general problem solving, GPS remains unresolved with regard to both methodology and benchmarks, especially given that existing synthetic GPS benchmarks are often not self-verified and contain noise and self-contradictory information due to the hallucinations of LLMs. In this paper, we propose a scalable data engine called TrustGeoGen for problem generation, with formal verification to provide a principled benchmark, which we believe lays the foundation for the further development of methods for GPS. The engine synthesizes geometric data through four key innovations: 1) multimodal-aligned generation of diagrams, textual descriptions, and stepwise solutions; 2) formal verification ensuring rule-compliant reasoning paths; 3) a bootstrapping mechanism enabling complexity escalation via recursive state generation; and 4) our GeoExplore series of algorithms, which simultaneously produce multi-solution variants and self-reflective backtracking traces. By formal logical verification, TrustGeoGen produces the GeoTrust-200K dataset with guaranteed modality integrity, along with the GeoTrust-test test set. Experiments reveal that state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test, demonstrating its evaluation stringency. Crucially, models trained on GeoTrust achieve OOD generalization on GeoQA, significantly reducing logical inconsistencies relative to pseudo-labels annotated by OpenAI-o1. Our code is available at https://github.com/Alpha-Innovator/TrustGeoGen

[112] AGI Is Coming... Right After AI Learns to Play Wordle

Sarath Shekkizhar,Romain Cosentino

Main category: cs.AI

TL;DR: The study evaluates OpenAI's Computer-User Agent (CUA) on the Wordle game and finds significant problems with color recognition; its success rate was only 5.36%.

Details Motivation: To examine how multimodal agents such as CUA perform on simple tasks and expose the limitations of current frontier AI models. Method: Has CUA play the New York Times Wordle game and analyzes its behavior and shortcomings. Result: The model recognizes colors poorly and succeeds rarely, showing that simple tasks remain challenging for AI. Conclusion: The paper discusses potential causes, implications for future development, and research directions for improving such systems. Abstract: This paper investigates multimodal agents, in particular, OpenAI's Computer-User Agent (CUA), trained to control and complete tasks through a standard computer interface, similar to humans. We evaluated the agent's performance on the New York Times Wordle game to elicit model behaviors and identify shortcomings. Our findings revealed a significant discrepancy in the model's ability to recognize colors correctly depending on the context. The model had a 5.36% success rate over several hundred runs across a week of Wordle. Despite the immense enthusiasm surrounding AI agents and their potential to usher in Artificial General Intelligence (AGI), our findings reinforce the fact that even simple tasks present substantial challenges for today's frontier AI models. We conclude with a discussion of the potential underlying causes, implications for future development, and research directions to improve these AI systems.

cs.AR

[113] VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation

Anjiang Wei,Huanmi Tan,Tarun Suresh,Daniel Mendoza,Thiago S. F. X. Teixeira,Ke Wang,Caroline Trippel,Alex Aiken

Main category: cs.AR

TL;DR: VERICODER is a model for RTL code generation, fine-tuned on a functionally validated dataset, which significantly improves functional correctness.

Details Motivation: Most existing RTL datasets focus on syntactic validity rather than functional validation, so generated code may not implement the intended behavior. Method: Combines unit test generation with feedback-directed refinement to build a functionally validated dataset, then fine-tunes the model on it. Result: VERICODER achieves state-of-the-art functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4%. Conclusion: High-quality, functionally validated datasets are essential for RTL code generation. Abstract: Recent advances in Large Language Models (LLMs) have sparked growing interest in applying them to Electronic Design Automation (EDA) tasks, particularly Register Transfer Level (RTL) code generation. While several RTL datasets have been introduced, most focus on syntactic validity rather than functional validation with tests, leading to training examples that compile but may not implement the intended behavior. We present VERICODER, a model for RTL code generation fine-tuned on a dataset validated for functional correctness. This fine-tuning dataset is constructed using a novel methodology that combines unit test generation with feedback-directed refinement. Given a natural language specification and an initial RTL design, we prompt a teacher model (GPT-4o-mini) to generate unit tests and iteratively revise the RTL design based on its simulation results using the generated tests. If necessary, the teacher model also updates the tests to ensure they comply with the natural language specification. As a result of this process, every example in our dataset is functionally validated, consisting of a natural language description, an RTL implementation, and passing tests. Fine-tuned on this dataset of over 125,000 examples, VERICODER achieves state-of-the-art metrics in functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4% respectively. An ablation study further shows that models trained on our functionally validated dataset outperform those trained on functionally non-validated datasets, underscoring the importance of high-quality datasets in RTL code generation.

cs.CR

[114] A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Kun Wang,Guibin Zhang,Zhenhong Zhou,Jiahao Wu,Miao Yu,Shiqian Zhao,Chenlong Yin,Jinhu Fu,Yibo Yan,Hanjun Luo,Liang Lin,Zhihao Xu,Haolang Lu,Xinye Cao,Xinyun Zhou,Weifei Jin,Fanci Meng,Junyuan Mao,Hao Wu,Minghe Wang,Fan Zhang,Junfeng Fang,Chengwei Liu,Yifan Zhang,Qiankun Li,Chongye Guo,Yalan Qin,Yi Ding,Donghai Hong,Jiaming Ji,Xinfeng Li,Yifan Jiang,Dongxia Wang,Yihao Huang,Yufei Guo,Jen-tse Huang,Yanwei Yue,Wenke Huang,Guancheng Wan,Tianlin Li,Lei Bai,Jie Zhang,Qing Guo,Jingyi Wang,Tianlong Chen,Joey Tianyi Zhou,Xiaojun Jia,Weisong Sun,Cong Wu,Jing Chen,Xuming Hu,Yiming Li,Xiao Wang,Ningyu Zhang,Luu Anh Tuan,Guowen Xu,Tianwei Zhang,Xingjun Ma,Xiang Wang,Bo An,Jun Sun,Mohit Bansal,Shirui Pan,Yuval Elovici,Bhavya Kailkhura,Bo Li,Yaodong Yang,Hongwei Li,Wenyuan Xu,Yizhou Sun,Wei Wang,Qing Li,Ke Tang,Yu-Gang Jiang,Felix Juefei-Xu,Hui Xiong,Xiaofeng Wang,Shuicheng Yan,Dacheng Tao,Philip S. Yu,Qingsong Wen,Yang Liu

Main category: cs.CR

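TL;DR: The paper introduces the concept of "full-stack" safety and systematically analyzes safety issues across the entire lifecycle of large language models (LLMs), filling a gap in existing surveys.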

Details Motivation: Existing research on LLM safety mostly focuses on specific stages and lacks a comprehensive view of the full lifecycle; this paper aims to fill that gap. Method: Defines the full LLM lifecycle (data preparation, pre-training, post-training, deployment, and commercialization) and proposes a safety framework based on a systematic analysis of more than 800 papers. Result: The survey offers a comprehensive safety perspective, extensive literature support, and distinctive insights, covering directions such as safety in data generation, alignment techniques, and model editing. Conclusion: The paper provides systematic guidance for research on full-lifecycle LLM safety and points out future research directions. Abstract: The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire "lifechain" of LLMs. To address this gap, this paper introduces, for the first time, the concept of "full-stack" safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off-the-shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of over 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter.
Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.

physics.med-ph

[115] Fluorescence Reference Target Quantitative Analysis Library

Eammon A. Littler,Emmanuel A. Mannoh,Ethan P. M. LaRochelle

Main category: physics.med-ph

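TL;DR: QUEL-QAL is an open-source Python library that aims to standardize performance evaluation of fluorescence imaging systems, supporting analysis of key metrics and improving transparency and reproducibility.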

Details Motivation: Fluorescence-guided surgery (FGS) lacks standardized tools for performance evaluation; existing methods are inconsistent and hard to access. Method: Develops the QUEL-QAL library, which provides a modular workflow including ROI detection, statistical analysis, and visualization. Result: The library supports key metrics such as response linearity, limit of detection, depth sensitivity, and spatial resolution, in line with regulatory and academic guidance. Conclusion: QUEL-QAL offers a foundational tool for standardized evaluation of fluorescence imaging systems, promoting transparency and faster development. Abstract: Standardized performance evaluation of fluorescence imaging systems remains a critical unmet need in the field of fluorescence-guided surgery (FGS). While the American Association of Physicists in Medicine (AAPM) TG311 report and recent FDA draft guidance provide recommended metrics for system characterization, practical tools for extracting these metrics remain limited, inconsistent, and often inaccessible. We present QUEL-QAL, an open-source Python library designed to streamline and standardize the quantitative analysis of fluorescence images using solid reference targets. The library provides a modular, reproducible workflow that includes region of interest (ROI) detection, statistical analysis, and visualization capabilities. QUEL-QAL supports key metrics such as response linearity, limit of detection, depth sensitivity, and spatial resolution, in alignment with regulatory and academic guidance. Built on widely adopted Python packages, the library is designed to be extensible, enabling users to adapt it to novel target designs and analysis protocols. By promoting transparency, reproducibility, and regulatory alignment, QUEL-QAL offers a foundational tool to support standardized benchmarking and accelerate the development and evaluation of fluorescence imaging systems.

cs.FL

[116] A New Graph Grammar Formalism for Robust Syntactic Pattern Recognition

Peter Fletcher

Main category: cs.FL

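TL;DR: The paper proposes a direct, declarative way to represent the syntax of recursively structured graph-like patterns. It avoids the production rules of conventional graph grammars, represents both grammar and pattern as networks, and treats parsing as the construction of a homomorphism from the pattern to the grammar.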

Details Motivation: Conventional graph grammars rely on production rules; this paper seeks a more direct and declarative representation of the syntax of recursively structured graph patterns, one that supports parallel parsing and recursive structure in more than one dimension. Method: Represents the grammar and the pattern as networks; parsing constructs a homomorphism from the pattern to the grammar, which supports parallel parsing and integrates feature detection, segmentation, parsing, and related steps. Result: The method tolerates errors and parses complex recursively structured patterns of 50-1000 symbols, handling variable geometric relationships, blurry symbols, overlapping symbols, cluttered images, and erased patches. Conclusion: The theoretical framework provides an efficient, parallel approach to parsing recursively structured graph patterns, and example runs illustrate its ability to handle complex patterns. Abstract: I introduce a formalism for representing the syntax of recursively structured graph-like patterns. It does not use production rules, like a conventional graph grammar, but represents the syntactic structure in a more direct and declarative way. The grammar and the pattern are both represented as networks, and parsing is seen as the construction of a homomorphism from the pattern to the grammar. The grammars can represent iterative, hierarchical and nested recursive structure in more than one dimension. This supports a highly parallel style of parsing, in which all aspects of pattern recognition (feature detection, segmentation, parsing, filling in missing symbols, top-down and bottom-up inference) are integrated into a single process, to exploit the synergy between them. The emphasis of this paper is on underlying theoretical issues, but I also give some example runs to illustrate the error-tolerant parsing of complex recursively structured patterns of 50-1000 symbols, involving variability in geometric relationships, blurry and indistinct symbols, overlapping symbols, cluttered images, and erased patches.

cs.IR

[117] Med-CoDE: Medical Critique based Disagreement Evaluation Framework

Mohit Gupta,Akiko Aizawa,Rajiv Ratn Shah

Main category: cs.IR

TL;DR: Proposes Med-CoDE, a framework for evaluating the reliability and accuracy of large language models in the medical domain.

Details Motivation: Current evaluation methods lack robustness in the medical domain and cannot comprehensively assess LLM performance, which may create clinical risk. Method: Uses a critique-based approach to quantify the disagreement between model-generated responses and established medical ground truths. Result: The framework evaluates accuracy and reliability together, addressing shortcomings of existing evaluation methods. Conclusion: Med-CoDE offers a comprehensive and reliable evaluation method for medical LLMs. Abstract: The emergence of large language models (LLMs) has significantly influenced numerous fields, including healthcare, by enhancing the capabilities of automated systems to process and generate human-like text. However, despite their advancements, the reliability and accuracy of LLMs in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance, leading to potential risks in clinical settings. In this work, we propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges. The framework leverages a critique-based approach to quantitatively measure the degree of disagreement between model-generated responses and established medical ground truths. This framework captures both accuracy and reliability in medical settings. The proposed evaluation framework aims to fill the existing gap in LLM assessment by offering a systematic method to evaluate the quality and trustworthiness of medical LLMs. Through extensive experiments and case studies, we illustrate the practicality of our framework in providing a comprehensive and reliable evaluation of medical LLMs.

[118] CiteFix: Enhancing RAG Accuracy Through Post-Processing Citation Correction

Harsh Maheshwari,Srikanth Tenneti,Alwarappan Nakkiran

Main category: cs.IR

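TL;DR: RAG combines retrieval with LLMs to improve information search, but citation accuracy is low (industry studies report about 74% for popular generative search engines). The paper proposes post-processing algorithms based on keyword + semantic matching, a fine-tuned model with BERTScore, and a lightweight LLM-based technique. They yield a 15.46% relative improvement in accuracy and may allow a switch to a smaller model that is cheaper and faster at inference.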

Details Motivation: LLMs in RAG systems cite sources inaccurately, which undermines the reliability and trustworthiness of AI-generated content. Method: Post-processes generated citations using keyword + semantic matching, a fine-tuned model with BERTScore, and a lightweight LLM-based technique. Result: Overall accuracy improves by a relative 15.46%, potentially enabling a shift to a smaller, more economical model (about 12x more cost-effective and 3x faster at inference). Conclusion: The work improves citation accuracy in RAG systems, lowers cost, and strengthens the reliability of AI-generated content, which matters for commercial products. Abstract: Retrieval Augmented Generation (RAG) has emerged as a powerful application of Large Language Models (LLMs), revolutionizing information search and consumption. RAG systems combine traditional search capabilities with LLMs to generate comprehensive answers to user queries, ideally with accurate citations. However, in our experience of developing a RAG product, LLMs often struggle with source attribution, aligning with other industry studies reporting citation accuracy rates of only about 74% for popular generative search engines. To address this, we present efficient post-processing algorithms to improve citation accuracy in LLM-generated responses, with minimal impact on latency and cost. Our approaches cross-check generated citations against retrieved articles using methods including keyword + semantic matching, a fine-tuned model with BERTScore, and a lightweight LLM-based technique. Our experimental results demonstrate a relative improvement of 15.46% in the overall accuracy metrics of our RAG system. This significant enhancement potentially enables a shift from our current larger language model to a relatively smaller model that is approximately 12x more cost-effective and 3x faster in inference time, while maintaining comparable performance. This research contributes to enhancing the reliability and trustworthiness of AI-generated content in information retrieval and summarization tasks, which is critical to gaining customer trust, especially in commercial products.
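The abstract describes cross-checking each generated citation against the retrieved articles. A minimal sketch of the keyword-matching part alone is given below; it omits the semantic matching, BERTScore, and LLM-based variants, and the names `keywords`, `overlap`, `fix_citation`, the stopword list, and the scoring rule are all assumptions for illustration rather than the authors' implementation.

```python
import re
from typing import Dict, Set

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "is", "was", "in", "of", "and", "to", "it"}


def keywords(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def overlap(sentence: str, document: str) -> float:
    """Fraction of the sentence's keywords that also occur in the document."""
    kw = keywords(sentence)
    return len(kw & keywords(document)) / len(kw) if kw else 0.0


def fix_citation(sentence: str, cited_id: str, sources: Dict[str, str]) -> str:
    """Keep the cited source unless another retrieved source matches strictly better."""
    scores = {doc_id: overlap(sentence, text) for doc_id, text in sources.items()}
    best_id = max(scores, key=scores.get)
    if cited_id in scores and scores[cited_id] >= scores[best_id]:
        return cited_id
    return best_id


sources = {
    "doc1": "The Eiffel Tower is located in Paris and was completed in 1889.",
    "doc2": "Mount Everest is the highest mountain above sea level.",
}
print(fix_citation("The Eiffel Tower was completed in 1889.", "doc2", sources))  # doc1
```

Preferring the original citation on ties keeps the correction conservative, so only clearly better-supported sources replace what the model produced.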

stat.ML

[119] How Private is Your Attention? Bridging Privacy with In-Context Learning

Soham Bonnerjee,Zhen Wei,Yeon,Anna Asch,Sagnik Nandy,Promit Ghosal

Main category: stat.ML

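TL;DR: The paper studies the feasibility of in-context learning (ICL) under formal privacy constraints, proposes a differentially private pretraining algorithm, and gives the first theoretical analysis of the privacy-accuracy trade-off for ICL in linear regression.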

Details Motivation: To explore whether ICL is feasible under privacy constraints, a question existing work leaves largely open. Method: Proposes a differentially private pretraining algorithm and analyzes the privacy-accuracy trade-off for ICL in linear regression. Result: Characterizes the fundamental tension between optimization and privacy-induced noise and shows that the method is robust to adversarial perturbations of training prompts. Conclusion: Theoretical analysis and simulations indicate that the method achieves effective ICL under privacy protection. Abstract: In-context learning (ICL), the ability of transformer-based models to perform new tasks from examples provided at inference time, has emerged as a hallmark of modern language models. While recent works have investigated the mechanisms underlying ICL, its feasibility under formal privacy constraints remains largely unexplored. In this paper, we propose a differentially private pretraining algorithm for linear attention heads and present the first theoretical analysis of the privacy-accuracy trade-off for ICL in linear regression. Our results characterize the fundamental tension between optimization and privacy-induced noise, formally capturing behaviors observed in private training via iterative methods. Additionally, we show that our method is robust to adversarial perturbations of training prompts, unlike standard ridge regression. All theoretical findings are supported by extensive simulations across diverse settings.

cs.RO

[120] SLAM-Based Navigation and Fault Resilience in a Surveillance Quadcopter with Embedded Vision Systems

Abhishek Tyagi,Charu Gaur

Main category: cs.RO

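TL;DR: Veg is an autonomous aerial surveillance platform that integrates visual SLAM, an advanced control architecture, and embedded vision modules, supporting GPS-independent navigation, dynamic stability, and real-time object and face recognition.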

Details Motivation: To design a fault-tolerant drone system for constrained environments that combines real-time localization, fault recovery, and embedded AI. Method: Uses a cascaded control design (LQR inner loop and PD outer loop), ORB-SLAM3 for 6-DoF localization, Dijkstra path planning, and an embedded vision system based on a lightweight CNN and PCA. Result: Validated in simulation and real-world testing, the platform achieves real-time localization, fault detection and recovery, and high-precision object and face recognition. Conclusion: The Veg platform integrates these technologies on a single system suited to autonomous surveillance in complex environments. Abstract: We present an autonomous aerial surveillance platform, Veg, designed as a fault-tolerant quadcopter system that integrates visual SLAM for GPS-independent navigation, advanced control architecture for dynamic stability, and embedded vision modules for real-time object and face recognition. The platform features a cascaded control design with an LQR inner-loop and PD outer-loop trajectory control. It leverages ORB-SLAM3 for 6-DoF localization and loop closure, and supports waypoint-based navigation through Dijkstra path planning over SLAM-derived maps. A real-time Failure Detection and Identification (FDI) system detects rotor faults and executes emergency landing through re-routing. The embedded vision system, based on a lightweight CNN and PCA, enables onboard object detection and face recognition with high precision. The drone operates fully onboard using a Raspberry Pi 4 and Arduino Nano, validated through simulations and real-world testing. This work consolidates real-time localization, fault recovery, and embedded AI on a single platform suitable for constrained environments.
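The abstract mentions waypoint navigation through Dijkstra path planning over SLAM-derived maps. As a minimal sketch of that planning step, the code below runs Dijkstra's algorithm on a small 2D occupancy grid; the grid representation (0 = free, 1 = obstacle), 4-connected moves, and unit step cost are assumptions, since the abstract does not describe the map format.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def dijkstra(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path on an occupancy grid, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    dist: Dict[Cell, int] = {start: 0}
    prev: Dict[Cell, Cell] = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale queue entry
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1  # unit cost per move
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(dijkstra(grid, (0, 0), (2, 0)))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```

Re-running the planner after marking newly blocked cells gives a simple form of re-routing, although the abstract does not say how the platform's emergency re-routing is implemented.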

[121] A Vision-Enabled Prosthetic Hand for Children with Upper Limb Disabilities

Md Abdul Baset Sarker,Art Nguyen,Sigmond Kukla,Kevin Fite,Masudul H. Imtiaz

Main category: cs.RO

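TL;DR: The paper introduces a novel AI vision-enabled pediatric prosthetic hand designed for children aged 10-12 with upper limb disabilities, featuring low cost, light weight, and an anthropomorphic design.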

Details Motivation: Current myoelectric prostheses are expensive and functionally limited; the goal is an affordable solution for low-income families. Method: Combines 3D printing, machine vision, sensing, and embedded computing, using a low-power FPGA and deep learning models for real-time object detection and precise grasping. Result: The object detection and grasp classification models reach accuracies of 96% and 100% respectively, and force prediction has a mean absolute error of 0.018. Conclusion: Through AI vision and low-power design, the prosthetic hand delivers high performance within tight resource limits and offers a new solution for pediatric prosthetics. Abstract: This paper introduces a novel AI vision-enabled pediatric prosthetic hand designed to assist children aged 10-12 with upper limb disabilities. The prosthesis features an anthropomorphic appearance, multi-articulating functionality, and a lightweight design that mimics a natural hand, making it both accessible and affordable for low-income families. Using 3D printing technology and integrating advanced machine vision, sensing, and embedded computing, the prosthetic hand offers a low-cost, customizable solution that addresses the limitations of current myoelectric prostheses. A micro camera is interfaced with a low-power FPGA for real-time object detection and assists with precise grasping. The onboard DL-based object detection and grasp classification models achieved accuracies of 96% and 100% respectively. For force prediction, the mean absolute error was found to be 0.018. The features of the proposed prosthetic hand can thus be summarized as: a) a wrist-mounted micro camera for artificial sensing, enabling a wide range of hand-based tasks; b) real-time object detection and distance estimation for precise grasping; and c) ultra-low-power operation that delivers high performance within constrained power and resource limits.

[122] RaSCL: Radar to Satellite Crossview Localization

Blerim Abdullai,Tony Wang,Xinyuan Qiao,Florian Shkurti,Timothy D. Barfoot

Main category: cs.RO

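TL;DR: Proposes a GNSS-free global localization method that registers ground radar with overhead RGB imagery and jointly optimizes odometry and global poses for efficient localization.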

Details Motivation: GNSS is unreliable, inaccurate, and insufficient in many real-time autonomous applications, so an alternative global localization approach is needed. Method: Registers ground radar with overhead RGB imagery, combines this with odometry and global pose optimization, and extracts key features from the RGB imagery for localization. Result: The method's effectiveness is validated across diverse geographic conditions and robotic platforms (e.g., an unmanned surface vessel and urban and suburban driving datasets). Conclusion: The method provides an efficient GNSS-free global localization solution suitable for diverse scenarios. Abstract: GNSS is unreliable, inaccurate, and insufficient in many real-time autonomous field applications. In this work, we present a GNSS-free global localization solution that contains a method of registering imaging radar on the ground with overhead RGB imagery, with joint optimization of relative poses from odometry and global poses from our overhead registration. Previous works have used various combinations of ground sensors and overhead imagery, and different feature extraction and matching methods. These include various handcrafted and deep-learning-based methods for extracting features from overhead imagery. Our work presents insights on extracting essential features from RGB overhead images for effective global localization against overhead imagery using only ground radar and a single georeferenced initial guess. We motivate our method by evaluating it on datasets in diverse geographic conditions and robotic platforms, including on an Unmanned Surface Vessel (USV) as well as urban and suburban driving datasets.

[123] Visual Place Cell Encoding: A Computational Model for Spatial Representation and Cognitive Mapping

Chance J. Hamilton,Alfredo Weitzenfeld

Main category: cs.RO

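TL;DR: The VPCE model simulates place cell activation from visual input, using visual landmarks for spatial encoding; experiments show it can distinguish visually similar but spatially distinct locations and adapt to environment changes.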

Details Motivation: To explore the central role of visual landmarks in spatial encoding and test whether visual input alone can generate spatial representations resembling biological place cells. Method: Images are captured by a robot-mounted camera; high-dimensional appearance features are extracted and clustered; activations are computed with a radial basis function and evaluated for correlation with properties of biological place cells. Result: VPCE distinguishes visually similar locations that are spatially distinct, adapts to environment changes (such as adding or removing walls), and produces activation patterns resembling biological place cells. Conclusion: Structured visual input is sufficient to generate place-cell-like spatial representations and support biologically inspired cognitive mapping, without motion cues or reward-driven learning. Abstract: This paper presents the Visual Place Cell Encoding (VPCE) model, a biologically inspired computational framework for simulating place cell-like activation using visual input. Drawing on evidence that visual landmarks play a central role in spatial encoding, the proposed VPCE model activates visual place cells by clustering high-dimensional appearance features extracted from images captured by a robot-mounted camera. Each cluster center defines a receptive field, and activation is computed based on visual similarity using a radial basis function. We evaluate whether the resulting activation patterns correlate with key properties of biological place cells, including spatial proximity, orientation alignment, and boundary differentiation. Experiments demonstrate that the VPCE can distinguish between visually similar yet spatially distinct locations and adapt to environment changes such as the insertion or removal of walls. These results suggest that structured visual input, even in the absence of motion cues or reward-driven learning, is sufficient to generate place-cell-like spatial representations and support biologically inspired cognitive mapping.
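
The abstract states that each cluster center defines a receptive field and that activation follows from visual similarity through a radial basis function. A minimal sketch of that step is given below. It assumes a Gaussian RBF with a shared width `sigma` and tiny 2D feature vectors; the kernel form, the width, and the name `place_cell_activations` are assumptions made for illustration, since the abstract does not specify them.

```python
import math

def place_cell_activations(feature, centers, sigma=1.0):
    """Gaussian RBF activation of each place cell for one feature vector.

    feature: appearance feature of the current image (list of floats).
    centers: one cluster center per place cell (list of lists of floats).
    sigma:   receptive-field width, shared by all cells in this sketch.
    """
    activations = []
    for center in centers:
        sq_dist = sum((f - c) ** 2 for f, c in zip(feature, center))
        activations.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return activations

# Hypothetical cluster centers in a 2D feature space.
centers = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
acts = place_cell_activations([1.0, 1.0], centers)
```

The cell whose center coincides with the feature fires at 1.0, and activation decays with squared feature distance, which is the behavior a receptive field needs.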

[124] ForesightNav: Learning Scene Imagination for Efficient Exploration

Hardik Shah,Jiaxu Xing,Nico Messikommer,Boyang Sun,Marc Pollefeys,Davide Scaramuzza

Main category: cs.RO

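TL;DR: ForesightNav is an exploration strategy inspired by human imagination and reasoning; by predicting contextual information (such as occupancy and semantic details) for unexplored regions, it significantly improves a robot's exploration efficiency in unknown environments.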

Details Motivation: To study how humans use prior knowledge to navigate unseen environments, in order to develop autonomous robots with similar abilities. Method: Proposes ForesightNav, which equips robots to predict contextual information for unexplored regions so that they can select meaningful long-term navigation goals. Result: Validated on the Structured3D dataset, showing accurate occupancy prediction, superior performance in predicting scene geometry, and significantly improved exploration efficiency. Conclusion: Imagination-driven reasoning enhances the generalization and exploration efficiency of autonomous systems. Abstract: Understanding how humans leverage prior knowledge to navigate unseen environments while making exploratory decisions is essential for developing autonomous robots with similar abilities. In this work, we propose ForesightNav, a novel exploration strategy inspired by human imagination and reasoning. Our approach equips robotic agents with the capability to predict contextual information, such as occupancy and semantic details, for unexplored regions. These predictions enable the robot to efficiently select meaningful long-term navigation goals, significantly enhancing exploration in unseen environments. We validate our imagination-based approach using the Structured3D dataset, demonstrating accurate occupancy prediction and superior performance in anticipating unseen scene geometry. Our experiments show that the imagination module improves exploration efficiency in unseen environments, achieving a 100% completion rate for PointNav and an SPL of 67% for ObjectNav on the Structured3D Validation split. These contributions demonstrate the power of imagination-driven reasoning for autonomous systems to enhance generalizable and efficient exploration.
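
The abstract reports SPL without defining it. The sketch below uses the definition of Success weighted by Path Length that is common in the embodied-navigation literature, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path actually taken. Whether the paper uses exactly this variant is an assumption, and the episode data are made up.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_length, taken_length) triples,
    with positive path lengths (an assumption of this sketch).
    """
    episodes = list(episodes)
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical episodes, not results from the paper.
episodes = [
    (True, 10.0, 10.0),   # optimal path: contributes 1.0
    (True, 10.0, 20.0),   # twice as long as needed: contributes 0.5
    (False, 10.0, 5.0),   # failure: contributes 0.0
]
score = spl(episodes)
```

Because failures contribute zero and detours are penalized proportionally, SPL can sit well below the raw completion rate.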

eess.IV [Back]

[125] Enhancing DR Classification with Swin Transformer and Shifted Window Attention

Meher Boulaabi,Takwa Ben Aïcha Gader,Afef Kacem Echi,Zied Bouraoui

Main category: eess.IV

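TL;DR: Proposes a diabetic retinopathy classification method combining image preprocessing with the Swin Transformer, reaching accuracies of 89.65% and 97.40% on the Aptos and IDRiD datasets, respectively.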

Details Motivation: Diabetic retinopathy (DR) is a leading cause of blindness worldwide and early detection is critical for treatment, but automated classification is challenged by variations in image quality, class imbalance, and pixel-level similarities. Method: A preprocessing pipeline of image cropping, CLAHE enhancement, and targeted data augmentation, combined with the Swin Transformer's hierarchical token processing and shifted window attention. Result: Accuracies of 89.65% and 97.40% on the Aptos and IDRiD datasets, respectively, with particularly strong performance in early-stage DR detection. Conclusion: The method shows potential for automated retinal screening in clinical settings, especially for early-stage DR detection. Abstract: Diabetic retinopathy (DR) is a leading cause of blindness worldwide, underscoring the importance of early detection for effective treatment. However, automated DR classification remains challenging due to variations in image quality, class imbalance, and pixel-level similarities that hinder model training. To address these issues, we propose a robust preprocessing pipeline incorporating image cropping, Contrast-Limited Adaptive Histogram Equalization (CLAHE), and targeted data augmentation to improve model generalization and resilience. Our approach leverages the Swin Transformer, which utilizes hierarchical token processing and shifted window attention to efficiently capture fine-grained features while maintaining linear computational complexity. We validate our method on the Aptos and IDRiD datasets for multi-class DR classification, achieving accuracy rates of 89.65% and 97.40%, respectively. These results demonstrate the effectiveness of our model, particularly in detecting early-stage DR, highlighting its potential for improving automated retinal screening in clinical settings.

[126] Split-quaternions for perceptual white balance

Michel Berthier,Nicoletta Prencipe,Edoardo Provenzi

Main category: eess.IV

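TL;DR: Proposes a perceptual chromatic adaptation transform for white balance based on split-quaternions and compares it quantitatively with a conventional method.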

Details Motivation: Inspired by a quantum-like model of color perception, the work explores the link between the model's algebraic structures and the split-quaternions. Method: Implements a chromatic adaptation transform using split-quaternion multiplication. Result: Shows the potential of the approach for color image processing and compares it with the von Kries method. Conclusion: The split-quaternion approach has application potential for chromatic adaptation transforms. Abstract: We propose a perceptual chromatic adaptation transform for white balance that makes use of split-quaternions. The novelty of the present work, which is motivated by a recently developed quantum-like model of color perception, consists in stressing the link between the algebraic structures appearing in this model and a certain sub-algebra of the split-quaternions. We show the potential of this approach for color image processing applications by proposing a chromatic adaptation transform, implemented via an appropriate use of the split-quaternion multiplication. Moreover, quantitative comparisons with the widely used state-of-the-art von Kries chromatic adaptation transform are provided.
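
The abstract says the transform is implemented through split-quaternion multiplication but does not spell out the product. The sketch below shows only that algebraic building block, under the usual convention i² = −1, j² = k² = +1, ij = k = −ji, jk = −i, ki = j. How colors are mapped to split-quaternions, and which sub-algebra the authors use, is not reproduced here; the function names are illustrative.

```python
def split_quaternion_mul(p, q):
    """Product of split-quaternions given as (w, x, y, z) = w + x*i + y*j + z*k.

    Convention assumed here: i*i = -1, j*j = k*k = +1, i*j = k, j*k = -i, k*i = j.
    """
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 + c1 * c2 + d1 * d2,
        a1 * b2 + b1 * a2 - c1 * d2 + d1 * c2,
        a1 * c2 + c1 * a2 - b1 * d2 + d1 * b2,
        a1 * d2 + d1 * a2 + b1 * c2 - c1 * b2,
    )

def split_quaternion_norm(q):
    """Indefinite quadratic form w^2 + x^2 - y^2 - z^2 (multiplicative)."""
    w, x, y, z = q
    return w * w + x * x - y * y - z * z

UNIT_I = (0, 1, 0, 0)
UNIT_J = (0, 0, 1, 0)
UNIT_K = (0, 0, 0, 1)

p = (1, 2, 3, 4)
q = (2, -1, 0, 5)
pq = split_quaternion_mul(p, q)
```

Unlike ordinary quaternions, the quadratic form is indefinite, so nonzero elements can have zero or negative norm; the form is still multiplicative, which the example checks with integer inputs.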

[127] VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining

Zizhi Chen,Xinyu Zhang,Minghao Han,Yizhou Liu,Ziyun Qian,Weifeng Zhang,Xukun Zhang,Jingwei Wei,Lihua Zhang

Main category: eess.IV

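TL;DR: The paper proposes a virtual staining method based on a pathological vision-language large model (VLM), combined with contrastive learnable prompts and concept anchors, to address the neglect of pathological knowledge and physical staining properties in conventional virtual staining.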

Details Motivation: Conventional virtual staining methods achieve only style transfer, overlooking the fundamental visual characteristics of tissue sections and the physical properties of staining agents, which leads to unsatisfactory results. Method: Introduces a pathological VLM as an auxiliary tool, combining contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors, and develops a data augmentation method based on VLM constraints. Result: On public datasets, the generated images are highly realistic and improve the accuracy of downstream tasks such as glomerular detection and segmentation. Conclusion: By integrating pathological knowledge and physical properties, the method significantly improves virtual staining quality and downstream task performance. Abstract: In histopathology, tissue sections are typically stained using common H&E staining or special stains (MAS, PAS, PASM, etc.) to clearly visualize specific tissue structures. The rapid advancement of deep learning offers an effective solution for generating virtually stained images, significantly reducing the time and labor costs associated with traditional histochemical staining. However, a new challenge arises in separating the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents. Additionally, virtual staining often overlooks essential pathological knowledge and the physical properties of staining, resulting in only style-level transfer. To address these issues, we introduce, for the first time in virtual staining tasks, a pathological vision-language large model (VLM) as an auxiliary tool. We integrate contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors to leverage the extensive knowledge of the pathological VLM. This approach is designed to describe, frame, and enhance the direction of virtual staining. Furthermore, we have developed a data augmentation method based on the constraints of the VLM. This method utilizes the VLM's powerful image interpretation capabilities to further integrate image style and structural information, proving beneficial in high-precision pathological diagnostics. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate that our method can generate highly realistic images and enhance the accuracy of downstream tasks, such as glomerular detection and segmentation.
Our code is available at: https://github.com/CZZZZZZZZZZZZZZZZZ/VPGAN-HARBOR

[128] RepNet-VSR: Reparameterizable Architecture for High-Fidelity Video Super-Resolution

Biao Wu,Diankai Zhang,Shaoli Liu,Si Gao,Chengjian Zheng,Ning Wang

Main category: eess.IV

TL;DR: Proposes RepNet-VSR, a high-fidelity video super-resolution method aimed at the computational efficiency problem in real-time 4x video super-resolution.

Details Motivation: Video super-resolution is computationally intensive, which makes deployment on resource-constrained edge devices challenging, especially for real-time mobile video processing. Method: Uses a reparameterizable architecture (RepNet-VSR) to optimize computational efficiency. Result: On the REDS validation set, reaches 27.79 dB PSNR when processing 180p to 720p frames, at 103 ms per 10 frames. Conclusion: RepNet-VSR achieves an excellent balance between restoration quality and deployment efficiency and outperforms the previous champion algorithm. Abstract: As a fundamental challenge in visual computing, video super-resolution (VSR) focuses on reconstructing high-definition video sequences from their degraded low-resolution counterparts. While deep convolutional neural networks have demonstrated state-of-the-art performance in spatial-temporal super-resolution tasks, their computationally intensive nature poses significant deployment challenges for resource-constrained edge devices, particularly in real-time mobile video processing scenarios where power efficiency and latency constraints coexist. In this work, we propose a Reparameterizable Architecture for High Fidelity Video Super Resolution method, named RepNet-VSR, for real-time 4x video super-resolution. On the REDS validation set, the proposed model achieves 27.79 dB PSNR when processing 180p to 720p frames in 103 ms per 10 frames on a MediaTek Dimensity NPU. The competition results demonstrate an excellent balance between restoration quality and deployment efficiency. The proposed method scores higher than the previous champion algorithm of the MAI video super-resolution challenge.

[129] Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg

Jingchen Zou,Jianqiang Li,Gabriel Jimenez,Qing Zhao,Daniel Racoceanu,Matias Cosarinsky,Enzo Ferrante,Guanghui Fu

Main category: eess.IV

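TL;DR: Proposes SPE, a framework for evaluating medical image segmentation models without annotated data; it supports multiple metrics and model architectures, and experiments show it is efficient and reliable.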

Details Motivation: In real-world settings such as clinics, annotating all data is impractical, which makes model performance hard to evaluate. Method: Develops the Segmentation Performance Evaluator (SPE) framework, which supports multiple evaluation metrics and model architectures. Result: Validated on six public datasets, SPE correlates highly with the real Dice score (0.956±0.046) with low MAE (0.025±0.019). Conclusion: SPE reliably estimates model performance without annotations, facilitating real-world use of medical image segmentation algorithms. Abstract: The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose the Segmentation Performance Evaluator (SPE), a framework for estimating segmentation models' performance on unlabeled data. This framework is adaptable to various evaluation metrics and model architectures. Experiments on six publicly available datasets across six evaluation metrics, including pixel-based metrics such as Dice score and distance-based metrics like HD95, demonstrated the versatility and effectiveness of our approach, achieving a high correlation (0.956$\pm$0.046) and low MAE (0.025$\pm$0.019) compared with the real Dice score on the independent test set. These results highlight its ability to reliably estimate model performance without requiring annotations. The SPE framework integrates seamlessly into any model training process without adding training overhead, enabling performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available.
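
The abstract judges the estimator by how well its estimated Dice scores track the real ones, using correlation and MAE. The sketch below shows how those three quantities can be computed; it does not reproduce how SPE produces its estimates. Pearson correlation is assumed (the abstract says only "correlation"), masks are flat lists of 0/1, and all numbers are invented for illustration.

```python
import math

def dice_score(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total

def mean_absolute_error(estimated, real):
    """Average absolute gap between estimated and real scores."""
    return sum(abs(e - r) for e, r in zip(estimated, real)) / len(real)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-case scores: estimated without labels vs. measured with labels.
estimated_dice = [0.80, 0.60, 0.90]
real_dice = [0.82, 0.55, 0.90]
mae = mean_absolute_error(estimated_dice, real_dice)
corr = pearson(estimated_dice, real_dice)
example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

High correlation with low MAE means the estimate both ranks cases correctly and lands close to the true value, which is why the abstract reports the two together.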

cs.HC [Back]

[130] Recent Advances and Future Directions in Extended Reality (XR): Exploring AI-Powered Spatial Intelligence

Baichuan Zeng

Main category: cs.HC

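TL;DR: This paper reviews the development of extended reality (XR) technology, covering hardware, software, and recent products, emphasizes the importance of spatial intelligence, and outlines future directions driven by AI and IoT.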

Details Motivation: To examine the evolution of XR technology and its role as a bridge between the physical and virtual worlds, and to analyze its future potential. Method: Analyzes XR hardware, software frameworks, and the performance of existing products, and proposes future directions in light of spatial intelligence requirements. Result: XR needs to integrate multi-modal AI and IoT-driven digital twins to enable adaptive systems and improve user experience. Conclusion: AI will be key to XR's development, making it the next frontier in human-computer interaction. Abstract: Extended Reality (XR), encompassing Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR), is a transformative technology that bridges the physical and virtual worlds, with diverse potential that is expected to become ubiquitous. This review examines XR's evolution through its foundational framework (hardware ranging from monitors to sensors, and software ranging from visual tasks to user interfaces); highlights state-of-the-art (SOTA) XR products, comparing and analyzing their performance based on this framework; and discusses how commercial XR devices can meet the demand for high-quality performance, focusing on spatial intelligence. For future directions, attention should be given to the integration of multi-modal AI and IoT-driven digital twins to enable adaptive XR systems. With the concept of spatial intelligence, future XR should establish a new digital space with realistic experience that benefits humanity. This review underscores the pivotal role of AI in unlocking XR as the next frontier in human-computer interaction.

cs.LG [Back]

[131] HyperFlow: Gradient-Free Emulation of Few-Shot Fine-Tuning

Donggyun Kim,Chanwoo Kim,Seunghoon Hong

Main category: cs.LG

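TL;DR: Proposes a gradient-free test-time adaptation method that emulates gradient descent for efficient adaptation, substantially reducing computation and memory costs.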

Details Motivation: To address the high cost of test-time fine-tuning in real-time or low-resource scenarios, caused by multiple backpropagation steps. Method: Formulates gradient descent as an Euler discretization of an ODE and trains an auxiliary network to predict the task-conditional drift, so that adaptation needs only a few forward passes. Result: On cross-domain few-shot classification, performance significantly exceeds the baseline, with only 0.02% of the computation time and 6% of the memory cost of standard fine-tuning. Conclusion: The method offers a practical middle ground between direct transfer and full fine-tuning. Abstract: While test-time fine-tuning is beneficial in few-shot learning, the need for multiple backpropagation steps can be prohibitively expensive in real-time or low-resource scenarios. To address this limitation, we propose an approach that emulates gradient descent without computing gradients, enabling efficient test-time adaptation. Specifically, we formulate gradient descent as an Euler discretization of an ordinary differential equation (ODE) and train an auxiliary network to predict the task-conditional drift using only the few-shot support set. The adaptation then reduces to a simple numerical integration (e.g., via the Euler method), which requires only a few forward passes of the auxiliary network; no gradients or forward passes of the target model are needed. In experiments on cross-domain few-shot classification using the Meta-Dataset and CDFSL benchmarks, our method significantly improves out-of-domain performance over the non-fine-tuned baseline while incurring only 6% of the memory cost and 0.02% of the computation time of standard fine-tuning, thus establishing a practical middle ground between direct transfer and fully fine-tuned approaches.
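
The central observation in the abstract is that a gradient descent step, θ ← θ − η∇L(θ), is an explicit Euler step of the ODE dθ/dt = −∇L(θ) with step size η, so a function that predicts the drift can stand in for the gradient. The sketch below illustrates this on a toy quadratic loss. The `drift` function here is simply the analytical negative gradient, used as a stand-in for the learned auxiliary network, which is not reproduced; all names and numbers are illustrative assumptions.

```python
def euler_integrate(theta, drift, step_size, num_steps):
    """Integrate d(theta)/dt = drift(theta) with the explicit Euler method."""
    for _ in range(num_steps):
        theta = [t + step_size * d for t, d in zip(theta, drift(theta))]
    return theta

def gradient_descent(theta, grad, learning_rate, num_steps):
    """Plain gradient descent, for comparison with the Euler integration."""
    for _ in range(num_steps):
        theta = [t - learning_rate * g for t, g in zip(theta, grad(theta))]
    return theta

# Toy loss L(theta) = 0.5 * ||theta - target||^2, so grad L = theta - target.
target = [1.0, -2.0]

def grad(theta):
    return [t - y for t, y in zip(theta, target)]

def drift(theta):
    # Stand-in for the auxiliary network: here it returns the exact negative gradient.
    return [-g for g in grad(theta)]

theta_gd = gradient_descent([0.0, 0.0], grad, learning_rate=0.1, num_steps=50)
theta_euler = euler_integrate([0.0, 0.0], drift, step_size=0.1, num_steps=50)
```

With an exact drift the two trajectories coincide; the savings described in the abstract come from replacing `drift` with a cheap learned predictor so that no backpropagation through the target model is needed.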

[132] Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks

Jeremy Goldwasser,Giles Hooker

Main category: cs.LG

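TL;DR: The paper proposes a new framework for generating counterfactual images that avoids the adversarial examples produced by conventional gradient-based methods and pairs counterfactuals with feature attributions for explanation.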

Details Motivation: To address the difficulty of generating counterfactual explanations for computer vision models, in particular avoiding adversarial examples. Method: Proposes Counterfactual Attacks, which generates counterfactual images by attacking the image's representation along a low-dimensional manifold, and uses feature attributions from an auxiliary dataset to quantify the changes. Result: Effectiveness is demonstrated on the MNIST and CelebA datasets, along with computational efficiency. Conclusion: The method offers a flexible and efficient solution for counterfactual explanation of vision models. Abstract: Counterfactuals are a popular framework for interpreting machine learning predictions. These "what if" explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to produce adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it has particular computational efficiency for ours. We demonstrate the efficacy of our approach with the MNIST and CelebA datasets.

[133] Bayesian Autoencoder for Medical Anomaly Detection: Uncertainty-Aware Approach for Brain 2 MRI Analysis

Dip Roy

Main category: cs.LG

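TL;DR: This paper proposes a model based on a Bayesian variational autoencoder (VAE) with multi-head attention for anomaly detection in brain MRI, improving performance by estimating uncertainty.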

Details Motivation: Anomaly detection in medical imaging is vital for neurological conditions, but conventional deterministic methods struggle to capture the uncertainty inherent in the task. Method: A Bayesian variational autoencoder (VAE) with multi-head attention, estimating epistemic and aleatoric uncertainty through Bayesian inference. Result: On the BraTS2020 dataset, both ROC AUC and PR AUC are 0.83. Conclusion: Modeling uncertainty is key to anomaly detection; it improves performance and interpretability and gives clinicians confidence estimates for decision-making. Abstract: In medical imaging, anomaly detection is a vital element of healthcare diagnostics, especially for neurological conditions which can be life-threatening. Conventional deterministic methods often fall short when it comes to capturing the inherent uncertainty of anomaly detection tasks. This paper introduces a Bayesian Variational Autoencoder (VAE) equipped with multi-head attention mechanisms for detecting anomalies in brain magnetic resonance imaging (MRI). For the purpose of improving anomaly detection performance, we incorporate both epistemic and aleatoric uncertainty estimation through Bayesian inference. The model was tested on the BraTS2020 dataset, and the findings were a 0.83 ROC AUC and a 0.83 PR AUC. The data in our paper suggests that modeling uncertainty is an essential component of anomaly detection, enhancing both performance and interpretability and providing confidence estimates, as well as anomaly predictions, for clinicians to leverage in making medical decisions.

[134] Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification

Tatsuhito Hasegawa,Shunsuke Sakai

Main category: cs.LG

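TL;DR: The study finds that the softmax temperature parameter $T^*$ is related to feature dimensionality, proposes a training-free way to determine it, and validates it experimentally.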

Details Motivation: To examine how the temperature parameter $T^*$ affects classification and how to determine it, addressing fluctuations of $T^*$ in practice. Method: Proposes temperature determination coefficients based on feature dimensionality, inserts a batch normalization layer to stabilize the feature space, and refines the formula through experiments. Result: The derived temperature agrees with the theory, improves classification performance, and applies across a range of tasks. Conclusion: The study provides a training-free method for determining the temperature parameter, with both theoretical and practical value. Abstract: In deep learning-based classification tasks, the softmax function's temperature parameter $T$ critically influences the output distribution and overall performance. This study presents a novel theoretical insight that the optimal temperature $T^*$ is uniquely determined by the dimensionality of the feature representations, thereby enabling training-free determination of $T^*$. Despite this theoretical grounding, empirical evidence reveals that $T^*$ fluctuates under practical conditions owing to variations in models, datasets, and other confounding factors. To address these influences, we propose and optimize a set of temperature determination coefficients that specify how $T^*$ should be adjusted based on the theoretical relationship to feature dimensionality. Additionally, we insert a batch normalization layer immediately before the output layer, effectively stabilizing the feature space. Building on these coefficients and a suite of large-scale experiments, we develop an empirical formula to estimate $T^*$ without additional training while also introducing a corrective scheme to refine $T^*$ based on the number of classes and task complexity. Our findings confirm that the derived temperature not only aligns with the proposed theoretical perspective but also generalizes effectively across diverse tasks, consistently enhancing classification performance and offering a practical, training-free solution for determining $T^*$.
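
For readers unfamiliar with the temperature parameter, the sketch below shows the standard temperature-scaled softmax, $p_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$, and how $T$ reshapes the output distribution. It does not implement the paper's formula for $T^*$, which the abstract does not give; the logits and temperatures are arbitrary illustrative values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)  # low T: peaked
plain = softmax_with_temperature(logits, temperature=1.0)  # ordinary softmax
soft = softmax_with_temperature(logits, temperature=4.0)   # high T: flatter
```

Lowering $T$ concentrates probability on the largest logit and raising it flattens the distribution, which is why the choice of $T$ affects both training dynamics and final accuracy.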

[135] SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction

Kai Chen,Xiaodong Zhao,Yujie Huang,Guoyu Fang,Xiao Song,Ruiping Wang,Ziyuan Wang

Main category: cs.LG

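TL;DR: SocialMOIF proposes a multi-order intention fusion model for analyzing and predicting agent trajectories in intelligent systems; by combining direct and indirect intention information, it significantly improves prediction accuracy.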

Details Motivation: Current agent trajectory prediction is limited by high uncertainty and complex higher-order interactions, and needs a more comprehensive understanding of intentions. Method: Proposes a multi-order intention fusion model, designs a trajectory distribution approximator and a global trajectory optimizer, and introduces a loss function that accounts for distance and direction. Result: Experiments show that the model outperforms existing baselines on both dynamic and static datasets. Conclusion: By fusing multi-order intention information, SocialMOIF significantly improves the accuracy and interpretability of trajectory prediction. Abstract: The analysis and prediction of agent trajectories are crucial for decision-making processes in intelligent systems, with precise short-term trajectory forecasting being highly significant across a range of applications. Agents and their social interactions have been quantified and modeled by researchers from various perspectives; however, substantial limitations exist in the current work due to the inherent high uncertainty of agent intentions and the complex higher-order influences among neighboring groups. SocialMOIF is proposed to tackle these challenges, concentrating on the higher-order intention interactions among neighboring groups while reinforcing the primary role of first-order intention interactions between neighbors and the target agent. This method develops a multi-order intention fusion model to achieve a more comprehensive understanding of both direct and indirect intention information. Within SocialMOIF, a trajectory distribution approximator is designed to guide the trajectories toward values that align more closely with the actual data, thereby enhancing model interpretability. Furthermore, a global trajectory optimizer is introduced to enable more accurate and efficient parallel predictions. By incorporating a novel loss function that accounts for distance and direction during training, experimental results demonstrate that the model outperforms previous state-of-the-art baselines across multiple metrics in both dynamic and static datasets.
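
The abstract mentions a loss that accounts for both distance and direction but does not define it. The sketch below is one hypothetical way such a loss could be composed: the mean Euclidean displacement between predicted and true positions, plus a weighted penalty of one minus the cosine similarity between consecutive step vectors. The formula, the weight `direction_weight`, and the function name are assumptions for illustration and should not be read as the SocialMOIF loss.

```python
import math

def distance_direction_loss(pred, truth, direction_weight=1.0):
    """Hypothetical loss: mean displacement error + weighted heading mismatch.

    pred, truth: equally long lists of (x, y) positions at consecutive steps.
    """
    n = len(truth)
    # Distance term: average Euclidean gap between matching positions.
    distance = sum(
        math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)
    ) / n
    # Direction term: 1 - cos(angle) between predicted and true step vectors.
    mismatches = []
    for k in range(1, n):
        pvx, pvy = pred[k][0] - pred[k - 1][0], pred[k][1] - pred[k - 1][1]
        tvx, tvy = truth[k][0] - truth[k - 1][0], truth[k][1] - truth[k - 1][1]
        norm = math.hypot(pvx, pvy) * math.hypot(tvx, tvy)
        if norm > 0:  # skip steps where either agent stands still
            mismatches.append(1.0 - (pvx * tvx + pvy * tvy) / norm)
    direction = sum(mismatches) / len(mismatches) if mismatches else 0.0
    return distance + direction_weight * direction

truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]     # walking along +x
perfect = distance_direction_loss(truth, truth)
sideways = distance_direction_loss([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)], truth)
```

A separate direction term penalizes a prediction that heads the wrong way even when it is still close to the ground truth early in the horizon, which a pure displacement error would under-weight.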

[136] An XAI-based Analysis of Shortcut Learning in Neural Networks

Phuong Quynh Le,Jörg Schlötterer,Christin Seifert

Main category: cs.LG

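TL;DR: The paper proposes an XAI diagnostic measure (the neuron spurious score) to quantify a neuron's dependence on spurious features and analyzes how disentangled spurious features are in CNNs and ViTs.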

Details Motivation: Existing methods do not fully resolve models' dependence on spurious features; a systematic analysis of how neural networks encode spurious correlations is needed. Method: Introduces the neuron spurious score and analyzes CNNs and ViTs with architecture-specific methods. Result: Spurious features are partially disentangled, but the degree varies across architectures; the assumptions behind existing mitigation methods are incomplete. Conclusion: The work lays the groundwork for new methods to mitigate spurious correlations and make AI models safer. Abstract: Machine learning models tend to learn spurious features - features that strongly correlate with target labels but are not causal. Existing approaches to mitigate models' dependence on spurious features work in some cases, but fail in others. In this paper, we systematically analyze how and where neural networks encode spurious correlations. We introduce the neuron spurious score, an XAI-based diagnostic measure to quantify a neuron's dependence on spurious features. We analyze both convolutional neural networks (CNNs) and vision transformers (ViTs) using architecture-specific methods. Our results show that spurious features are partially disentangled, but the degree of disentanglement varies across model architectures. Furthermore, we find that the assumptions behind existing mitigation methods are incomplete. Our results lay the groundwork for the development of novel methods to mitigate spurious correlations and make AI models safer to use in practice.

econ.GN [Back]

[137] Real-Time Sentiment Insights from X Using VADER, DistilBERT, and Web-Scraped Data

Yanampally Abhiram Reddy,Siddhi Agarwal,Vikram Parashar,Arshiya Arora

Main category: econ.GN

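TL;DR: The paper proposes a real-time corporate reputation monitoring system combining NLP and machine learning; it analyzes social media data with a hybrid sentiment detection framework and reveals differences in public sentiment across companies.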

Details Motivation: In the age of social media, understanding public sentiment toward corporations is crucial for investors, policymakers, and researchers. Method: A hybrid sentiment detection framework (the rule-based VADER model and the DistilBERT deep learning model), combined with data preprocessing and ensemble classification. Result: Sentiment scores differ markedly across companies: Amazon (81.2) and Samsung (45.8) score well, while Microsoft (21.7) and Walmart (21.9) score poorly. Conclusion: The system gives stakeholders comprehensive sentiment analysis to support data-driven strategic decisions. Abstract: In the age of social media, understanding public sentiment toward major corporations is crucial for investors, policymakers, and researchers. This paper presents a comprehensive sentiment analysis system tailored for corporate reputation monitoring, combining Natural Language Processing (NLP) and machine learning techniques to accurately interpret public opinion in real time. The methodology integrates a hybrid sentiment detection framework leveraging both rule-based models (VADER) and transformer-based deep learning models (DistilBERT), applied to social media data from multiple platforms. The system begins with robust preprocessing involving noise removal and text normalization, followed by sentiment classification using an ensemble approach to ensure both interpretability and contextual accuracy. Results are visualized through sentiment distribution plots, comparative analyses, and temporal sentiment trends for enhanced interpretability. Our analysis reveals significant disparities in public sentiment across major corporations, with companies like Amazon (81.2) and Samsung (45.8) receiving excellent sentiment scores, while Microsoft (21.7) and Walmart (21.9) exhibit poor sentiment profiles. These findings demonstrate the utility of our multi-source sentiment framework in providing actionable insights regarding corporate public perception, enabling stakeholders to make informed strategic decisions based on comprehensive sentiment analysis.