Motivation: Existing diffusion-based methods are computationally expensive and require many sampling steps. Method: A diffusion distillation pipeline with dynamic control, with textual prompts added to strengthen generative capability. Result: The method is efficient across a range of tasks and datasets and produces high-quality images. Conclusion: InstaRevive delivers both efficiency and high-quality results in image enhancement. Abstract: Image enhancement finds wide-ranging applications in real-world scenarios due to complex environments and the inherent limitations of imaging devices. Recent diffusion-based methods yield promising outcomes but necessitate prolonged and computationally intensive iterative sampling. In response, we propose InstaRevive, a straightforward yet powerful image enhancement framework that employs score-based diffusion distillation to harness potent generative capability and minimize the sampling steps. To fully exploit the potential of the pre-trained diffusion model, we devise a practical and effective diffusion distillation pipeline using dynamic control to address inaccuracies in updating direction during score matching. Our control strategy enables a dynamic diffusing scope, facilitating precise learning of denoising trajectories within the diffusion model and ensuring accurate distribution matching gradients during training. Additionally, to enrich guidance for the generative power, we incorporate textual prompts via image captioning as auxiliary conditions, fostering further exploration of the diffusion model. Extensive experiments substantiate the efficacy of our framework across a diverse array of challenging tasks and datasets, unveiling the compelling efficacy and efficiency of InstaRevive in delivering high-quality and visually appealing results. Code is available at https://github.com/EternalEvan/InstaRevive.

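[16] Multi-Modal Fusion of In-Situ Video Data and Process Parameters for Online Forecasting of Cookie Drying Readiness

Shichen Li,Chenhui Shao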

Main category: cs.CV

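TL;DR: Proposes a multi-modal data fusion framework for real-time forecasting of food drying readiness that markedly improves prediction accuracy and efficiency.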

Details

Motivation: Real-time forecasting of food drying is critical for energy savings, productivity, and product quality, but existing methods struggle with the dynamic nature of drying and with limited data. Method: An end-to-end multi-modal data fusion framework that combines video data with process parameters, using an encoder-decoder architecture with a transformer-based decoder. Result: In sugar cookie drying experiments the model achieves an average prediction error of only 15 seconds, outperforming existing data fusion methods by 65.69%. Conclusion: The model performs well in accuracy, size, and efficiency and is applicable to a range of industrial multi-modal fusion tasks. Abstract: Food drying is essential for food production, extending shelf life, and reducing transportation costs. Accurate real-time forecasting of drying readiness is crucial for minimizing energy consumption, improving productivity, and ensuring product quality. However, this remains challenging due to the dynamic nature of drying, limited data availability, and the lack of effective predictive analytical methods. To address this gap, we propose an end-to-end multi-modal data fusion framework that integrates in-situ video data with process parameters for real-time food drying readiness forecasting. Our approach leverages a new encoder-decoder architecture with modality-specific encoders and a transformer-based decoder to effectively extract features while preserving the unique structure of each modality. We apply our approach to sugar cookie drying, where time-to-ready is predicted at each timestamp. Experimental results demonstrate that our model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%. Additionally, our model balances prediction accuracy, model size, and computational efficiency, making it well-suited for heterogeneous industrial datasets. The proposed model is extensible to various other industrial modality fusion tasks for online decision-making.

[17] SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking

Yunfeng Li,Bo Wang,Jiahao Wan,Xueyi Wu,Ye Li

Main category: cs.CV

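TL;DR: The paper proposes SonarT165, the first large-scale benchmark for underwater acoustic object tracking, and STFTrack, an efficient framework with a multi-view template fusion module and an optimal trajectory correction module that markedly improves performance.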

Details

Motivation: Underwater observation systems rely on sonar when visibility is insufficient, but the lack of a unified evaluation benchmark limits the practical value of existing methods. Method: Proposes the SonarT165 benchmark and the STFTrack framework, which comprises a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM) and adds an acoustic image enhancement method and a frequency enhancement module. Result: STFTrack achieves state-of-the-art performance on the SonarT165 benchmark. Conclusion: The SonarT165 benchmark and the STFTrack framework provide effective tools for underwater acoustic object tracking and address the limitations of existing methods. Abstract: Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.

[18] HS-Mamba: Full-Field Interaction Multi-Groups Mamba for Hyperspectral Image Classification

Hongxing Peng,Kang Lin,Huanai Liu

Main category: cs.CV

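TL;DR: The paper proposes HS-Mamba, a Mamba-based framework for hyperspectral image classification that combines local and global features and markedly improves classification accuracy.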

Details

Motivation: Hyperspectral image classification is a hot topic in remote sensing, but the high dimensionality and feature inlining of hyperspectral data make it challenging to apply the Mamba architecture. Method: HS-Mamba uses a dual-channel spatial-spectral encoder module and a lightweight global inline attention branch to combine local and global features. Result: On four benchmark datasets, HS-Mamba outperforms state-of-the-art methods. Conclusion: By fusing local and global features, HS-Mamba achieves high-precision hyperspectral image classification. Abstract: Hyperspectral image (HSI) classification has been one of the hot topics in remote sensing. Recently, the Mamba architecture based on selective state-space models (S6) has demonstrated great advantages in long sequence modeling. However, the unique properties of hyperspectral data, such as high dimensionality and feature inlining, pose challenges to the application of Mamba to HSI classification. To compensate for these shortcomings, we propose a full-field interaction multi-groups Mamba framework (HS-Mamba), which adopts a strategy that differs from both pixel-patch-based and whole-image-based approaches while combining the advantages of both. The patches cut from the whole image are sent to multi-groups Mamba, combined with positional information to perceive local inline features in the spatial and spectral domains, and the whole image is sent to a lightweight attention module to enhance the global feature representation ability. Specifically, HS-Mamba consists of a dual-channel spatial-spectral encoder (DCSS-encoder) module and a lightweight global inline attention (LGI-Att) branch. The DCSS-encoder module uses multiple groups of Mamba to decouple and model the local features of dual-channel sequences with non-overlapping patches. The LGI-Att branch uses a lightweight compressed and extended attention module to perceive the global features of the spatial and spectral domains of the unsegmented whole image. By fusing local and global features, high-precision classification of hyperspectral images is achieved. Extensive experiments demonstrate the superiority of the proposed HS-Mamba, outperforming state-of-the-art methods on four benchmark HSI datasets.

[106] A Python Tool for Reconstructing Full News Text from GDELT

A. Fronzetti Colladon,R. Vestrelli

Main category: cs.CL

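TL;DR: The paper proposes a low-cost method for obtaining full-text news from the GDELT dataset, addressing the access restrictions of existing news datasets.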

Details

Motivation: News data are essential to research across many disciplines, but existing datasets are costly or incomplete, which limits research. Method: Uses the GDELT Web News NGrams 3.0 dataset and Python code to reconstruct full-text news articles. Result: Enables low-cost access to structured, large-scale news data for text analysis. Conclusion: The method improves the accessibility of news data, supporting research in economic forecasting, computational social science, and natural language processing. Abstract: News data have become an essential resource across various disciplines, including economics, finance, management, social sciences, and computer science. Researchers leverage newspaper articles to study economic trends, market dynamics, corporate strategies, public perception, political discourse, and the evolution of public opinion. Additionally, news datasets have been instrumental in training large-scale language models, with applications in sentiment analysis, fake news detection, and automated news summarization. Despite their significance, access to comprehensive news corpora remains a key challenge. Many full-text news providers, such as Factiva and LexisNexis, require costly subscriptions, while free alternatives often suffer from incomplete data and transparency issues. This paper presents a novel approach to obtaining full-text newspaper articles at near-zero cost by leveraging data from the Global Database of Events, Language, and Tone (GDELT). Specifically, we focus on the GDELT Web News NGrams 3.0 dataset, which provides high-frequency updates of n-grams extracted from global online news sources. We provide Python code to reconstruct full-text articles from these n-grams by identifying overlapping textual fragments and intelligently merging them. Our method enables researchers to access structured, large-scale newspaper data for text analysis while overcoming the limitations of existing proprietary datasets. The proposed approach enhances the accessibility of news data for empirical research, facilitating applications in economic forecasting, computational social science, and natural language processing.
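
The idea of merging overlapping textual fragments can be illustrated with a minimal sketch. This is not the authors' tool: the function names `overlap_len` and `merge_fragments` are hypothetical, the fragments are plain strings rather than GDELT records, and the greedy word-overlap strategy is an assumption chosen for brevity.

```python
def overlap_len(left: str, right: str) -> int:
    """Number of words shared by the end of `left` and the start of `right`."""
    lw, rw = left.split(), right.split()
    for k in range(min(len(lw), len(rw)), 0, -1):
        if lw[-k:] == rw[:k]:
            return k
    return 0


def merge_fragments(fragments: list[str]) -> str:
    """Greedily join the pair of fragments with the largest word overlap (assumed strategy)."""
    pieces = [f for f in fragments if f.strip()]
    while len(pieces) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap_len(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the remaining pieces as they are
        merged = " ".join(pieces[best_i].split() + pieces[best_j].split()[best_k:])
        pieces = [p for idx, p in enumerate(pieces) if idx not in (best_i, best_j)]
        pieces.append(merged)
    return " ".join(pieces)


if __name__ == "__main__":
    print(merge_fragments(["the quick brown", "brown fox jumps", "jumps over the lazy dog"]))
```

A greedy merge like this can misjoin fragments when short overlaps are ambiguous (for example a single common word), so a real pipeline would likely need positional metadata or longer minimum overlaps.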

Zhiyuan Hu,Shiyun Xiong,Yifan Zhang,See-Kiong Ng,Anh Tuan Luu,Bo An,Shuicheng Yan,Bryan Hooi

Main category: cs.CL

TL;DR: Proposes guiding visual language model (VLM) agents with a reward model at inference time, markedly improving performance on GUI navigation tasks.

Details

Motivation: Current VLMs have limited ability to generate correct actions in complex GUI environments, and existing evaluation and refinement techniques suffer from delayed feedback and local optimization. Method: Applies process supervision to the VLM agent through a reward model at inference time, optimizing the action at each step, combined with trajectory reflection and retry mechanisms. Result: Single-step action accuracy improves by 3.4% in static environments, and task success rate improves by about 33% in a dynamic environment. Conclusion: The method effectively improves VLM performance on GUI tasks, especially in dynamic environments. Abstract: Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short due to delayed feedback and local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in three GUI navigation tasks, achieving a 3.4% improvement in single-step action accuracy for static environments, along with an approximately 33% increase in task success rate in one dynamic environment. With further integration of trajectory reflection and retry mechanisms, we also demonstrate even greater enhancement in task success.
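
Step-level guidance by a reward model can be sketched as choosing, at each step, the candidate action that the reward model scores highest. The abstract does not specify the selection rule, so the best-of-N choice below is an assumption, and `select_action` and `toy_reward` are hypothetical stand-ins for the VLM's candidate generator and a learned reward model.

```python
from typing import Callable, Sequence


def select_action(
    state: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> tuple[str, dict[str, float]]:
    """Score every candidate action for the current state and return the best one.

    `reward_fn(state, action)` stands in for a process reward model (assumption).
    """
    scores = {action: reward_fn(state, action) for action in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def toy_reward(state: str, action: str) -> float:
    """Toy scorer: prefer actions that mention the goal named in the state."""
    goal = state.split("goal=")[-1]
    return 1.0 if goal in action else 0.0


if __name__ == "__main__":
    state = "screen=home goal=settings"
    action, scores = select_action(state, ["tap(camera)", "tap(settings)", "scroll(down)"], toy_reward)
    print(action, scores)
```

Scoring each step separately gives feedback before a trajectory finishes, which is the contrast the abstract draws with trajectory-level evaluation.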

[108] PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu,Shaoyang Guo,Zhuo-Yang Song,Yunbo Sun,Zeyu Cai,Jiashen Wei,Tianyu Luo,Yixuan Yin,Haoxu Zhang,Yi Hu,Chenyang Wang,Chencheng Tang,Haoling Chang,Qi Liu,Ziheng Zhou,Tianyu Zhang,Jingtian Zhang,Zhangyi Liu,Minghao Li,Yuku Zhang,Boxuan Jing,Xianqi Yin,Yutong Ren,Zizhuo Fu,Weike Wang,Xudong Tian,Anqi Lv,Laifu Man,Jianxiang Li,Feiyu Tao,Qihua Sun,Zhou Liang,Yushu Mu,Zhongxuan Li,Jing-Jun Zhang,Shutao Zhang,Xiaotian Li,Xingqi Xia,Jiawei Lin,Zheyu Shen,Jiahang Chen,Qiuhao Xiong,Binran Wang,Fengyuan Wang,Ziyang Ni,Bohan Zhang,Fan Cui,Changkun Shao,Qing-Hong Cao,Ming-xing Luo,Muhan Zhang,Hua Xing Zhu

Main category: cs.CL

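TL;DR: PHYBench is a high-quality benchmark for evaluating the reasoning ability of large language models in physical contexts. It contains 500 carefully curated physics problems and introduces a new evaluation metric, the EED Score. Tests show that current models still lag significantly behind human experts in complex physical reasoning.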

Details

Motivation: To evaluate the reasoning ability of large language models in realistic physical scenarios, expose their limitations, and drive improvement. Method: Builds the PHYBench benchmark of 500 physics problems and proposes the EED Score, an evaluation metric based on the edit distance between mathematical expressions. Result: Current state-of-the-art reasoning models lag significantly behind human experts in complex physical reasoning. Conclusion: PHYBench and the EED Score provide effective tools for evaluating and improving the physical reasoning ability of large language models. Abstract: We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.
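
To give a feel for why an edit-distance metric is more informative than binary scoring, here is a simplified sketch. It is not the paper's EED Score: the abstract does not give the definition, so this version assumes a token-level Levenshtein distance over expression strings and a linear normalization, and the names `tokenize`, `edit_distance`, and `similarity_score` are hypothetical.

```python
import re


def tokenize(expr: str) -> list[str]:
    """Split an expression string into identifiers, numbers, and operators."""
    return re.findall(r"[A-Za-z_]\w*|\d+\.?\d*|\*\*|[-+*/^()=]", expr)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Classic Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity_score(pred: str, ref: str) -> float:
    """Map the distance to [0, 1]; 1.0 means identical token sequences (assumed normalization)."""
    p, r = tokenize(pred), tokenize(ref)
    if not p and not r:
        return 1.0
    return 1.0 - edit_distance(p, r) / max(len(p), len(r))


if __name__ == "__main__":
    print(similarity_score("2*m*g*h", "m*g*h"))
```

Under this sketch an answer that is off by a constant factor receives partial credit instead of zero. A token-level comparison is sensitive to term ordering, which is one reason a metric defined on the structure of mathematical expressions would behave differently.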

[109] TTRL: Test-Time Reinforcement Learning

Yuxin Zuo,Kaiyan Zhang,Shang Qu,Li Sheng,Xuekai Zhu,Biqing Qi,Youbang Sun,Ganqu Cui,Ning Ding,Bowen Zhou

Main category: cs.CL

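TL;DR: This paper proposes TTRL, a new method that trains large language models with reinforcement learning on unlabeled data and markedly improves model performance.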

Details

Motivation: To study how reinforcement learning can improve large language models on reasoning tasks when explicit labels are unavailable. Method: Proposes Test-Time Reinforcement Learning (TTRL), which uses the priors of the pre-trained model and generates reward signals through test-time scaling practices such as majority voting. Result: TTRL markedly improves model performance; for example, the pass@1 performance of Qwen-2.5-Math-7B on AIME 2024 improves by about 159%. Conclusion: TTRL is broadly effective across tasks and shows its potential on unlabeled data. Abstract: This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is supervised only by the Maj@N metric, its performance consistently surpasses the upper limit of the initial model and approaches that of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
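
The way majority voting can stand in for a ground-truth label is easy to sketch. The function below is illustrative only and is not taken from the TTRL repository: `majority_vote_rewards` is a hypothetical name, and the binary 1.0/0.0 reward is an assumption, since the abstract says only that majority voting yields effective rewards.

```python
from collections import Counter


def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Treat the most frequent sampled answer as a pseudo-label and reward agreement with it."""
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    # Assumed reward rule: 1.0 if a sample matches the majority answer, else 0.0.
    rewards = [1.0 if answer == pseudo_label else 0.0 for answer in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    sampled = ["42", "41", "42", "7", "42"]  # final answers extracted from N sampled responses
    print(majority_vote_rewards(sampled))
```

These per-sample rewards would then feed a standard RL update. The pseudo-label can be wrong whenever the majority is wrong, which is why the reported gains over the initial model are notable.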

cs.AI

[110] Learning Adaptive Parallel Reasoning with Language Models

Jiayi Pan,Xiuyu Li,Long Lian,Charlie Snell,Yifei Zhou,Adam Yala,Trevor Darrell,Kurt Keutzer,Alane Suhr

Main category: cs.AI

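TL;DR: APR is a new reasoning framework that optimizes how language models allocate computation through adaptive parallel reasoning, markedly improving performance and scalability.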

Details

Motivation: Existing reasoning methods either produce overly long outputs or coordinate poorly, which limits the reasoning ability and efficiency of language models. Method: Proposes the APR framework, which combines serialized and parallel computation through spawn() and join() operations and optimizes the inference threads with reinforcement learning. Result: On the Countdown task, APR performs better at the same context window, total computation, and latency. Conclusion: APR offers a new direction for language models to adaptively optimize their own reasoning process. Abstract: Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000 ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
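
The spawn/join control flow can be pictured with an ordinary thread pool on a Countdown-style puzzle. This is only an analogy under stated assumptions: there is no language model here, the children run a classical exhaustive search, success means the target appears among the remaining numbers at any point, and `next_states`, `solve`, and `solve_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def next_states(numbers: tuple[int, ...]) -> list[tuple[int, ...]]:
    """All states reachable by combining two numbers with +, *, |a-b|, or exact division."""
    states = []
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = tuple(n for k, n in enumerate(numbers) if k not in (i, j))
        results = {a + b, a * b, abs(a - b)}
        if b != 0 and a % b == 0:
            results.add(a // b)
        if a != 0 and b % a == 0:
            results.add(b // a)
        states.extend(rest + (r,) for r in results)
    return states


def solve(numbers: tuple[int, ...], target: int) -> bool:
    """Serial child search: depth-first over all remaining combinations."""
    if target in numbers:
        return True
    return any(solve(state, target) for state in next_states(numbers))


def solve_parallel(numbers: tuple[int, ...], target: int, max_workers: int = 4) -> bool:
    """Parent thread: spawn one child search per first move, then join their results."""
    if target in numbers:
        return True
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        children = [pool.submit(solve, state, target) for state in next_states(numbers)]  # spawn
        return any(child.result() for child in children)  # join


if __name__ == "__main__":
    print(solve_parallel((3, 5, 7), 22))
```

In this analogy the parent keeps only each child's final verdict rather than its whole search trace, which loosely mirrors why delegating to child threads can save context in the parent. Python threads share the GIL, so the sketch shows the control flow, not a real speedup.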

Daocheng Fu,Zijun Chen,Renqiu Xia,Qi Liu,Yuan Feng,Hongbin Zhou,Renrui Zhang,Shiyang Feng,Peng Gao,Junchi Yan,Botian Shi,Bo Zhang,Yu Qiao

Main category: cs.AI

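TL;DR: The paper proposes TrustGeoGen, a data engine that generates geometry problems and provides a benchmark through formal verification, addressing the noise and self-contradiction in existing synthetic geometry benchmarks.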

Details

Motivation: To address the challenges of multimodal information integration and logical consistency in geometric problem solving, and to fill the methodological and verification gaps in existing benchmarks. Method: Proposes the TrustGeoGen engine, which generates a high-quality geometry problem dataset through multimodal-aligned generation, formal verification, a bootstrapping mechanism, and the GeoExplore algorithms. Result: Produces the GeoTrust-200K dataset and a test set; experiments show that existing models reach only 49.17% accuracy on the test set, while models trained on the data generalize well out of distribution. Conclusion: TrustGeoGen provides a reliable benchmark and method for geometric problem solving and significantly reduces logical inconsistencies. Abstract: Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the fast development of large language models in general problem solving, GPS remains unresolved with regard to both methodology and benchmarks, especially given that existing synthetic GPS benchmarks are often not self-verified and contain noise and self-contradictory information due to the hallucinations of LLMs. In this paper, we propose a scalable data engine called TrustGeoGen for problem generation, with formal verification to provide a principled benchmark, which we believe lays the foundation for the further development of methods for GPS. The engine synthesizes geometric data through four key innovations: 1) multimodal-aligned generation of diagrams, textual descriptions, and stepwise solutions; 2) formal verification ensuring rule-compliant reasoning paths; 3) a bootstrapping mechanism enabling complexity escalation via recursive state generation; and 4) our GeoExplore series of algorithms, which simultaneously produce multi-solution variants and self-reflective backtracking traces. By formal logical verification, TrustGeoGen produces the GeoTrust-200K dataset with guaranteed modality integrity, along with the GeoTrust-test test set. Experiments reveal that state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test, demonstrating its evaluation stringency. Crucially, models trained on GeoTrust achieve OOD generalization on GeoQA, significantly reducing logical inconsistencies relative to pseudo-labels annotated by OpenAI-o1. Our code is available at https://github.com/Alpha-Innovator/TrustGeoGen

[130] Recent Advances and Future Directions in Extended Reality (XR): Exploring AI-Powered Spatial Intelligence

Baichuan Zeng

[137] Real-Time Sentiment Insights from X Using VADER, DistilBERT, and Web-Scraped Data

Yanampally Abhiram Reddy,Siddhi Agarwal,Vikram Parashar,Arshiya Arora

Main category: econ.GN

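TL;DR: The paper proposes a real-time corporate reputation monitoring system that combines NLP and machine learning, analyzing social media data with a hybrid sentiment detection framework and revealing differences in public sentiment across companies.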

Details

Motivation: In the age of social media, understanding public sentiment toward companies is crucial for investors, policymakers, and researchers. Method: A hybrid sentiment detection framework (the rule-based VADER model and the DistilBERT deep learning model) combined with data preprocessing and ensemble classification. Result: Sentiment scores differ markedly across companies: Amazon (81.2) and Samsung (45.8) score well, while Microsoft (21.7) and Walmart (21.9) score poorly. Conclusion: The system gives stakeholders a comprehensive sentiment analysis that supports data-driven strategic decisions. Abstract: In the age of social media, understanding public sentiment toward major corporations is crucial for investors, policymakers, and researchers. This paper presents a comprehensive sentiment analysis system tailored for corporate reputation monitoring, combining Natural Language Processing (NLP) and machine learning techniques to accurately interpret public opinion in real time. The methodology integrates a hybrid sentiment detection framework leveraging both rule-based models (VADER) and transformer-based deep learning models (DistilBERT), applied to social media data from multiple platforms. The system begins with robust preprocessing involving noise removal and text normalization, followed by sentiment classification using an ensemble approach to ensure both interpretability and contextual accuracy. Results are visualized through sentiment distribution plots, comparative analyses, and temporal sentiment trends for enhanced interpretability. Our analysis reveals significant disparities in public sentiment across major corporations, with companies like Amazon (81.2) and Samsung (45.8) receiving excellent sentiment scores, while Microsoft (21.7) and Walmart (21.9) exhibit poor sentiment profiles. These findings demonstrate the utility of our multi-source sentiment framework in providing actionable insights regarding corporate public perception, enabling stakeholders to make informed strategic decisions based on comprehensive sentiment analysis.

Table of Contents

cs.CV

[1] LLM-Enabled Style and Content Regularization for Personalized Text-to-Image Generation

[2] LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

[3] Model-based Metric 3D Shape and Motion Reconstruction of Wild Bottlenose Dolphins in Drone-Shot Videos

[4] Event2Vec: Processing neuromorphic events directly by representations in vector space

[5] Towards Understanding Camera Motions in Any Video

[6] Physics Driven Image Simulation from Commercial Satellite Imagery

[7] Plug-and-Play Versatile Compressed Video Enhancement

[8] ICGM-FRAX: Iterative Cross Graph Matching for Hip Fracture Risk Assessment using Dual-energy X-ray Absorptiometry Images

[9] MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World

[10] Context Aware Grounded Teacher for Source Free Object Detection

[11] IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

[12] Manifold Induced Biases for Zero-shot and Few-shot Detection of Generated Images

[13] Emergence and Evolution of Interpretable Concepts in Diffusion Models

[14] CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

[15] InstaRevive: One-Step Image Enhancement via Dynamic Score Matching

[16] Multi-Modal Fusion of In-Situ Video Data and Process Parameters for Online Forecasting of Cookie Drying Readiness

[17] SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking

[18] HS-Mamba: Full-Field Interaction Multi-Groups Mamba for Hyperspectral Image Classification

[19] AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization

[20] FaceInsight: A Multimodal Large Language Model for Face Perception

[21] ZeroSlide: Is Zero-Shot Classification Adequate for Lifelong Learning in Whole-Slide Image Analysis in the Era of Pathology Vision-Language Foundation Models?

[22] AffordanceSAM: Segment Anything Once More in Affordance Grounding

[23] DiTPainter: Efficient Video Inpainting with Diffusion Transformers

[24] Motion-Enhanced Nonlocal Similarity Implicit Neural Representation for Infrared Dim and Small Target Detection

[25] DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining

[26] Vidi: Large Multimodal Models for Video Understanding and Editing

[27] You Sense Only Once Beneath: Ultra-Light Real-Time Underwater Object Detection

[28] RePOPE: Impact of Annotation Errors on the POPE Benchmark

[29] Structure-Preserving Zero-Shot Image Editing via Stage-Wise Latent Injection in Diffusion Models

[30] SAGA: Semantic-Aware Gray color Augmentation for Visible-to-Thermal Domain Adaptation across Multi-View Drone and Ground-Based Vision Systems

[31] GADS: A Super Lightweight Model for Head Pose Estimation

[32] DSDNet: Raw Domain Demoiréing via Dual Color-Space Synergy

[33] Multi-Scale Tensorial Summation and Dimensional Reduction Guided Neural Network for Edge Detection

[34] Pose Optimization for Autonomous Driving Datasets using Neural Rendering Models

[35] Towards prediction of morphological heart age from computed tomography angiography

[36] Satellite to GroundScape -- Large-scale Consistent Ground View Generation from Satellite Views

[37] Development and evaluation of a deep learning algorithm for German word recognition from lip movements

[38] Locating and Mitigating Gradient Conflicts in Point Cloud Domain Adaptation via Saliency Map Skewness

[39] Human-Imperceptible Physical Adversarial Attack for NIR Face Recognition Models

[40] Text-based Animatable 3D Avatars with Morphable Model Alignment

[41] DERD-Net: Learning Depth from Event-based Ray Densities

[42] MedNNS: Supernet-based Medical Task-Adaptive Neural Network Search

[43] Integrating Non-Linear Radon Transformation for Diabetic Retinopathy Grading

[44] MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction

[45] Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions

[46] ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting

[47] A Clinician-Friendly Platform for Ophthalmic Image Analysis Without Technical Barriers

[48] Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models

[49] Benchmarking the Reproducibility of Brain MRI Segmentation Across Scanners and Time

[50] Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

[51] FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

[52] Efficient Adaptation of Deep Neural Networks for Semantic Segmentation in Space Applications

[53] MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

[54] Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework

[55] Survey of Video Diffusion Models: Foundations, Implementations, and Applications

[56] PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning

[57] LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

[58] Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis

[59] Vision language models are unreliable at trivial spatial cognition

[60] Boosting Generative Image Modeling via Joint Image-Feature Synthesis

[61] Describe Anything: Detailed Localized Image and Video Captioning

[62] From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

[63] MR. Video: "MapReduce" is the Principle for Long Video Understanding

[64] MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

cs.GR

[65] Vision6D: 3D-to-2D Interactive Visualization and Annotation Tool for 6D Pose Estimation

[66] Iris: A Next Generation Digital Pathology Rendering Engine

[67] Neural Kinematic Bases for Fluids

[68] Low-Rank Adaptation of Neural Fields

[69] SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow

cs.CL

[70] Exploring Compositional Generalization (in ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

[71] Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection

[72] Trillion 7B Technical Report

[73] Feeding LLM Annotations to BERT Classifiers at Your Own Risk

[74] Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

[75] Speculative Sampling via Exponential Races

[76] SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation