cs.CV

[1] Dense Air Pollution Estimation from Sparse in-situ Measurements and Satellite Data

Ruben Gonzalez Avilés,Linus Scheibenreif,Damian Borth

Main category: cs.CV

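TL;DR: This paper proposes a new dense estimation technique for efficiently estimating ambient nitrogen dioxide (NO₂) concentrations at global scale, addressing the computational cost of existing methods while markedly improving accuracy.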

Details

Motivation: Existing satellite-based air quality estimation methods are computationally intensive and cannot efficiently provide accurate estimates over large areas. This study aims to balance the accuracy of high-resolution estimates against computational efficiency.

Method: A uniformly random offset sampling strategy disperses the ground-truth pixel location evenly across a larger patch, and a dense estimation method then generates a grid of estimates in a single step, significantly reducing the compute required.

Result: The new method achieves a mean absolute error (MAE) of 4.98 μg/m³, a 9.45% improvement over existing point-wise methods, combining high accuracy with computational efficiency.

Conclusion: The method offers a viable solution for large-scale environmental monitoring and demonstrates adaptability and robustness.

Abstract: This paper addresses the critical environmental challenge of estimating ambient Nitrogen Dioxide (NO$_2$) concentrations, a key issue in public health and environmental policy. Existing methods for satellite-based air pollution estimation model the relationship between satellite and in-situ measurements at select point locations. While these approaches have advanced our ability to provide air quality estimations on a global scale, they come with inherent limitations. The most notable limitation is the computational intensity required for generating comprehensive estimates over extensive areas. Motivated by these limitations, this study introduces a novel dense estimation technique. Our approach seeks to balance the accuracy of high-resolution estimates with the practicality of computational constraints, thereby enabling efficient and scalable global environmental assessment. By utilizing a uniformly random offset sampling strategy, our method disperses the ground truth data pixel location evenly across a larger patch. At inference, the dense estimation method can then generate a grid of estimates in a single step, significantly reducing the computational resources required to provide estimates for larger areas. Notably, our approach also surpasses the results of existing point-wise methods by a significant margin of $9.45\%$, achieving a Mean Absolute Error (MAE) of $4.98\ \mu\text{g}/\text{m}^3$. This demonstrates both high accuracy and computational efficiency, highlighting the applicability of our method for global environmental assessment. Furthermore, we showcase the method's adaptability and robustness by applying it to diverse geographic regions. Our method offers a viable solution to the computational challenges of large-scale environmental monitoring.
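The random offset sampling idea can be illustrated with a small sketch. It assumes a patch is cropped around each measurement station so that the station lands on a uniformly random pixel of the patch; the helper names `sample_patch_origin` and `mean_absolute_error` are hypothetical and not taken from the paper, and border handling is omitted.

```python
import random


def sample_patch_origin(station_row, station_col, patch_size, rng):
    """Pick a patch's top-left corner so that the station pixel lands at a
    uniformly random position inside the patch (hypothetical helper)."""
    # Position of the station inside the patch, uniform over the patch grid.
    off_row = rng.randrange(patch_size)
    off_col = rng.randrange(patch_size)
    # Top-left corner in image coordinates. Near image borders this can be
    # negative; padding or clamping is left out of this sketch.
    return station_row - off_row, station_col - off_col, off_row, off_col


def mean_absolute_error(predictions, targets):
    """MAE over paired lists of estimates and in-situ measurements."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


rng = random.Random(0)
seen = set()
for _ in range(2000):
    top, left, off_row, off_col = sample_patch_origin(100, 250, 4, rng)
    seen.add((off_row, off_col))

# Number of distinct in-patch label positions visited during sampling.
print(len(seen))
```

Because the label can fall on any pixel of the patch during training, a model trained this way can plausibly be asked for an estimate at every pixel of a patch in one forward pass, which is the single-step grid estimate the abstract describes.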

[2] DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

Zhenhailong Wang,Senthil Purushwalkam,Caiming Xiong,Silvio Savarese,Heng Ji,Ran Xu

Main category: cs.CV

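TL;DR: DyMU is an efficient, training-free framework that dynamically merges visual tokens and virtually reconstructs the token sequence, significantly reducing the computational burden of vision-language models while maintaining high performance.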

Details

Motivation: Address the inefficiency caused by fixed-length outputs in vision-language models while avoiding any additional fine-tuning.

Method: Dynamic Token Merging (DToMe) and Virtual Token Unmerging (VTU) adapt token compression to the content of each image.

Result: Experiments show that DyMU reduces the visual token count by 32%-85% with performance comparable to the full-length models.

Conclusion: DyMU provides an efficient, training-free dynamic compression method applicable to a range of state-of-the-art VLM architectures.

Abstract: We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.
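To make content-dependent token counts concrete, the sketch below merges tokens whose cosine similarity exceeds a threshold. It is a simplified greedy stand-in, not the DToMe procedure from the paper: the function `merge_similar_tokens`, the threshold value, and the running-mean merge rule are all assumptions, and tokens are assumed to be nonzero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def merge_similar_tokens(tokens, threshold):
    """Greedy threshold merge (illustrative only).

    Returns a list of (mean_vector, member_count) pairs, one per merged token.
    """
    merged = []  # each entry is [sum_vector, member_count]
    for tok in tokens:
        for entry in merged:
            mean = [s / entry[1] for s in entry[0]]
            if cosine(tok, mean) >= threshold:
                # Absorb the token into the first sufficiently similar group.
                entry[0] = [s + t for s, t in zip(entry[0], tok)]
                entry[1] += 1
                break
        else:
            # No similar group found, so the token starts its own group.
            merged.append([list(tok), 1])
    return [([s / n for s in total], n) for total, n in merged]


simple_image = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.01]]    # near-duplicate tokens
complex_image = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # dissimilar tokens
print(len(merge_similar_tokens(simple_image, 0.9)))
print(len(merge_similar_tokens(complex_image, 0.9)))
```

With a fixed threshold, a visually uniform input collapses to few tokens while a varied input keeps more, which is the behavior the abstract attributes to merging based on image complexity.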

[3] PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation

Xinqi Xiong,Andrea Dunn Beltran,Jun Myeong Choi,Marc Niethammer,Roni Sengupta

Main category: cs.CV

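TL;DR: Proposes an image translation framework combining Stable Diffusion and ControlNet that uses Per-Pixel Shading (PPS) maps to generate more realistic textures and improve depth estimation.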

Details

Motivation: Ground-truth depth is hard to obtain in clinical settings, and training on synthetic data suffers from a domain gap, so better image translation is needed to improve the generalization of depth estimation.

Method: Stable Diffusion is combined with ControlNet, conditioned on a latent representation of the PPS map, to generate realistic textures while preserving structure.

Result: Experiments show that the method produces more realistic images and better depth estimation than the GAN-based MI-CycleGAN.

Conclusion: The proposed framework narrows the domain gap between synthetic and clinical data and improves depth estimation accuracy.

Abstract: Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.

[4] Distilling semantically aware orders for autoregressive image generation

Rishav Pramanik,Antoine Poupon,Juan A. Rodriguez,Masih Aminbeidokhti,David Vazquez,Christopher Pal,Zhaozheng Yin,Marco Pedersoli

Main category: cs.CV

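TL;DR: This paper proposes an improved autoregressive image generation method that trains a model to generate patches in any given order and then fine-tunes it on the extracted orders, producing higher-quality images.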

Details

Motivation: Conventional autoregressive image generation models use a fixed raster-scan order (left to right, top to bottom), which ignores causal relationships in the image content and leads to suboptimal results.

Method: First train a model to generate patches in any given order, inferring both the content and the location of each patch; then fine-tune the model using the extracted orders.

Result: Experiments on two datasets show better image quality than the traditional raster-scan approach, with similar training costs and no extra annotations.

Conclusion: Inferring the generation order dynamically can improve the quality of autoregressive image generation.

Abstract: Autoregressive patch-based image generation has recently shown competitive results in terms of image quality and scalability. It can also be easily integrated and scaled within Vision-Language models. Nevertheless, autoregressive models require a defined order for patch generation. While a natural order based on the dictation of the words makes sense for text generation, there is no inherent generation order that exists for image generation. Traditionally, a raster-scan order (from top-left to bottom-right) guides autoregressive image generation models. In this paper, we argue that this order is suboptimal, as it fails to respect the causality of the image content: for instance, when conditioned on a visual description of a sunset, an autoregressive model may generate clouds before the sun, even though the color of clouds should depend on the color of the sun and not the inverse. In this work, we show that first by training a model to generate patches in any-given-order, we can infer both the content and the location (order) of each patch during generation. Secondly, we use these extracted orders to finetune the any-given-order model to produce better-quality images. Through our experiments, we show on two datasets that this new generation method produces better images than the traditional raster-scan approach, with similar training costs and no extra annotations.

[5] Scene-Aware Location Modeling for Data Augmentation in Automotive Object Detection

Jens Petersen,Davide Abati,Amirhossein Habibian,Auke Wiggers

Main category: cs.CV

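TL;DR: The paper proposes a scene-aware probabilistic location model that predicts plausible locations for new objects in an existing scene, then inpaints objects at those locations with a generative model to improve data augmentation.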

Details

Motivation: Existing generative data augmentation methods for automotive object detection pay little attention to whether object layouts in the scene are realistic, which limits their effectiveness.

Method: A scene-aware probabilistic location model predicts realistic locations for new objects, and a generative model inpaints objects at those locations.

Result: On two automotive object detection tasks the method sets a new state of the art for generative data augmentation, with gains up to 2.8× higher than the best competing approach (+1.4 vs. +0.5 mAP). Instance segmentation also improves significantly.

Conclusion: By attending to the realism of scene layouts, the method substantially strengthens generative data augmentation and suggests a new direction for related tasks.

Abstract: Generative image models are increasingly being used for training data augmentation in vision tasks. In the context of automotive object detection, methods usually focus on producing augmented frames that look as realistic as possible, for example by replacing real objects with generated ones. Others try to maximize the diversity of augmented frames, for example by pasting lots of generated objects onto existing backgrounds. Both perspectives pay little attention to the locations of objects in the scene. Frame layouts are either reused with little or no modification, or they are random and disregard realism entirely. In this work, we argue that optimal data augmentation should also include realistic augmentation of layouts. We introduce a scene-aware probabilistic location model that predicts where new objects can realistically be placed in an existing scene. By then inpainting objects in these locations with a generative model, we obtain much stronger augmentation performance than existing approaches. We set a new state of the art for generative data augmentation on two automotive object detection tasks, achieving up to $2.8\times$ higher gains than the best competing approach ($+1.4$ vs. $+0.5$ mAP boost). We also demonstrate significant improvements for instance segmentation.

[6] Transferring Spatial Filters via Tangent Space Alignment in Motor Imagery BCIs

Tekin Gunasar,Virginia de Sa

Main category: cs.CV

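TL;DR: Proposes aligning covariance matrices on a Riemannian manifold and computing a new CSP-based spatial filter to improve subject transfer in motor imagery BCIs.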

Details

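Motivation: Address insufficient subject-transfer performance in motor imagery BCIs, especially when training data are limited.

Method: Align covariance matrices on a Riemannian manifold, compute a new CSP-based spatial filter, and explore ways to integrate information from multiple subjects.

Result: Across three datasets the method is marginally better than standard CSP, with larger improvements when training data are limited.

Conclusion: The method improves subject transfer most clearly under limited-data conditions, offering a better option for BCI applications.

Abstract: We propose a method to improve subject transfer in motor imagery BCIs by aligning covariance matrices on a Riemannian manifold, followed by computing a new common spatial patterns (CSP) based spatial filter. We explore various ways to integrate information from multiple subjects and show improved performance compared to standard CSP. Across three datasets, our method shows marginal improvements over standard CSP; however, when training data are limited, the improvements become more significant.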

[7] Latent Video Dataset Distillation

Ning Li,Antai Andy Liu,Jingran Zhang,Justin Cui

Main category: cs.CV

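TL;DR: This paper proposes a new video dataset distillation method that operates in the latent space and uses a diversity-aware data selection strategy, significantly improving performance.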

Details

Motivation: Existing video dataset distillation methods mainly compress in pixel space and overlook advances in the latent space; this paper aims to close that gap.

Method: Operate in the latent space using a state-of-the-art variational encoder, apply a diversity-aware data selection strategy, and add a training-free compression step.

Result: The method outperforms prior methods on all datasets, for example by 2.6% on HMDB51 IPC 1 and by 7.8% on MiniUCF IPC 5.

Conclusion: The method sets a new state of the art in video dataset distillation.

Abstract: Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves a new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a 2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance increase.
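The abstract does not specify the selection rule, so the sketch below uses greedy farthest-point selection as one plausible reading of "representative and diverse": it starts from the latent closest to the class centroid and then repeatedly adds the latent farthest from everything already chosen. The function `select_diverse` is hypothetical, and `k` is assumed to be no larger than the number of latents.

```python
import math


def select_diverse(latents, k):
    """Greedy farthest-point selection of k sample indices (illustrative).

    latents: list of equal-length latent vectors for one class.
    """
    dim = len(latents[0])
    centroid = [sum(v[d] for v in latents) / len(latents) for d in range(dim)]
    # Representative start: the sample closest to the class centroid.
    first = min(range(len(latents)), key=lambda i: math.dist(latents[i], centroid))
    chosen = [first]
    while len(chosen) < k:
        # Diverse follow-ups: the sample whose nearest chosen neighbour
        # is as far away as possible.
        candidates = [i for i in range(len(latents)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: min(math.dist(latents[i], latents[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


latents = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [2.5, 2.5]]
print(select_diverse(latents, 3))
```

Here `k` plays the role of the per-class budget, which appears in the abstract as the "IPC" setting in the reported results.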

[8] A Comprehensive Review on RNA Subcellular Localization Prediction

Cece Zhang,Xuehuan Zhu,Nick Peterson,Jieqiong Wang,Shibiao Wan

Main category: cs.CV

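TL;DR: This paper reviews recent advances in AI/ML-based methods for RNA subcellular localization prediction, covering sequence-based, image-based, and hybrid approaches, and discusses their challenges and opportunities.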

Details

Motivation: Traditional wet lab methods are time-consuming, resource-demanding, and costly, so computational alternatives are needed to predict RNA subcellular localization at scale.

Method: Reviews AI/ML-based sequence-based, image-based, and hybrid methods for predicting RNA subcellular localization.

Result: AI/ML methods show potential to accelerate RNA research, uncover molecular pathways, and guide disease treatment.

Conclusion: The review offers a valuable resource for researchers in RNA subcellular localization and highlights challenges such as data scarcity and the lack of benchmarks.

Abstract: The subcellular localization of RNAs, including long non-coding RNAs (lncRNAs), messenger RNAs (mRNAs), microRNAs (miRNAs) and other smaller RNAs, plays a critical role in determining their biological functions. For instance, lncRNAs are predominantly associated with chromatin and act as regulators of gene transcription and chromatin structure, while mRNAs are distributed across the nucleus and cytoplasm, facilitating the transport of genetic information for protein synthesis. Understanding RNA localization sheds light on processes like gene expression regulation with spatial and temporal precision. However, traditional wet lab methods for determining RNA localization, such as in situ hybridization, are often time-consuming, resource-demanding, and costly. To overcome these challenges, computational methods leveraging artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternatives, enabling large-scale prediction of RNA subcellular localization. This paper provides a comprehensive review of the latest advancements in AI-based approaches for RNA subcellular localization prediction, covering various RNA types and focusing on sequence-based, image-based, and hybrid methodologies that combine both data types. We highlight the potential of these methods to accelerate RNA research, uncover molecular pathways, and guide targeted disease treatments. Furthermore, we critically discuss the challenges in AI/ML approaches for RNA subcellular localization, such as data scarcity and lack of benchmarks, and opportunities to address them. This review aims to serve as a valuable resource for researchers seeking to develop innovative solutions in the field of RNA subcellular localization and beyond.

Kai Cui,Jia Li,Yu Liu,Xuesong Zhang,Zhenzhen Hu,Meng Wang

Main category: cs.CV

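TL;DR: PhysioSync is a novel pre-training framework that uses temporal and cross-modal contrastive learning, modeling dynamic synchronization and consistency to improve EEG-based emotion recognition.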

Details

Motivation: EEG signals reflect emotional states but are noisy and vary across individuals, and existing methods overlook dynamic synchronization and semantic consistency across modalities.

Method: PhysioSync combines Cross-Modal Consistency Alignment (CM-CA) with Long- and Short-Term Temporal Contrastive Learning (LS-TCL); after pre-training, features are fused and fine-tuned.

Result: On the DEAP and DREAMER datasets, PhysioSync shows strong performance under both uni-modal and cross-modal conditions.

Conclusion: By modeling dynamic synchronization and features at multiple temporal resolutions, PhysioSync markedly improves EEG emotion recognition.

Abstract: Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.
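Cross-modal contrastive objectives are commonly written as an InfoNCE loss, and the sketch below shows that generic form for paired EEG and PPS embeddings. It is not the CM-CA or LS-TCL loss from the paper, whose exact formulation the abstract does not give; the function `info_nce`, the temperature value, and the assumption of unit-normalized embeddings are all illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Average InfoNCE loss where anchors[i] is paired with positives[i].

    All other entries of `positives` act as negatives for anchors[i].
    Embeddings are assumed to be unit-normalized.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


eeg = [[1.0, 0.0], [0.0, 1.0]]
pps_synchronized = [[1.0, 0.0], [0.0, 1.0]]   # matching pairs agree
pps_mismatched = [[0.0, 1.0], [1.0, 0.0]]     # matching pairs disagree
print(info_nce(eeg, pps_synchronized) < info_nce(eeg, pps_mismatched))
```

The loss is small when each EEG embedding is closest to its own PPS partner, which is the kind of cross-modal agreement a consistency-alignment objective would reward.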

[10] A Genealogy of Multi-Sensor Foundation Models in Remote Sensing

Kevin Lane,Morteza Karimzadeh

Main category: cs.CV

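TL;DR: This paper examines the development and application of foundation models in remote sensing, analyzes their roots in computer vision, and outlines directions for improvement.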

Details

Motivation: Study the potential and challenges of foundation models in remote sensing in order to advance their development.

Method: Analyze existing foundation model approaches and their roots in computer vision, and examine how they use multi-sensor data.

Result: The paper summarizes the strengths and weaknesses of existing approaches and outlines directions for improvement, such as better use of multi-sensor data and reduced compute requirements.

Conclusion: Future work should make fuller use of unlabeled, seasonal, and multi-sensor remote sensing data to improve foundation models.

Abstract: Foundation models have garnered increasing attention for representation learning in remote sensing, primarily adopting approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches that each come with significant benefits and drawbacks. This paper examines these approaches along with their roots in the computer vision field in order to characterize potential advantages and pitfalls while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We place emphasis on the multi-sensor aspect of Earth observations, and the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.

[11] We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

Minkyu Choi,S P Sharan,Harsh Goel,Sahil Shah,Sandeep Chinchali

Main category: cs.CV

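TL;DR: The paper proposes a zero-training video refinement method that uses neuro-symbolic feedback to improve the semantic and temporal consistency of text-to-video generation.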

Details

Motivation: Current text-to-video models struggle to stay semantically and temporally consistent on long or complex prompts, and improving them directly through training is costly.

Method: A video refinement pipeline driven by neuro-symbolic feedback analyzes a formal video representation, pinpoints inconsistent events and objects, and guides targeted edits.

Result: Experiments show that the method significantly improves temporal and logical alignment between videos and prompts, by almost 40%.

Conclusion: The zero-training method addresses the semantic and temporal consistency problems of text-to-video models and is broadly applicable.

Abstract: Current text-to-video (T2V) generation models are increasingly popular due to their ability to produce coherent videos from textual prompts. However, these models often struggle to generate semantically and temporally consistent videos when dealing with longer, more complex prompts involving multiple objects or sequential events. Additionally, the high computational costs associated with training or fine-tuning make direct improvements impractical. To overcome these limitations, we introduce $\projectname$, a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. Our approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video. Extensive empirical evaluations on both open-source and proprietary T2V models demonstrate that $\projectname$ significantly enhances temporal and logical alignment across diverse prompts by almost $40\%$.

[12] Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Phillip Y. Lee,Jihyeon Je,Chanho Park,Mikaela Angelina Uy,Leonidas Guibas,Minhyuk Sung

Main category: cs.CV

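TL;DR: Proposes Abstract Perspective Change (APC), a perspective-aware reasoning framework based on mental imagery simulation that improves the perspective-taking ability of vision-language models.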

Details

Motivation: Perspective-taking is a key capability of human visual understanding, but current vision-language models perform poorly at it and show an egocentric bias.

Method: The Abstract Perspective Change (APC) framework uses vision foundation models (such as object detection, segmentation, and orientation estimation) to build scene abstractions and carry out perspective transformations.

Result: On synthetic and real-image benchmarks, APC significantly improves perspective-aware reasoning and outperforms fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.

Conclusion: APC narrows the gap between vision-language models and human perspective-taking, offering a new approach for environmental interaction and collaboration with autonomous agents.

Abstract: We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.

[13] MCAF: Efficient Agent-based Video Understanding Framework through Multimodal Coarse-to-Fine Attention Focusing

Shiwen Cao,Zhaoxing Zhang,Junming Jiao,Juyi Qiao,Guowen Song,Rong Shen

Main category: cs.CV

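TL;DR: MCAF is an agent-based, training-free framework that performs video understanding through multimodal coarse-to-fine attention focusing, markedly improving long-video understanding.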

Details

Motivation: Even with rapid advances in large models, long-video understanding remains challenging because videos contain redundant information and require attention to be allocated at a global level.

Method: MCAF hierarchically focuses on relevant frames using multimodal information, applies a dilated temporal expansion mechanism to avoid missing key details, and adds a self-reflection mechanism that iteratively refines attention.

Result: MCAF gains 5% on EgoSchema, 0.2% on Next-QA, and 0.3% on IntentQA, and also outperforms other agent-based methods on Video-MME.

Conclusion: Through its attention focusing strategies, MCAF improves the accuracy of long-video understanding.

Abstract: Even in the era of rapid advances in large models, video understanding, particularly long videos, remains highly challenging. Compared with textual or image-based information, videos commonly contain more information with redundancy, requiring large models to strategically allocate attention at a global level for accurate comprehension. To address this, we propose MCAF, an agent-based, training-free framework that performs video understanding through Multimodal Coarse-to-fine Attention Focusing. The key innovation lies in its ability to sense and prioritize segments of the video that are highly relevant to the understanding task. First, MCAF hierarchically concentrates on highly relevant frames through multimodal information, enhancing the correlation between the acquired contextual information and the query. Second, it employs a dilated temporal expansion mechanism to mitigate the risk of missing crucial details when extracting information from these concentrated frames. In addition, our framework incorporates a self-reflection mechanism utilizing the confidence level of the model's responses as feedback. By iteratively applying these two creative focusing strategies, it adaptively adjusts attention to capture highly query-connected context and thus improves response accuracy. MCAF outperforms comparable state-of-the-art methods on average. On the EgoSchema dataset, it achieves a remarkable 5% performance gain over the leading approach. Meanwhile, on Next-QA and IntentQA datasets, it outperforms the current state-of-the-art standard by 0.2% and 0.3% respectively. On the Video-MME dataset, which features videos averaging nearly an hour in length, MCAF also outperforms other agent-based methods.
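The abstract names a dilated temporal expansion mechanism without defining it. One plausible reading, sketched below, is that each focused frame is widened with neighbours sampled at a fixed stride on both sides, so nearby context is covered without reading every frame. The function `dilated_expand` and its `dilation` and `steps` parameters are hypothetical.

```python
def dilated_expand(focus_frames, num_frames, dilation, steps):
    """Expand focused frame indices with strided neighbours (illustrative).

    For every focused frame f, also keep f +/- dilation, f +/- 2*dilation,
    up to `steps` hops, dropping indices that fall outside the video.
    """
    expanded = set()
    for f in focus_frames:
        for k in range(-steps, steps + 1):
            idx = f + k * dilation
            if 0 <= idx < num_frames:
                expanded.add(idx)
    return sorted(expanded)


# One focused frame in a 100-frame video, stride 4, two hops each side.
print(dilated_expand([10], 100, 4, 2))
```

A larger `dilation` widens the temporal window for the same number of extra frames, which is the usual trade-off behind dilated sampling.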

Mengyu Qiao,Runze Tian,Yang Wang

Main category: cs.CV

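TL;DR: Proposes a new deepfake detection framework built on multi-scale spatial-frequency analysis that markedly improves detection accuracy and generalization.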

Details

Motivation: Existing methods rely mainly on spatial-domain analysis and use frequency-domain operations only for feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions underexploited.

Method: The framework has three key components: local spectral feature extraction, global spectral feature extraction, and a multi-stage cross-modal fusion mechanism.

Result: On widely adopted benchmarks the method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.

Conclusion: Multi-scale spatial-frequency analysis effectively addresses the performance degradation that detectors suffer on unseen forgeries.

Abstract: The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions insufficiently exploited. To address this limitation, we propose a novel detection framework that integrates multi-scale spatial-frequency analysis for universal deepfake detection. Our framework comprises three key components: (1) a local spectral feature extraction pipeline that combines block-wise discrete cosine transform with cascaded multi-scale convolutions to capture subtle spectral artifacts; (2) a global spectral feature extraction pipeline utilizing scale-invariant differential accumulation to identify holistic forgery distribution patterns; and (3) a multi-stage cross-modal fusion mechanism that incorporates shallow-layer attention enhancement and deep-layer dynamic modulation to model spatial-frequency interactions. Extensive evaluations on widely adopted benchmarks demonstrate that our method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.
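The block-wise discrete cosine transform mentioned in component (1) can be sketched as follows. This is only the standard orthonormal 2D DCT-II applied to non-overlapping blocks; the block size, the function names, and the dictionary output format are assumptions, and the cascaded multi-scale convolutions from the paper are not shown.

```python
import math


def dct2_block(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        # Normalization that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for x in range(n):
                for y in range(n):
                    acc += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            out[u][v] = alpha(u) * alpha(v) * acc
    return out


def blockwise_dct(image, block_size):
    """Apply dct2_block to each non-overlapping block of a 2D image.

    Image height and width are assumed to be multiples of block_size.
    Returns {(row, col): coefficient_block} keyed by block origin.
    """
    height, width = len(image), len(image[0])
    result = {}
    for r in range(0, height, block_size):
        for c in range(0, width, block_size):
            block = [row[c:c + block_size] for row in image[r:r + block_size]]
            result[(r, c)] = dct2_block(block)
    return result


flat = [[2.0] * 4 for _ in range(4)]
coeffs = dct2_block(flat)
print(round(coeffs[0][0], 6))
```

A perfectly flat block puts all of its energy in the DC coefficient, so any energy in the higher-frequency coefficients reflects local texture or, potentially, the subtle spectral artifacts such a detector looks for.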

[15] Visual and textual prompts for enhancing emotion recognition in video

Zhifeng Wang,Qixuan Zhang,Peter Zhang,Wenjia Niu,Kaihao Zhang,Ramesh Sankaranarayana,Sabrina Caldwell,Tom Gedeon

Main category: cs.CV

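TL;DR: SoVTP is a novel framework that improves the zero-shot video emotion recognition of VLLMs by integrating spatial annotations, physiological signals, and contextual cues into a unified prompt.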

Details

Motivation: Existing VLLMs are limited in video emotion recognition by insufficient spatial and contextual awareness, and traditional approaches neglect non-verbal cues, which reduces robustness.

Method: The SoVTP framework uses a unified prompting strategy that combines spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, and others).

Result: Experiments show that SoVTP substantially outperforms existing visual prompting methods on video emotion recognition.

Conclusion: SoVTP strengthens the video emotion recognition capabilities of VLLMs, preserving holistic scene information while supporting fine-grained analysis.

Abstract: Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others' emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs' video emotion recognition capabilities.

[16] Range Image-Based Implicit Neural Compression for LiDAR Point Clouds

Akihiro Kuwabara,Sorachi Kato,Takuya Fujihashi,Toshiaki Koike-Akino,Takashi Watanabe

Main category: cs.CV

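TL;DR: This paper proposes an implicit neural representation (INR)-based compression method for LiDAR point clouds that splits 2D range images (RIs) into depth and mask images and applies model pruning and quantization, markedly improving compression efficiency and 3D reconstruction quality.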

Details

Motivation: Conventional image compression techniques are of limited efficiency on RIs because RIs differ markedly from natural images in bit precision and pixel value distribution, so a more efficient compression method is needed.

Method: RIs are divided into depth and mask images, which are compressed with patch-wise and pixel-wise INR architectures respectively, together with model pruning and quantization.

Result: Experiments on the KITTI dataset show that the method outperforms existing image, point cloud, RI, and INR-based compression methods at low bitrates and decoding latency.

Conclusion: The method offers a new approach to efficient LiDAR point cloud compression while maintaining high-quality 3D reconstruction and detection.

Abstract: This paper presents a novel scheme to efficiently compress Light Detection and Ranging (LiDAR) point clouds, enabling high-precision 3D scene archives, and such archives pave the way for a detailed understanding of the corresponding 3D scenes. We focus on 2D range images (RIs) as a lightweight format for representing 3D LiDAR observations. Although conventional image compression techniques can be adapted to improve compression efficiency for RIs, their practical performance is expected to be limited due to differences in bit precision and the distinct pixel value distribution characteristics between natural images and RIs. We propose a novel implicit neural representation (INR)-based RI compression method that effectively handles floating-point valued pixels. The proposed method divides RIs into depth and mask images and compresses them using patch-wise and pixel-wise INR architectures with model pruning and quantization, respectively. Experiments on the KITTI dataset show that the proposed method outperforms existing image, point cloud, RI, and INR-based compression methods in terms of 3D reconstruction and detection quality at low bitrates and decoding latency.
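The depth and mask split, and the idea of quantizing model weights, can be sketched in a few lines. The sketch assumes that pixels with no LiDAR return are stored as 0.0 and uses plain uniform quantization; both are assumptions for illustration, since the abstract does not state the invalid-pixel encoding or the quantizer, and the INR networks themselves are not shown.

```python
def split_range_image(range_image, invalid_value=0.0):
    """Split a range image into a depth image and a binary validity mask."""
    depth, mask = [], []
    for row in range_image:
        mask.append([0 if v == invalid_value else 1 for v in row])
        depth.append([0.0 if v == invalid_value else float(v) for v in row])
    return depth, mask


def merge_range_image(depth, mask, invalid_value=0.0):
    """Rebuild the range image from its depth and mask parts."""
    return [
        [d if m else invalid_value for d, m in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(depth, mask)
    ]


def quantize_uniform(weights, num_bits, w_min, w_max):
    """Uniformly quantize weights to integer codes (assumed scheme)."""
    levels = (1 << num_bits) - 1
    step = (w_max - w_min) / levels
    codes = [min(max(round((w - w_min) / step), 0), levels) for w in weights]
    return codes, step


def dequantize_uniform(codes, step, w_min):
    """Map integer codes back to approximate weight values."""
    return [w_min + c * step for c in codes]


ri = [[0.0, 12.5], [3.25, 0.0]]
depth, mask = split_range_image(ri)
print(mask)
```

Separating the mask means the depth model does not have to represent the sharp jumps between valid and missing pixels, which is one plausible motivation for handling the two images with different architectures.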

[17] Scene Perceived Image Perceptual Score (SPIPS): combining global and local perception for image quality assessment

Zhiqiang Lao,Heather Yu

Main category: cs.CV

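TL;DR: Proposes a new method combining deep learning with conventional image quality assessment (IQA) that separates high-level semantic and low-level perceptual features to reflect human visual perception more accurately.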

Details

Motivation: With the spread of AI and smartphones, image data has grown rapidly, and conventional IQA methods fall short on images produced by deep neural networks (DNNs), so assessment methods that better match human perception are needed.

Method: Deep features are disentangled into high-level semantic and low-level perceptual information, combined with conventional IQA metrics, and mapped to a quality score by a multilayer perceptron (MLP).

Result: Experiments show that the method agrees with human perceptual judgments better than existing IQA models.

Conclusion: The method effectively combines deep learning with conventional IQA and improves the accuracy of image quality assessment.

Abstract: The rapid advancement of artificial intelligence and widespread use of smartphones have resulted in an exponential growth of image data, both real (camera-captured) and virtual (AI-generated). This surge underscores the critical need for robust image quality assessment (IQA) methods that accurately reflect human visual perception. Traditional IQA techniques primarily rely on spatial features - such as signal-to-noise ratio, local structural distortions, and texture inconsistencies - to identify artifacts. While effective for unprocessed or conventionally altered images, these methods fall short in the context of modern image post-processing powered by deep neural networks (DNNs). The rise of DNN-based models for image generation, enhancement, and restoration has significantly improved visual quality, yet made accurate assessment increasingly complex. To address this, we propose a novel IQA approach that bridges the gap between deep learning methods and human perception. Our model disentangles deep features into high-level semantic information and low-level perceptual details, treating each stream separately. These features are then combined with conventional IQA metrics to provide a more comprehensive evaluation framework. This hybrid design enables the model to assess both global context and intricate image details, better reflecting the human visual process, which first interprets overall structure before attending to fine-grained elements. The final stage employs a multilayer perceptron (MLP) to map the integrated features into a concise quality score. Experimental results demonstrate that our method achieves improved consistency with human perceptual judgments compared to existing IQA models.

[18] DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks

Yinqi Li,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen

Main category: cs.CV

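TL;DR: The paper proposes a way to use pretrained diffusion models for discriminative tasks, extending them from classification to the more complex object detection task. Using an optimization method and a more accurate application of Bayes' rule, it performs on par with basic discriminative baselines on COCO and greatly speeds up diffusion-based classification.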

Details

Motivation: Diffusion models excel at generative tasks, but their use for discriminative tasks is underexplored. This paper aims to extend the discriminative capability of pretrained diffusion models, particularly to object detection.

Method: By "inverting" a pretrained layout-to-image diffusion model, the paper proposes a gradient-based discrete optimization approach to replace the heavy prediction enumeration process, and a prior distribution model to apply Bayes' rule more accurately.

Result: On the COCO dataset the method is on par with basic discriminative object detection baselines, and it greatly speeds up the previous diffusion-based classification method without sacrificing accuracy.

Conclusion: The method successfully applies diffusion models to discriminative tasks, including object detection, while improving efficiency.

Abstract: Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by "inverting" a pretrained layout-to-image diffusion model. To this end, a gradient-based discrete optimization approach for replacing the heavy prediction enumeration process, and a prior distribution model for making more accurate use of the Bayes' rule, are proposed respectively. Empirical results show that this method is on par with basic discriminative object detection baselines on COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at https://github.com/LiYinqi/DIVE.
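The role of the prior in Bayes' rule can be shown with a small numeric sketch. It assumes that a conditional generative model supplies a score standing in for log p(x | c) for each candidate condition c (in diffusion-based classifiers this is commonly approximated by a negative denoising error, which is an assumption here, not a detail from the abstract), and the function `posterior_from_scores` is hypothetical.

```python
import math


def posterior_from_scores(log_likelihoods, log_priors):
    """Bayes' rule in log space: p(c | x) is proportional to p(x | c) p(c).

    log_likelihoods[i] stands in for log p(x | c_i) and log_priors[i]
    for log p(c_i). Returns the normalized posterior over candidates.
    """
    joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(joint)
    weights = [math.exp(j - peak) for j in joint]
    total = sum(weights)
    return [w / total for w in weights]


# Two candidates the generative model finds equally likely.
equal_scores = [0.0, 0.0]
uniform_prior = [math.log(0.5), math.log(0.5)]
skewed_prior = [math.log(0.9), math.log(0.1)]
print(posterior_from_scores(equal_scores, uniform_prior))
print(posterior_from_scores(equal_scores, skewed_prior))
```

When the likelihood scores tie, the prior alone decides the posterior, which illustrates why assuming a uniform prior can be inaccurate and why a learned prior model may help.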

[19] Precision Neural Network Quantization via Learnable Adaptive Modules

Wenqiang Zhou,Zhendong Yu,Xinyu Liu,Jiaming Yang,Rong Xiao,Tao Wang,Chenwei Tang,Jiancheng Lv

Main category: cs.CV

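TL;DR: Proposes Adaptive Step Size Quantization (ASQ), which dynamically adjusts quantization parameters and uses a non-uniform quantization scheme to improve quantization-aware training (QAT).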

Details

Motivation: In quantization-aware training (QAT), making quantization parameters trainable improves performance but compromises flexibility at inference, especially for activations with substantially different distributions. Method: ASQ dynamically adjusts the quantization scaling factors and adopts an exponential quantization scheme based on powers of the square root of two (POST), using a look-up table (LUT) to keep computation efficient. Result: With a 4-bit quantized ResNet34, ASQ improves ImageNet accuracy by 1.2%, outperforms existing QAT methods, and is even competitive with full-precision baselines. Conclusion: ASQ's adaptive quantization scheme resolves the flexibility conflict in QAT while maintaining high performance and computational efficiency. Abstract: Quantization Aware Training (QAT) is a neural network quantization technique that compresses model size and improves operational efficiency while effectively maintaining model performance. The paradigm of QAT is to introduce fake quantization operators during the training process, allowing the model to autonomously compensate for information loss caused by quantization. Making quantization parameters trainable can significantly improve the performance of QAT, but at the cost of compromising the flexibility during inference, especially when dealing with activation values with substantially different distributions. In this paper, we propose an effective learnable adaptive neural network quantization method, called Adaptive Step Size Quantization (ASQ), to resolve this conflict. Specifically, the proposed ASQ method first dynamically adjusts quantization scaling factors through a trained module capable of accommodating different activations. Then, to address the rigid resolution issue inherent in Power of Two (POT) quantization, we propose an efficient non-uniform quantization scheme. We utilize the Power Of Square root of Two (POST) as the basis for exponential quantization, effectively handling the bell-shaped distribution of neural network weights across various bit-widths while maintaining computational efficiency through a Look-Up Table method (LUT). Extensive experimental results demonstrate that the proposed ASQ method is superior to the state-of-the-art QAT approaches. Notably, ASQ is even competitive with full-precision baselines, with its 4-bit quantized ResNet34 model improving accuracy by 1.2% on ImageNet.
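
To make the idea of exponential quantization with base √2 concrete, here is a minimal sketch, not the authors' implementation. It assumes a symmetric signed code with one code reserved for zero, a largest magnitude of 1.0, and nearest-level rounding; the look-up table is simply a precomputed list of levels. The names `post_levels` and `quantize_post` are hypothetical.

```python
import math

def post_levels(bits: int) -> list[float]:
    """Look-up table of magnitudes: 0 and (sqrt 2)^(-k) for k = 0, 1, ...

    Assumption: one bit encodes the sign and one code is reserved for zero.
    """
    n_mag = 2 ** (bits - 1) - 1  # number of non-zero magnitudes
    return [0.0] + sorted(math.sqrt(2) ** (-k) for k in range(n_mag))

def quantize_post(x: float, scale: float, levels: list[float]) -> float:
    """Snap |x| / scale to the nearest table entry, restore sign and scale."""
    mag = min(abs(x) / scale, levels[-1])  # clip to the largest level
    nearest = min(levels, key=lambda q: abs(q - mag))
    return math.copysign(nearest, x) * scale

levels = post_levels(4)               # 0 plus 7 magnitudes from 1/8 to 1
q = quantize_post(0.3, 1.0, levels)   # lands on the level near 0.25
```

Adjacent levels differ by a factor of √2 rather than 2, so the grid is twice as fine in the log domain as a power-of-two grid covering the same range.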

[20] Towards Generalized and Training-Free Text-Guided Semantic Manipulation

Yu Hong,Xiao Cai,Pengpeng Zeng,Shuai Zhang,Jingkuan Song,Lianli Gao,Heng Tao Shen

Main category: cs.CV

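TL;DR: The paper proposes GTF, a new method for text-guided semantic image editing that supports multiple semantic manipulations and requires no training.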

Details

Motivation: Existing methods are inefficient, poorly extensible, and limited in generalizability, while the geometric properties of noises in diffusion models are strongly correlated with semantic changes. Method: Multiple semantic manipulations (e.g., addition, removal, and style transfer) are achieved by controlling the geometric relationship between noises, without fine-tuning or optimization. Result: Experiments show that GTF produces high-fidelity results efficiently and supports tasks across modalities. Conclusion: GTF shows potential to advance the state of the art in semantic manipulation. Abstract: Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt to match a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant contents. With the powerful generative capabilities of the diffusion model, the task has shown the potential to generate high-fidelity visual content. Nevertheless, existing methods typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noises in the diffusion model are strongly correlated with the semantic changes. Motivated by this, we propose a novel $\textit{GTF}$ for text-guided semantic manipulation, which has the following attractive capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., Plug-and-play) across different modalities (i.e., modality-agnostic); and 2) $\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results via simply controlling the geometric relationship between noises without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state-of-the-art in semantic manipulation.

[21] EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy

Haodi Yao,Fenghua He,Ning Hao,Chen Xie

Main category: cs.CV

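TL;DR: EdgePoint2 is a lightweight keypoint detection and description network designed for edge computing; it optimizes efficiency while maintaining high accuracy and comes in multiple sub-models to suit different requirements.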

Details

Motivation: Deep learning performs well at keypoint extraction, but its computational cost is high and high-dimensional descriptors are ill-suited to distributed applications, so compact and efficient methods are needed. Method: EdgePoint2 uses an optimized network architecture, trains compact descriptors with a combination of Orthogonal Procrustes loss and similarity loss, and provides 14 sub-models. Result: Experiments show that EdgePoint2 achieves SOTA accuracy and efficiency across a variety of scenarios while using low-dimensional descriptors (32/48/64). Conclusion: EdgePoint2 is a highly competitive option for visual tasks, especially where computation and communication are constrained. Abstract: The field of keypoint extraction, which is essential for vision applications like Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), has evolved from relying on handcrafted methods to leveraging deep learning techniques. While deep learning approaches have significantly improved performance, they often incur substantial computational costs, limiting their deployment in real-time edge applications. Efforts to create lightweight neural networks have seen some success, yet they often result in trade-offs between efficiency and accuracy. Additionally, the high-dimensional descriptors generated by these networks pose challenges for distributed applications requiring efficient communication and coordination, highlighting the need for compact yet competitively accurate descriptors. In this paper, we present EdgePoint2, a series of lightweight keypoint detection and description neural networks specifically tailored for edge computing applications on embedded systems. The network architecture is optimized for efficiency without sacrificing accuracy. To train compact descriptors, we introduce a combination of Orthogonal Procrustes loss and similarity loss, which can serve as a general approach for hypersphere embedding distillation tasks. Additionally, we offer 14 sub-models to satisfy diverse application requirements. Our experiments demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA) accuracy and efficiency across various challenging scenarios while employing lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2 offers significant advantages in flexibility, robustness, and versatility. Consequently, EdgePoint2 emerges as a highly competitive option for visual tasks, especially in contexts demanding adaptability to diverse computational and communication constraints.

[22] Advanced Segmentation of Diabetic Retinopathy Lesions Using DeepLabv3+

Meher Boulaabi,Takwa Ben Aïcha Gader,Afef Kacem Echi,Sameh Mbarek

Main category: cs.CV

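TL;DR: Proposes a binary segmentation method specific to each type of diabetic retinopathy lesion; combining the outputs of the individual models improves segmentation accuracy and eases parameter optimization.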

Details

Motivation: To address the challenges of dataset limitations and annotation complexity in segmenting diabetic retinopathy lesions. Method: The DeepLabv3+ model is used together with specific preprocessing (cropping and CLAHE) and data augmentation. Result: Segmentation accuracy reaches 99%. Conclusion: The method is effective for medical image analysis, particularly for precise segmentation of diabetic retinopathy lesions. Abstract: To improve the segmentation of diabetic retinopathy lesions (microaneurysms, hemorrhages, exudates, and soft exudates), we implemented a binary segmentation method specific to each type of lesion. As a post-segmentation step, we combined the individual model outputs into a single image to better analyze the lesion types. This approach facilitated parameter optimization and improved accuracy, effectively overcoming challenges related to dataset limitations and annotation complexity. Specific preprocessing steps included cropping and applying contrast-limited adaptive histogram equalization to the L channel of the LAB image. Additionally, we employed targeted data augmentation techniques to further refine the model's efficacy. Our methodology utilized the DeepLabv3+ model, achieving a segmentation accuracy of 99%. These findings highlight the efficacy of innovative strategies in advancing medical image analysis, particularly in the precise segmentation of diabetic retinopathy lesions. The IDRID dataset was utilized to validate and demonstrate the robustness of our approach.

[23] DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model

Zhanglin Wu,Tengfei Song,Ning Xie,Weidong Zhang,Pengfei Li,Shuang Wu,Chong Li,Junhao Zhu,Hao Yang

Main category: cs.CV

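TL;DR: Huawei Translation Service Center proposes an end-to-end document image machine translation framework based on multi-task learning and perceptual chain-of-thought, combined with minimum Bayesian decoding and post-processing, that handles OCR-based and OCR-free tasks in a unified way.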

Details

Motivation: To tackle end-to-end machine translation of document images with complex layouts and improve the capability of the translation system. Method: A training framework that combines multi-task learning with perceptual chain-of-thought, with minimum Bayesian decoding and post-processing strategies at inference. Result: The paper demonstrates an effective approach to document image machine translation that handles OCR-based and OCR-free tasks within one framework. Conclusion: The proposed framework performs well on translation of complex-layout documents and has practical potential. Abstract: This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.

[24] TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

Linli Yao,Yicheng Li,Yuancheng Wei,Lei Li,Shuhuai Ren,Yuanxin Liu,Kun Ouyang,Lean Wang,Shicheng Li,Sida Li,Lingpeng Kong,Qi Liu,Yuanxing Zhang,Xu Sun

Main category: cs.CV

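TL;DR: TimeChat-Online is a new online video large language model whose Differential Token Drop (DTD) module processes real-time video streams efficiently, filtering redundant visual content while maintaining high performance.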

Details

Motivation: The rapid growth of online video platforms calls for real-time video understanding systems, but existing VideoLLMs are limited in streaming scenarios because they cannot handle redundant frames efficiently. Method: A DTD module, inspired by human visual perception, preserves meaningful temporal changes and filters out redundant content. Result: DTD reduces video tokens by 82.8% while maintaining 98% of performance; TimeChat-Online performs strongly on streaming benchmarks and remains competitive on long-form video tasks. Conclusion: With DTD and its Proactive Response capability, TimeChat-Online offers an efficient solution for real-time video interaction. Abstract: The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online's superior performance on streaming benchmarks (StreamingBench and OvOBench) while maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
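
The idea of dropping tokens that do not change between frames can be sketched as follows. This is only an illustration under stated assumptions: tokens are compared with the co-located token of the previous frame using mean absolute difference, every token of the first frame is kept, and the input is non-empty. The function name `differential_token_drop` and the threshold value are hypothetical, and the actual DTD criterion may differ.

```python
def differential_token_drop(frames, threshold):
    """Return, for each frame, the indices of the tokens that are kept.

    frames: list of frames; each frame is a list of token feature vectors
            (lists of floats) with the same number of tokens per frame.
    """
    kept = [list(range(len(frames[0])))]  # first frame: keep every token
    for prev, cur in zip(frames, frames[1:]):
        idx = []
        for i, (p, c) in enumerate(zip(prev, cur)):
            # Mean absolute difference between co-located tokens.
            diff = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
            if diff > threshold:
                idx.append(i)
        kept.append(idx)
    return kept

frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 3.0]],  # only the second token changed
    [[0.0, 0.0], [1.0, 3.0]],  # nothing changed
]
kept = differential_token_drop(frames, threshold=0.5)
```

In this toy input, four of the six tokens are static and are dropped, so only the first frame and the single changed token would be passed on to the language model.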

[25] DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition

Yiyan Xu,Wuqiang Zheng,Wenjie Wang,Fengbin Zhu,Xinting Hu,Yang Zhang,Fuli Feng,Tat-Seng Chua

Main category: cs.CV

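TL;DR: The DRC framework uses disentangled representation learning to resolve the entanglement of style and semantics in personalized image generation, improving generation quality.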

Details

Motivation: Existing methods struggle to accurately fuse users' style preferences with their semantic intentions, so generated images fail to preserve preferred styles or reflect the specified semantics. Method: DRC uses Disentangled Representation Composition with two learning stages: disentanglement learning (separating style and semantic features) and personalized modeling (adapting the disentangled representations). Result: Experiments show that DRC effectively mitigates the guidance collapse issue and supports controllable personalized image generation. Conclusion: Disentangled representation learning is important for personalized image generation, and DRC offers an effective solution. Abstract: Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods -- whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) -- struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.

[26] I-INR: Iterative Implicit Neural Representations

Ali Haider,Muhammad Salman Ali,Maryam Qamar,Tahir Khalil,Soo Ye Kim,Jihyong Oh,Enzo Tartaglione,Sung-Ho Bae

Main category: cs.CV

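TL;DR: Proposes I-INRs, an iterative implicit neural representation framework that improves signal reconstruction quality through iterative refinement, addressing the weaknesses of conventional INRs in preserving detail and handling high-frequency information.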

Details

Motivation: Conventional implicit neural representations (INRs), formulated as regression problems, are prone to regression to the mean, which makes it hard to capture fine detail and high-frequency information and leaves them sensitive to noise. Method: The I-INRs framework enhances signal reconstruction through an iterative refinement process and is compatible with existing INR architectures. Result: Experiments show that I-INRs outperform baselines such as WIRE, SIREN, and Gauss on image restoration, denoising, and object occupancy prediction. Conclusion: I-INRs substantially improve signal reconstruction quality, particularly detail preservation and robustness to noise. Abstract: Implicit Neural Representations (INRs) have revolutionized signal processing and computer vision by modeling signals as continuous, differentiable functions parameterized by neural networks. However, their inherent formulation as a regression problem makes them prone to regression to the mean, limiting their ability to capture fine details, retain high-frequency information, and handle noise effectively. To address these challenges, we propose Iterative Implicit Neural Representations (I-INRs), a novel plug-and-play framework that enhances signal reconstruction through an iterative refinement process. I-INRs effectively recover high-frequency details, improve robustness to noise, and achieve superior reconstruction quality. Our framework seamlessly integrates with existing INR architectures, delivering substantial performance gains across various tasks. Extensive experiments show that I-INRs outperform baseline methods, including WIRE, SIREN, and Gauss, in diverse computer vision applications such as image restoration, image denoising, and object occupancy prediction.

[27] TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Ling You,Wenxuan Huang,Xinni Xie,Xiangyi Wei,Bangyan Li,Shaohui Lin,Yang Li,Changbo Wang

Main category: cs.CV

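TL;DR: TimeSoccer is an end-to-end soccer multimodal large language model for Single-anchor Dense Video Captioning on full-match videos; by jointly predicting timestamps and generating captions, it removes existing methods' reliance on temporal priors and their lack of global context.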

Details

Motivation: Existing soccer multimodal large language models rely on temporal priors or follow a complex two-step paradigm, so they cannot process long videos end-to-end and perform suboptimally. Method: TimeSoccer is combined with MoFA-Select, a frame compression module that adaptively selects representative frames through a coarse-to-fine strategy, and with complementary training paradigms that strengthen long-sequence handling. Result: Experiments show that TimeSoccer achieves state-of-the-art performance on the SDVC task, generating high-quality commentary with accurate temporal alignment. Conclusion: TimeSoccer provides an efficient end-to-end solution for soccer video captioning, with accurate temporal alignment and semantic relevance. Abstract: Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, while soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on temporal priors for caption generation, so they cannot process the soccer video end-to-end. Some traditional approaches instead follow a two-step paradigm that is complex and fails to capture the global context, resulting in suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and we incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.

[28] Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Oussema Dhaouadi,Johannes Meier,Luca Wahl,Jacques Kaiser,Luca Scalerandi,Nick Wandelburg,Zhuolun Zhou,Nijanthan Berinpanathan,Holger Banzhaf,Daniel Cremers

Main category: cs.CV

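TL;DR: The paper introduces the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6-degrees-of-freedom trajectories obtained by drone tracking, intended to improve the environment perception of autonomous driving systems.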

Details

Motivation: Traditional datasets, limited by vehicle-mounted sensors and occlusion, restrict how well autonomous driving systems can model their environment; DSC3D addresses this with drones. Method: A monocular camera drone tracking pipeline captures more than 175,000 trajectories of 14 types of traffic participants across a range of complex scenarios. Result: DSC3D exceeds existing datasets in diversity and scale and supports applications such as motion prediction and behavior modeling. Conclusion: DSC3D gives autonomous driving research a more complete representation of the environment and may improve system safety and interaction. Abstract: Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment in the close vicinity of the measurement vehicle only, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. The DSC3D dataset was captured at five locations in Europe and the United States: a parking lot, a crowded inner-city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at app.deepscenario.com, facilitating research in motion prediction, behavior modeling, and safety validation.

[29] SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting

Yiming Zhao,Guorong Li,Laiyun Qing,Amin Beheshti,Jian Yang,Michael Sheng,Yuankai Qi,Qingming Huang

Main category: cs.CV

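TL;DR: The SDVPT framework uses semantic-driven visual prompt tuning to improve the generalization of pre-trained vision-language models in open-world object counting.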

Details

Motivation: Existing methods perform well on training categories but generalize poorly to unseen ones. Method: A two-stage visual prompt learning strategy (CSPI and TGPR) is proposed, and visual prompts for unseen categories are synthesized dynamically at inference. Result: The effectiveness and adaptability of SDVPT are verified on the FSC-147, CARPK, and PUCPR+ datasets. Conclusion: SDVPT improves counting of unseen categories with minimal overhead in parameters and inference time. Abstract: Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to enable counting of arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories contained in training, which leads to limited generalizability for unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts, and then TGPR distills latent structural patterns from the VLM's text encoder to refine these prompts. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
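
One plausible way to synthesize a prompt for an unseen category from its semantic correlation with training categories is a similarity-weighted average of the learned prompts. The sketch below assumes cosine similarity between text embeddings and a softmax weighting with a temperature; neither choice is confirmed by the abstract, and `synthesize_prompt` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def synthesize_prompt(unseen_text, train_texts, train_prompts, temperature=0.1):
    """Blend training-category prompts, weighted by text-embedding similarity."""
    sims = [cosine(unseen_text, t) for t in train_texts]
    m = max(sims)
    exps = [math.exp((s - m) / temperature) for s in sims]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(train_prompts[0])
    prompt = [sum(w * p[d] for w, p in zip(weights, train_prompts))
              for d in range(dim)]
    return prompt, weights

train_texts = [[1.0, 0.0], [0.0, 1.0]]      # text embeddings of seen classes
train_prompts = [[1.0, 0.0], [0.0, 1.0]]    # their learned visual prompts
prompt, weights = synthesize_prompt([1.0, 0.0], train_texts, train_prompts)
```

Because no parameters are trained for the unseen category, a scheme of this kind adds only a similarity computation at inference time.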

[30] Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models

Francesc Marti-Escofet,Benedikt Blumenstiel,Linus Scheibenreif,Paolo Fraccaro,Konrad Schindler

Main category: cs.CV

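TL;DR: The paper studies parameter-efficient fine-tuning (PEFT) techniques for Earth observation (EO), showing experimentally that they can match or exceed full fine-tuning while requiring fewer computational resources.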

Details

Motivation: As foundation models grow, the computational cost and resource demands of full fine-tuning limit their accessibility and scalability, and full fine-tuning can cause forgetting of pre-trained features and degrade generalization. PEFT techniques offer a possible remedy. Method: Extensive experiments with various foundation model architectures and PEFT techniques on five different EO datasets evaluate their effectiveness. Result: PEFT techniques match or even exceed full fine-tuning performance while reducing training time and memory requirements and improving generalization to unseen geographic regions. UNet decoders and fine-tuning without metadata are the recommended configuration. Conclusion: PEFT is an efficient, scalable, and cost-effective approach to model adaptation; the evaluated models and techniques are integrated into the open-source package TerraTorch. Abstract: Earth observation (EO) is crucial for monitoring environmental changes, responding to disasters, and managing natural resources. In this context, foundation models facilitate remote sensing image analysis to retrieve relevant geoinformation accurately and efficiently. However, as these models grow in size, fine-tuning becomes increasingly challenging due to the associated computational resources and costs, limiting their accessibility and scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained features and even degrade model generalization. To address this, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution. In this paper, we conduct extensive experiments with various foundation model architectures and PEFT techniques to evaluate their effectiveness on five different EO datasets. Our results provide a comprehensive comparison, offering insights into when and how PEFT methods support the adaptation of pre-trained geospatial models. We demonstrate that PEFT techniques match or even exceed full fine-tuning performance and enhance model generalisation to unseen geographic regions, while reducing training time and memory requirements. Additional experiments investigate the effect of architecture choices such as the decoder type or the use of metadata, suggesting UNet decoders and fine-tuning without metadata as the recommended configuration. We have integrated all evaluated foundation models and techniques into the open-source package TerraTorch to support quick, scalable, and cost-effective model adaptation.

[31] S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception

Sven Teufel,Jörg Gamerdinger,Oliver Bringmann

Main category: cs.CV

TL;DR: S2S-Net addresses the Sensor2Sensor domain gap in collective perception for autonomous driving and achieves state-of-the-art performance on the SCOPE dataset.

Details

Motivation: To address the Sensor2Sensor domain gap caused by different sensor systems in Connected and Automated Vehicles (CAVs). Method: The sensor-domain robust architecture S2S-Net is proposed and analyzed in depth on the SCOPE dataset. Result: S2S-Net maintains high performance in unseen sensor domains and reaches state-of-the-art results on the SCOPE dataset. Conclusion: S2S-Net is the first method to address the Sensor2Sensor domain gap in V2V collective perception and has clear practical value. Abstract: Collective Perception (CP) has emerged as a promising approach to overcome the limitations of individual perception in the context of autonomous driving. Various approaches have been proposed to realize collective perception; however, the Sensor2Sensor domain gap that arises from the utilization of different sensor systems in Connected and Automated Vehicles (CAVs) remains mostly unaddressed. This is primarily due to the paucity of datasets containing heterogeneous sensor setups among the CAVs. The recently released SCOPE dataset addresses this issue by providing data from three different LiDAR sensors for each CAV. This study is the first to tackle the Sensor2Sensor domain gap in vehicle-to-vehicle (V2V) collective perception. First, we present our sensor-domain robust architecture S2S-Net. Then an in-depth analysis of the Sensor2Sensor domain adaptation capabilities of S2S-Net on the SCOPE dataset is conducted. S2S-Net demonstrates the capability to maintain very high performance in unseen sensor domains and achieves state-of-the-art results on the SCOPE dataset.

[32] StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies

Xu Wang,Jialang Xu,Shuai Zhang,Baoru Huang,Danail Stoyanov,Evangelos B. Mazomenos

Main category: cs.CV

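TL;DR: The StereoMamba architecture uses FE-Mamba and MFF modules to improve stereo disparity estimation in RAMIS, striking the best balance between accuracy, robustness, and speed.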

Details

Motivation: Current deep learning methods for stereo disparity estimation in RAMIS struggle to balance accuracy, robustness, and inference speed; StereoMamba aims to solve this. Method: An FE-Mamba module enhances long-range spatial dependencies, and an MFF module fuses multi-scale features. Result: On the SCARED benchmark the method performs strongly (EPE 2.64 px, depth MAE 2.55 mm) at an inference speed of 21.28 FPS, and it achieves the best SSIM and PSNR. Conclusion: StereoMamba achieves the best balance of accuracy, robustness, and efficiency and shows strong zero-shot generalization. Abstract: Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280×1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
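
For readers unfamiliar with the disparity metrics quoted above, the following sketch computes end-point error (EPE) and Bad-n on flat lists of per-pixel disparities. It uses the common definitions (mean absolute disparity error, and the percentage of pixels whose error exceeds n pixels); the strict `>` comparison and the absence of a valid-pixel mask are assumptions, and the benchmark's exact protocol may differ.

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    errs = [abs(p - g) for p, g in zip(pred, gt)]
    return sum(errs) / len(errs)

def bad_n(pred, gt, n):
    """Percentage of pixels whose disparity error exceeds n pixels."""
    errs = [abs(p - g) for p, g in zip(pred, gt)]
    return 100.0 * sum(e > n for e in errs) / len(errs)

pred = [10.0, 12.0, 15.0, 20.0]   # predicted disparities (toy values)
gt = [10.0, 11.0, 12.0, 16.0]     # ground-truth disparities (toy values)
mean_err = epe(pred, gt)          # errors are 0, 1, 3, 4
bad2 = bad_n(pred, gt, 2)
bad3 = bad_n(pred, gt, 3)
```

EPE summarizes average accuracy, while Bad-n counts outliers, which is why a method can rank first on one and second on the other.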

[33] 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models

Min Wei,Chaohui Yu,Jingkai Zhou,Fan Wang

Main category: cs.CV

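TL;DR: 3DV-TON is a diffusion-based framework for generating high-quality, temporally consistent video try-on results, using animatable textured 3D meshes as guidance and a rectangular masking strategy to overcome the shortcomings of existing methods.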

Details

Motivation: Existing video try-on methods struggle to maintain high quality and temporal consistency when handling complex clothing patterns and diverse body poses. Method: Generated animatable textured 3D meshes serve as frame-level guidance, produced by an adaptive pipeline for dynamic 3D guidance, and a rectangular masking strategy reduces artifact propagation. Result: On the HR-VVT dataset, 3DV-TON outperforms existing methods in both quantitative and qualitative evaluation. Conclusion: Through 3D guidance and the masking strategy, 3DV-TON improves the quality and consistency of video try-on and provides a new benchmark for research. Abstract: Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expense of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods. The project page is at https://2y7c3.github.io/3DV-TON/

[34] Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Tiancheng Gu,Kaicheng Yang,Ziyong Feng,Xingjun Wang,Yanzhao Zhang,Dingkun Long,Yingda Chen,Weidong Cai,Jiankang Deng

Main category: cs.CV

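TL;DR: UniME is a new two-stage framework that uses multimodal large language models (MLLMs) to learn transferable multimodal representations, addressing the limitations of CLIP and performing strongly across multiple tasks.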

Details

Motivation: CLIP框架在图像-文本检索和聚类中存在文本截断、孤立编码和组合性不足的问题，而MLLMs的潜力尚未充分挖掘。 Method: UniME通过两阶段方法：1) 从LLM教师模型进行文本知识蒸馏；2) 引入硬负样本增强的指令调优，提升判别性和组合性。 Result: 在MMEB基准和多个检索任务中，UniME表现一致优于现有方法，展示了更强的判别和组合能力。 Conclusion: UniME通过结合知识蒸馏和硬负样本调优，显著提升了多模态表示学习的性能，为下游任务提供了更强大的基础。 Abstract: The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored.In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM\'s language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. 
Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
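The second-stage idea of filtering likely false negatives and then keeping several hard negatives per instance can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_hard_negatives`, the `false_negative_margin` parameter, and the rule "drop any candidate scoring more than a margin above the positive" are hypothetical choices, not details given in the abstract.

```python
from typing import List, Sequence


def sample_hard_negatives(
    similarities: Sequence[float],
    positive_index: int,
    k: int,
    false_negative_margin: float = 0.1,  # hypothetical threshold, not from the paper
) -> List[int]:
    """Return indices of the k hardest in-batch negatives for one query.

    `similarities[i]` is the query's similarity to candidate i in the batch.
    Candidates scoring well above the positive are treated as likely false
    negatives and dropped (an assumed filtering rule); the remaining
    candidates are ranked by similarity and the top k are kept.
    """
    threshold = similarities[positive_index] + false_negative_margin
    candidates = [
        (score, index)
        for index, score in enumerate(similarities)
        if index != positive_index and score <= threshold
    ]
    # Hardest negatives first: highest similarity, ties broken by index.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in candidates[:k]]
```

In a contrastive loss, only the returned indices would then contribute negative terms for that query, so the model is pushed to separate the positive from its closest confusable candidates.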

[35] Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding

Mingxuan Wu,Huang Huang,Justin Kerr,Chung Min Kim,Anthony Zhang,Brent Yi,Angjoo Kanazawa

Main category: cs.CV

TL;DR: The POD framework improves itself through a predict-optimize-distill cycle, achieving better 4D object understanding.

Details

Motivation: Humans improve their ability to predict an object's 3D state through long-form observation, whereas existing systems rely on multi-view observations or supervised data. POD aims to close this gap with a self-improving framework. Method: POD runs a cycle of prediction, optimization, and distillation, using a multi-view scan and a monocular video and improving the model through inverse rendering and self-generated data. Result: POD outperforms a pure optimization baseline on real and synthetic objects, and its performance improves with video length and number of iterations. Conclusion: POD shows that performance can be scaled with observation time and looped refinement, and it handles complex joints and multi-body configurations. Abstract: Humans can resort to long-form inspection to build intuition on predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos.
We also find that POD's performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.

[36] FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding

De-An Huang,Subhashree Radhakrishnan,Zhiding Yu,Jan Kautz

Main category: cs.CV

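TL;DR: The paper proposes FRAG, a framework that selects relevant frames from the input instead of processing the full long context, substantially improving large multimodal models on long-video and multi-page document tasks.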

Details

Motivation: Current long-context multimodal models are limited by high computational cost, so the authors explore an approach that does not require long-context processing. Method: FRAG scores each frame independently, keeps the highest-scoring frames through Top-K selection, and generates the output from the selected frames only. Result: Experiments show that FRAG substantially improves model performance and reaches state-of-the-art results on long-video and long-document tasks. Conclusion: FRAG is a simple and effective framework that works with existing models without fine-tuning and substantially improves performance on long-input tasks. Abstract: There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performance for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: https://github.com/NVlabs/FRAG
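The selection step described above (independent per-frame scoring followed by Top-K) can be sketched in a few lines. This is a minimal illustration, not the released FRAG code: `select_top_k_frames` and `score_fn` are hypothetical names, the scorer stands in for whatever relevance score an LMM would produce, and returning the kept frames in their original order is an assumption.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def select_top_k_frames(
    frames: Sequence[Frame],
    score_fn: Callable[[Frame], float],  # hypothetical per-frame relevance scorer
    k: int,
) -> List[Frame]:
    """Score every frame on its own and keep the k highest-scoring ones."""
    if k <= 0:
        return []
    # Each frame is scored independently, so no long-context pass is needed.
    scored = [(score_fn(frame), index) for index, frame in enumerate(frames)]
    # Highest score first; ties go to the earlier frame.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:k]
    # Restore temporal (or page) order before handing frames to the generator.
    kept_indices = sorted(index for _, index in top)
    return [frames[index] for index in kept_indices]
```

Because the scorer sees one frame at a time, its cost grows linearly with input length, and only the k selected frames are passed on for answer generation.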

[37] Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks

Zhiying Li,Yeying Jin,Fan Shen,Zhi Liu,Weibin Chen,Pengju Zhang,Xiaomei Zhang,Boyu Chen,Michael Shen,Kejian Wu,Zhaoxin Fan,Jin Dong

Main category: cs.CV

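TL;DR: The paper proposes Tangible Attack (TBA), a new framework that uses a Dual Heterogeneous Noise Generator (DHNG) and a custom adversarial loss function to substantially strengthen adversarial attacks on digital human generation models.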

Details

Motivation: Existing research focuses mainly on reducing estimation error and neglects robustness and security, leaving systems vulnerable to adversarial attacks. Method: The TBA framework combines DHNG (which uses a VAE and ControlNet to generate diverse noise) with a custom adversarial loss function, and iteratively refines adversarial samples using multi-gradient signals. Result: Experiments show that TBA substantially improves attack effectiveness, with a 41.0% increase in estimation error and an average improvement of about 17.0%. Conclusion: Current EHPS models have serious security vulnerabilities, and digital human generation systems need stronger defenses. Abstract: Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the Tangible Attack (TBA), a novel framework designed to generate adversarial examples capable of effectively compromising any digital human generation model. Our approach introduces a Dual Heterogeneous Noise Generator (DHNG), which leverages Variational Autoencoders (VAE) and ControlNet to produce diverse, targeted noise tailored to the original image features. Additionally, we design a custom adversarial loss function to optimize the noise, ensuring both high controllability and potent disruption. By iteratively refining the adversarial sample through multi-gradient signals from both the noise and the state-of-the-art EHPS model, TBA substantially improves the effectiveness of adversarial attacks. Extensive experiments demonstrate TBA's superiority, achieving a remarkable 41.0% increase in estimation error, with an average improvement of approximately 17.0%. These findings expose significant security vulnerabilities in current EHPS models and highlight the need for stronger defenses in digital human generation systems.

[38] Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled yet Hard-to-Learn Samples in Noisy Data

Weiran Pan,Wei Wei,Feida Zhu,Yong Deng

Main category: cs.CV

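TL;DR: Proposes a new sample selection method based on trends in prediction confidence, addressing sample selection for image classification with noisy labels.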

Details

Motivation: Existing methods usually treat small-loss samples as correctly labeled, but some correctly labeled samples are hard to learn and therefore have high loss, which creates a trade-off between precision and recall in sample selection. Method: The method tracks trends in the model's prediction confidence (using the Mann-Kendall test) to distinguish correctly labeled but hard-to-learn samples from mislabeled ones. Result: Experiments show that the method improves the performance of existing approaches to learning with noisy labels. Conclusion: The confidence-trend approach alleviates the sample selection trade-off and can be integrated into existing techniques as a plug-in. Abstract: We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider small-loss samples as correctly labeled. However, some correctly labeled samples are inherently difficult for the model to learn and can exhibit high loss similar to mislabeled samples in the early stages of training. Consequently, setting a threshold on per-sample loss to select correct labels results in a trade-off between precision and recall in sample selection: a lower threshold may miss many correctly labeled hard-to-learn samples (low recall), while a higher threshold may include many mislabeled samples (low precision). To address this issue, our goal is to accurately distinguish correctly labeled yet hard-to-learn samples from mislabeled ones, thus alleviating the trade-off dilemma. We achieve this by considering the trends in model prediction confidence rather than relying solely on loss values. Empirical observations show that only for correctly labeled samples, the model's prediction confidence for the annotated labels typically increases faster than for any other classes. Based on this insight, we propose tracking the confidence gaps between the annotated labels and other classes during training and evaluating their trends using the Mann-Kendall Test. A sample is considered potentially correctly labeled if all its confidence gaps tend to increase. Our method functions as a plug-and-play component that can be seamlessly integrated into existing sample selection techniques. Experiments on several standard benchmarks and real-world datasets demonstrate that our method enhances the performance of existing methods for learning with noisy labels.
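The trend check at the heart of this method can be illustrated with a plain Mann-Kendall statistic. The sketch below is an assumption-laden illustration rather than the authors' implementation: the function names, the one-sided critical value `z_crit=1.645`, and the use of the variance formula without tie correction are all choices made here for brevity.

```python
import math
from typing import Sequence


def mann_kendall_z(series: Sequence[float]) -> float:
    """Mann-Kendall Z statistic (variance formula without tie correction)."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 observations")
    # S counts increasing pairs minus decreasing pairs over all i < j.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    variance = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        return (s - 1) / math.sqrt(variance)
    if s < 0:
        return (s + 1) / math.sqrt(variance)
    return 0.0


def looks_correctly_labeled(
    gap_histories: Sequence[Sequence[float]],
    z_crit: float = 1.645,  # assumed one-sided 5% critical value
) -> bool:
    """True if every confidence gap shows a significant increasing trend.

    Each inner sequence holds, per epoch, the confidence of the annotated
    label minus the confidence of one other class.
    """
    return all(mann_kendall_z(gaps) > z_crit for gaps in gap_histories)
```

A sample whose gap to even one other class is flat or shrinking fails the check, which matches the abstract's requirement that all confidence gaps tend to increase.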

[39] RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Aviv Slobodkin,Hagai Taitelbaum,Yonatan Bitton,Brian Gordon,Michal Sokolik,Nitzan Bitton Guetta,Almog Gueta,Royi Rassin,Itay Laish,Dani Lischinski,Idan Szpektor

Main category: cs.CV

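TL;DR: The paper proposes RefVNLI, a low-cost evaluation metric that assesses textual alignment and subject preservation together and outperforms or matches existing methods.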

Details

Motivation: Existing evaluation methods either cover only textual alignment or only subject preservation, disagree with human judgments, or depend on costly API-based evaluation. Method: RefVNLI is trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations. Result: RefVNLI performs strongly across multiple benchmarks and subject categories, with gains of up to 6.4 points in textual alignment and 8.5 points in subject consistency, and agrees with human preferences at over 87% accuracy on lesser-known concepts. Conclusion: RefVNLI is an efficient, low-cost metric suited to subject-driven text-to-image generation. Abstract: Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.

[40] Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation

Zihan Cheng,Jintao Guo,Jian Zhang,Lei Qi,Luping Zhou,Yinghuan Shi,Yang Gao

Main category: cs.CV

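TL;DR: The paper proposes Mamba-Sea, a new Mamba-based framework for domain generalization in medical image segmentation, which uses global-to-local sequence augmentation to improve generalization under domain shift.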

Details

Motivation: Existing domain generalization methods are mainly based on CNN or ViT architectures, while Mamba models perform well in medical image segmentation thanks to their long-range dependency modeling and linear complexity; the paper therefore explores Mamba's potential for domain generalization. Method: The Mamba-Sea framework combines global and local sequence augmentation: the global augmentation simulates appearance variations across sites to suppress learning of domain-specific information, and the local augmentation perturbs the style of tokens within continuous sub-sequences to improve robustness to domain shift. Result: Mamba-Sea is the first method to exceed a 90% Dice coefficient on the Prostate dataset, surpassing the previous SOTA of 88.61%. Conclusion: Mamba-Sea is the first work to explore Mamba for domain generalization in medical image segmentation and shows strong robustness to domain shift. Abstract: To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity with input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by this success, in this paper we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model's generalizability under domain shift issues. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model's learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To the best of our knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts.
Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, which exceeds previous SOTA of 88.61%. The code is available at https://github.com/orange-czh/Mamba-Sea.
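For reference, the Dice coefficient quoted here is the standard overlap measure between a predicted segmentation mask and the ground-truth mask. The symbols $P$ (predicted foreground pixels) and $G$ (ground-truth foreground pixels) are notation introduced for this note, not taken from the paper.

```latex
\[
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}
\]
```

Under this definition the score ranges from 0 (no overlap) to 1 (identical masks), so a Dice of 90% means that twice the overlapping area equals 90% of the combined size of the two masks.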

[41] Towards One-Stage End-to-End Table Structure Recognition with Parallel Regression for Diverse Scenarios

Anyi Xiao,Cihui Yang

Main category: cs.CV

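TL;DR: TableCenterNet is a one-stage end-to-end table structure parsing network that unifies the prediction of a table's spatial and logical structure and, through a synergistic architecture of shared feature extraction layers and task-specific decoding, achieves efficient and robust table parsing.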

Details

Motivation: Existing methods struggle to balance cross-scenario adaptability, robustness, and computational efficiency; TableCenterNet aims to address this. Method: The paper proposes TableCenterNet, a one-stage end-to-end network that predicts spatial and logical table structure jointly, using shared feature extraction layers and task-specific decoding for efficient parsing. Result: It achieves state-of-the-art performance on the TableGraph-24k dataset, with more efficient training and inference. Conclusion: TableCenterNet is efficient and robust for table structure parsing and suits diverse scenarios. Abstract: Table structure recognition aims to parse tables in unstructured data into machine-understandable formats. Recent methods address this problem through a two-stage process or optimized one-stage approaches. However, these methods either require multiple networks to be serially trained and perform more time-consuming sequential decoding, or rely on complex post-processing algorithms to parse the logical structure of tables. They struggle to balance cross-scenario adaptability, robustness, and computational efficiency. In this paper, we propose a one-stage end-to-end table structure parsing network called TableCenterNet. This network unifies the prediction of table spatial and logical structure into a parallel regression task for the first time, and implicitly learns the spatial-logical location mapping laws of cells through a synergistic architecture of shared feature extraction layers and task-specific decoding. Compared with two-stage methods, our method is easier to train and faster to infer. Experiments on benchmark datasets show that TableCenterNet can effectively parse table structures in diverse scenarios and achieve state-of-the-art performance on the TableGraph-24k dataset. Code is available at https://github.com/dreamy-xay/TableCenterNet.

[42] ESDiff: Encoding Strategy-inspired Diffusion Model with Few-shot Learning for Color Image Inpainting

Junyan Zhang,Yan Li,Mengxiao Geng,Liu Shi,Qiegen Liu

Main category: cs.CV

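TL;DR: The paper proposes an encoding strategy-inspired diffusion model for few-shot color image inpainting, which uses a virtual mask to construct high-dimensional objects and improves inpainting quality and detail preservation.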

Details

Motivation: Traditional inpainting methods struggle to preserve complex details, while deep learning models need large amounts of training data. The paper addresses both problems with a few-shot learning approach. Method: An encoding strategy uses a virtual mask to construct high-dimensional objects and is combined with low-rank methods and a diffusion model to achieve high-quality inpainting from limited samples. Result: Experiments show that the method outperforms existing techniques on quantitative metrics, and the inpainted images show better texture and structural integrity. Conclusion: By combining the encoding strategy with a diffusion model, the method achieves high-quality few-shot image inpainting and has practical potential. Abstract: Image inpainting is a technique used to restore missing or damaged regions of an image. Traditional methods primarily utilize information from adjacent pixels for reconstructing missing areas, while they struggle to preserve complex details and structures. Simultaneously, models based on deep learning necessitate substantial amounts of training data. To address this challenge, an encoding strategy-inspired diffusion model with few-shot learning for color image inpainting is proposed in this paper. The main idea of this novel encoding strategy is the deployment of a "virtual mask" to construct high-dimensional objects through mutual perturbations between channels. This approach enables the diffusion model to capture diverse image representations and detailed features from limited training samples. Moreover, the encoding strategy leverages redundancy between channels, integrates with low-rank methods during iterative inpainting, and incorporates the diffusion model to achieve accurate information output. Experimental results indicate that our method exceeds current techniques in quantitative metrics, and the quality of the reconstructed images is improved in terms of texture and structural integrity, leading to more precise and coherent results.

[43] Text-to-Image Alignment in Denoising-Based Models through Step Selection

Paul Grimal,Hervé Le Borgne,Olivier Ferret

Main category: cs.CV

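TL;DR: Proposes a method that selectively enhances the signal at critical denoising steps, optimizing image generation based on the semantics of the input.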

Details

Motivation: To address text-image alignment problems and reasoning limitations in visual generative AI models. Method: The signal is adjusted at later stages of the denoising process rather than at early stages. Result: The method achieves state-of-the-art performance on Diffusion and Flow Matching models and generates semantically aligned images. Conclusion: Choosing the right sampling stage is crucial for improving performance and image alignment. Abstract: Visual generative AI models often encounter challenges related to text-image alignment and reasoning limitations. This paper presents a novel method for selectively enhancing the signal at critical denoising steps, optimizing image generation based on input semantics. Our approach addresses the shortcomings of early-stage signal modifications, demonstrating that adjustments made at later stages yield superior results. We conduct extensive experiments to validate the effectiveness of our method in producing semantically aligned images on Diffusion and Flow Matching models, achieving state-of-the-art performance. Our results highlight the importance of a judicious choice of sampling stage to improve performance and overall image alignment.

[44] An Explainable Nature-Inspired Framework for Monkeypox Diagnosis: Xception Features Combined with NGBoost and African Vultures Optimization Algorithm

Ahmadreza Shateri,Negar Nourani,Morteza Dorrigiv,Hamid Nasiri

Main category: cs.CV

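TL;DR: The paper proposes a deep learning-based framework for automated monkeypox detection from skin lesion images, combining transfer learning, dimensionality reduction, and an optimization algorithm to achieve high accuracy and interpretability.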

Details

Motivation: The global spread of monkeypox has raised public health concerns, and early, accurate diagnosis is critical for disease management. Method: Features are extracted with the Xception architecture, reduced with PCA, and classified with NGBoost, with AVOA used to tune hyperparameters. Result: The model reaches 97.53% accuracy, a 97.72% F1-score, and a 97.47% AUC. Conclusion: The framework provides an efficient diagnostic tool for resource-constrained settings and supports early detection. Abstract: The recent global spread of monkeypox, particularly in regions where it has not historically been prevalent, has raised significant public health concerns. Early and accurate diagnosis is critical for effective disease management and control. In response, this study proposes a novel deep learning-based framework for the automated detection of monkeypox from skin lesion images, leveraging the power of transfer learning, dimensionality reduction, and advanced machine learning techniques. We utilize the newly developed Monkeypox Skin Lesion Dataset (MSLD), which includes images of monkeypox, chickenpox, and measles, to train and evaluate our models. The proposed framework employs the Xception architecture for deep feature extraction, followed by Principal Component Analysis (PCA) for dimensionality reduction, and the Natural Gradient Boosting (NGBoost) algorithm for classification. To optimize the model's performance and generalization, we introduce the African Vultures Optimization Algorithm (AVOA) for hyperparameter tuning, ensuring efficient exploration of the parameter space. Our results demonstrate that the proposed AVOA-NGBoost model achieves state-of-the-art performance, with an accuracy of 97.53%, F1-score of 97.72% and an AUC of 97.47%. Additionally, we enhance model interpretability using Grad-CAM and LIME techniques, providing insights into the decision-making process and highlighting key features influencing classification. This framework offers a highly precise and efficient diagnostic tool, potentially aiding healthcare providers in early detection and diagnosis, particularly in resource-constrained environments.

[45] When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field Rendering

Keyang Ye,Tianjia Shao,Kun Zhou

Main category: cs.CV

TL;DR: Gaussian-enhanced Surfels (GESs) are a bi-scale representation that combines 2D opaque surfels with 3D Gaussians for fast, high-fidelity radiance field rendering.

Details

Motivation: To address shortcomings of existing methods in fast rendering of high-fidelity scenes, such as artifacts under view changes. Method: Two-pass rendering (surfel rasterization followed by Gaussian splatting) and a coarse-to-fine optimization procedure yield efficient, view-consistent rendering. Result: GESs outperform existing methods in speed and image quality and support several extensions (for example anti-aliasing and accelerated rendering). Conclusion: GESs are an efficient and flexible representation for ultra-fast high-fidelity radiance field rendering. Abstract: We introduce Gaussian-enhanced Surfels (GESs), a bi-scale representation for radiance field rendering, wherein a set of 2D opaque surfels with view-dependent colors represent the coarse-scale geometry and appearance of scenes, and a few 3D Gaussians surrounding the surfels supplement fine-scale appearance details. The rendering with GESs consists of two passes -- surfels are first rasterized through a standard graphics pipeline to produce depth and color maps, and then Gaussians are splatted with depth testing and color accumulation on each pixel in an order-independent manner. The optimization of GESs from multi-view images is performed through an elaborate coarse-to-fine procedure, faithfully capturing rich scene appearance. The entirely sorting-free rendering of GESs not only achieves very fast rates, but also produces view-consistent images, successfully avoiding popping artifacts under view changes. The basic GES representation can be easily extended to achieve anti-aliasing in rendering (Mip-GES), boosted rendering speeds (Speedy-GES) and compact storage (Compact-GES), and reconstruct better scene geometries by replacing 3D Gaussians with 2D Gaussians (2D-GES). Experimental results show that GESs advance the state of the art as a compelling representation for ultra-fast high-fidelity radiance field rendering.

[46] A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task

Jiaqi Deng,Zonghan Wu,Huan Huo,Guandong Xu

Main category: cs.CV

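TL;DR: This paper is a survey of knowledge-based visual question answering (KB-VQA) that systematically organizes existing methods and proposes a taxonomy covering knowledge representation, retrieval, and reasoning.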

Details

Motivation: KB-VQA combines visual input, text, and broad knowledge and has significant practical value, but no systematic survey of it exists. Method: The survey builds a taxonomy that divides KB-VQA systems into three stages (knowledge representation, retrieval, and reasoning) and examines knowledge integration techniques. Result: The paper summarizes existing methods, identifies challenges, and proposes future research directions. Conclusion: The survey lays a foundation for further development of KB-VQA models. Abstract: Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by not only requiring the understanding of visual and textual inputs but also an extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches, and categorizing the systems into main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.

[47] Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior

Lin Che,Yizi Chen,Tanhua Jin,Martin Raubal,Konrad Schindler,Peter Kiefer

Main category: cs.CV

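TL;DR: Proposes an unsupervised contrastive clustering model for street view images, combined with a geographical prior, for land use classification and mapping in complex urban environments.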

Details

Motivation: Existing remote sensing techniques lack precision in complex urban environments because they miss ground-level detail, whereas street view images capture more human and social activity; however, existing supervised classification methods are limited by the scarcity of high-quality labeled data and by poor generalization. Method: The paper introduces an unsupervised contrastive clustering model with a geographical prior and assigns clusters through a simple visual step, giving a flexible land use mapping solution. Result: Experiments show that the method can generate land use maps from street view image datasets of two cities and is adaptable and scalable. Conclusion: Because it relies on the spatial coherence of geospatial data, the method applies wherever street view images are available and supports unsupervised land use mapping and updating. Abstract: Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data ("Tobler's law"), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at https://github.com/lin102/CCGP.

[48] Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images

Zebo Huang,Yinghui Wang

Main category: cs.CV

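TL;DR: Proposes a self-supervised monocular depth estimation network for endoscopic scenes that improves depth reconstruction quality through an occlusion-aware framework and semantic segmentation.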

Details

Motivation: Existing methods assume consistent illumination, but dynamic lighting and occlusions in endoscopic scenes cause incorrect geometric interpretations and unreliable self-supervised signals. Method: The paper introduces an occlusion-aware framework that includes occlusion-mask data augmentation and pseudo-label generation via semantic segmentation guided by non-negative matrix factorization. Result: The method reaches SOTA performance on the SCARED dataset and generalizes well to the Endo-SLAM and SERV-CT datasets. Conclusion: The method substantially improves self-supervised depth estimation in endoscopic scenes. Abstract: We propose a self-supervised monocular depth estimation network tailored for endoscopic scenes, aiming to infer depth within the gastrointestinal tract from monocular images. Existing methods, though accurate, typically assume consistent illumination, which is often violated due to dynamic lighting and occlusions caused by GI motility. These variations lead to incorrect geometric interpretations and unreliable self-supervised signals, degrading depth reconstruction quality. To address this, we introduce an occlusion-aware self-supervised framework. First, we incorporate an occlusion mask for data augmentation, generating pseudo-labels by simulating viewpoint-dependent occlusion scenarios. This enhances the model's ability to learn robust depth features under partial visibility. Second, we leverage semantic segmentation guided by non-negative matrix factorization, clustering convolutional activations to generate pseudo-labels in texture-deprived regions, thereby improving segmentation accuracy and mitigating information loss from lighting changes. Experimental results on the SCARED dataset show that our method achieves state-of-the-art performance in self-supervised depth estimation. Additionally, evaluations on the Endo-SLAM and SERV-CT datasets demonstrate strong generalization across diverse endoscopic environments.

[49] Tamper-evident Image using JPEG Fixed Points

Zhaofeng Si,Siwei Lyu

Main category: cs.CV

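TL;DR: JPEG compression has fixed points: after repeated compression and decompression the image stops changing. The paper proves that these fixed points exist and uses them to build a tamper-evident image method.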

Details

Motivation: To study the fixed-point phenomenon in JPEG compression and explore its use for tamper-evident images. Method: The paper analyzes the JPEG compression and decompression processes, proves that fixed points exist, and develops a tamper-evidence method based on them. Result: Fixed points exist, are diverse, preserve the image's visual quality, and can be used to detect tampering operations. Conclusion: The fixed-point phenomenon offers a new approach to tamper-evident images with practical potential. Abstract: An intriguing phenomenon in JPEG compression has been observed for two decades: after repeated JPEG compression and decompression, the image converges to a stable image that no longer changes, which is a fixed point. In this work, we prove the existence of fixed points in the essential JPEG procedures. We analyze JPEG compression and decompression processes, revealing the existence of fixed points that can be reached within a few iterations. These fixed points are diverse and preserve the image's visual quality, ensuring minimal distortion. This result is used to develop a method to create a tamper-evident image from the original authentic image, which can expose tampering operations by showing deviations from the fixed point image.
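The fixed-point behaviour can be reproduced on a toy model of the lossy JPEG steps. The sketch below is a simplified stand-in, not the paper's procedure: it uses a single 8-sample one-dimensional block with an orthonormal DCT, one uniform quantization `step` for all coefficients, and rounding plus clipping to 0..255 in the pixel domain; real JPEG works on 8x8 blocks with a quantization table and color conversion. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

N = 8  # toy block length; real JPEG uses 8x8 two-dimensional blocks


def _scale(k: int) -> float:
    # Normalization that makes the DCT-II basis orthonormal.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct(x: Sequence[float]) -> List[float]:
    """Orthonormal DCT-II of an N-sample block."""
    return [
        _scale(k) * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def idct(c: Sequence[float]) -> List[float]:
    """Inverse of `dct` (orthonormal DCT-III)."""
    return [
        sum(_scale(k) * c[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
        for n in range(N)
    ]


def round_trip(pixels: Sequence[int], step: int = 10) -> List[int]:
    """One compress/decompress cycle: quantize coefficients, then round and clip pixels."""
    quantized = [step * round(c / step) for c in dct(pixels)]
    return [min(255, max(0, round(v))) for v in idct(quantized)]


def find_fixed_point(pixels: Sequence[int], step: int = 10, max_iters: int = 1000) -> Tuple[List[int], int]:
    """Repeat the round trip until the block stops changing."""
    current = list(pixels)
    for iteration in range(max_iters):
        following = round_trip(current, step)
        if following == current:
            return current, iteration
        current = following
    raise RuntimeError("no fixed point reached within max_iters")
```

A tamper check in this toy setting would store or publish the fixed-point block and later test whether `round_trip(block) == block`; an edited block will generally fail that equality, although a different fixed point would pass, so this sketch only illustrates the idea.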

[50] RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network

Boyue Xu,Yi Xu,Ruichao Hou,Jia Bei,Tongwei Ren,Gangshan Wu

Main category: cs.CV

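TL;DR: The HMAD network improves the robustness and real-time performance of RGB-D tracking through hierarchical modality aggregation and distribution.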

Details

Motivation: Current RGB-D trackers are inefficient and focus only on single-level features, which leads to weak fusion robustness and slow speeds that do not meet real-world requirements. Method: The paper proposes the HMAD network, which exploits the distinct feature representation strengths of the RGB and depth modalities and uses a hierarchical approach to feature distribution and fusion. Result: HMAD achieves state-of-the-art performance on several RGB-D datasets and handles a range of tracking challenges in real-time scenarios. Conclusion: HMAD substantially improves the robustness and real-time performance of RGB-D tracking and suits real-world applications. Abstract: The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and focus solely on single-level features, resulting in weaker robustness in fusion and slower speeds that fail to meet the demands of real-world applications. In this paper, we introduce a novel network, denoted as HMAD (Hierarchical Modality Aggregation and Distribution), which addresses these challenges. HMAD leverages the distinct feature representation strengths of RGB and depth modalities, giving prominence to a hierarchical approach for feature distribution and fusion, thereby enhancing the robustness of RGB-D tracking. Experimental results on various RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD's capacity to effectively handle a spectrum of tracking challenges in real-time scenarios.

[51] STCL: Curriculum learning Strategies for deep learning image steganography models

Fengchun Liu,Tong Zhang,Chunying Zhang

Main category: cs.CV

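TL;DR: Proposes STCL, a curriculum learning-based training strategy for image steganography that improves model performance by training progressively from easy to difficult images.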

Details

Motivation: To address poor image quality and slow network convergence in deep learning image steganography models. Method: (1) A difficulty evaluation strategy based on teacher models; (2) a knee point-based training scheduling strategy. Result: Validated on multiple datasets, the strategy improves model performance, producing high-quality steganographic images with low steganalysis scores. Conclusion: The STCL strategy improves both the training efficiency and the performance of image steganography models. Abstract: Aiming at the problems of poor quality of steganographic images and slow network convergence of image steganography models based on deep learning, this paper proposes a Steganography Curriculum Learning training strategy (STCL) for deep learning image steganography models. The strategy selects only easy images for training in the initial stage, when the model's fitting ability is poor, and gradually expands to more difficult images; it comprises a difficulty evaluation strategy based on teacher models and a knee point-based training scheduling strategy. Firstly, multiple teacher models are trained, and the consistency of the quality of steganographic images under multiple teacher models is used as the difficulty score to construct the training subsets from easy to difficult. Secondly, a training control strategy based on knee points is proposed to reduce the possibility of overfitting on small training sets and accelerate the training process. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed image steganography scheme is able to improve the model performance under multiple algorithmic frameworks, yielding high PSNR, SSIM, and decoding accuracy, while the steganographic images generated by models trained with the STCL strategy receive low steganalysis scores. Code is available at https://github.com/chaos-boops/STCL.

[52] Enhancing CNNs robustness to occlusions with bioinspired filters for border completion

Catarina P. Coutinho,Aneeqa Merhab,Janko Petkovic,Ferdinando Zanchetta,Rita Fioresi

Main category: cs.CV

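TL;DR: Uses a mathematical model of border completion in the visual cortex to design CNN filters, improving performance in tests on occluded MNIST images.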

Details

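Motivation: To explore whether visual cortex mechanisms can improve CNN performance. Method: Custom filters are designed from a mathematical model of border completion in the visual cortex and used to modify LeNet 5. Result: Accuracy improves consistently in tests on occluded MNIST images. Conclusion: Modeling visual cortex mechanisms can improve CNN performance. Abstract: We exploit the mathematical modeling of the visual cortex mechanism for border completion to define custom filters for CNNs. We see a consistent improvement in performance, particularly in accuracy, when our modified LeNet 5 is tested with occluded MNIST images.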

[53] Improving Open-World Object Localization by Discovering Background

Ashish Singh,Michael J. Jones,Kuan-Chuan Peng,Anoop Cherian,Moitreya Chatterjee,Erik Learned-Miller

Main category: cs.CV

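TL;DR: Proposes a new framework that uses background information to guide object localization, improving open-world localization by identifying non-discriminative regions.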

Details

Motivation: To address open-world object localization: training uses bounding boxes for only a limited set of classes, while inference must localize objects of all classes, including unseen ones. Method: A framework that discovers background (non-discriminative) regions in an image and trains an object proposal network to avoid detecting objects in those regions. Result: Outperforms existing methods on standard benchmarks. Conclusion: Background information effectively improves open-world object localization. Abstract: Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly, by proposing new objective functions (localization quality), or implicitly, using object-centric auxiliary information such as depth information or pixel/region affinity maps. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and constitute low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.

[54] A Guide to Structureless Visual Localization

Vojtech Panek,Qunjie Zhou,Yaqing Ding,Sérgio Agostinho,Zuzana Kukelova,Torsten Sattler,Laura Leal-Taixé

Main category: cs.CV

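TL;DR: This paper surveys and compares structureless visual localization methods, finding that approaches based on classical geometric reasoning achieve higher pose accuracy than pose regression, while the flexibility of structureless methods comes at the cost of slightly lower accuracy.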

Details

Motivation: Visual localization algorithms are essential for applications such as self-driving cars and augmented / mixed reality, but existing structure-based methods are inflexible and structureless methods are less studied; this paper aims to fill that gap. Method: Store images with known poses in a database, estimate the camera pose from 2D-2D correspondences, and compare the performance of different structureless methods. Result: Experiments show that approaches based on classical geometric reasoning (e.g., absolute or semi-generalized relative pose estimation) are significantly more accurate than pose regression methods. Conclusion: Structureless methods are more flexible than structure-based ones but slightly less accurate, which points to a direction for future work. Abstract: Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in the model for camera pose estimation. While such approaches are highly accurate, they are also rather inflexible when it comes to adjusting the underlying 3D model after changes in the scene. Structureless localization approaches represent the scene as a database of images with known poses and thus offer a much more flexible representation that can be easily updated by adding or removing images. Although there is a large amount of literature on structure-based approaches, there is significantly less work on structureless methods. Hence, this paper is dedicated to providing the, to the best of our knowledge, first comprehensive discussion and comparison of structureless methods. Extensive experiments show that approaches that use a higher degree of classical geometric reasoning generally achieve higher pose accuracy. In particular, approaches based on classical absolute or semi-generalized relative pose estimation outperform very recent methods based on pose regression by a wide margin. Compared with state-of-the-art structure-based approaches, the flexibility of structureless methods comes at the cost of (slightly) lower pose accuracy, indicating an interesting direction for future work.

[55] CLIPSE -- a minimalistic CLIP-based image search engine for research

Steve Göring

Main category: cs.CV

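TL;DR: CLIPSE is a self-hosted image search engine, aimed mainly at research, that uses CLIP embeddings for both images and text queries and is designed to be simple and easy to extend.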

Details

Motivation: To provide a simple and extensible image search solution for research. Method: Use CLIP embeddings to process images and text queries within a deliberately simple framework. Result: Performs well on small datasets; large datasets call for distributed processing. Conclusion: CLIPSE suits small-scale research, while large-scale use needs a distributed extension. Abstract: A brief overview of CLIPSE, a self-hosted image search engine with the main application of research, is provided. In general, CLIPSE uses CLIP embeddings to process the images and also the text queries. The overall framework is designed with simplicity to enable easy extension and usage. Two benchmark scenarios are described and evaluated, covering indexing and querying time. It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.
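
A minimal sketch of how embedding-based search of this kind is commonly done, not CLIPSE's actual code. It assumes images and the text query have already been mapped to vectors in a shared space (here tiny hand-written toy vectors stand in for CLIP embeddings) and ranks images by cosine similarity. The names `normalize`, `search`, `toy_index`, and `toy_query` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search(query_embedding, index, top_k=2):
    """Return the names of the top_k indexed items most similar to the query."""
    q = normalize(query_embedding)
    scored = []
    for name, emb in index.items():
        e = normalize(emb)
        score = sum(a * b for a, b in zip(q, e))  # cosine similarity
        scored.append((score, name))
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy stand-ins for image embeddings (assumption: a real system would
# obtain these from a CLIP image encoder at indexing time).
toy_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Toy stand-in for the embedding of a text query.
toy_query = [0.8, 0.2, 0.0]
results = search(toy_query, toy_index)
```

This linear scan costs time proportional to the number of indexed images per query, which fits the abstract's remark that larger datasets may need a distributed setup.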

[56] DiMeR: Disentangled Mesh Reconstruction Model

Lutao Jiang,Jiantao Lin,Kanghao Chen,Wenhang Ge,Xin Yang,Yifan Jiang,Yuanhuiyi Lyu,Xu Zheng,Yingcong Chen

Main category: cs.CV

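TL;DR: DiMeR is a novel disentangled dual-stream model that separates both the input and the framework into geometry and texture parts, significantly improving sparse-view mesh reconstruction.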

Details

Motivation: RGB images can lead to conflicting training objectives and lack clarity for geometry reconstruction, so a more effective way to separate geometry and texture information is needed. Method: DiMeR disentangles the input and the framework into geometry and texture parts; the geometry branch takes normal maps as input, the texture branch takes RGB images, and the mesh extraction algorithm is improved. Result: DiMeR performs strongly on sparse-view reconstruction, single-image-to-3D, and text-to-3D, with over 30% improvement in Chamfer Distance on the GSO and OmniObject3D datasets. Conclusion: By disentangling geometry and texture, DiMeR reduces training difficulty and improves reconstruction performance, offering a new direction for 3D generative models. Abstract: With the advent of large-scale 3D datasets, feed-forward 3D generative models, such as the Large Reconstruction Model (LRM), have gained significant attention and achieved remarkable success. However, we observe that RGB images often lead to conflicting training objectives and lack the necessary clarity for geometry reconstruction. In this paper, we revisit the inductive biases associated with mesh reconstruction and introduce DiMeR, a novel disentangled dual-stream feed-forward model for sparse-view mesh reconstruction. The key idea is to disentangle both the input and framework into geometry and texture parts, thereby reducing the training difficulty for each part according to the Principle of Occam's Razor. Given that normal maps are strictly consistent with geometry and accurately capture surface variations, we utilize normal maps as exclusive input for the geometry branch to reduce the complexity between the network's input and output. Moreover, we improve the mesh extraction algorithm to introduce 3D ground truth supervision. For the texture branch, we use RGB images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust capabilities across various tasks, including sparse-view reconstruction, single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR significantly outperforms previous methods, achieving over 30% improvement in Chamfer Distance on the GSO and OmniObject3D datasets.
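
For readers unfamiliar with the Chamfer Distance metric cited above, here is a small illustrative sketch. It assumes one common convention (the sum of the two directional means of squared nearest-neighbor distances); the abstract does not say which variant or point-sampling scheme the authors use, and the point sets `pred` and `gt` are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_sided(src, dst):
    """Mean squared distance from each point in src to its nearest point in dst."""
    return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance (assumed convention: sum of both directions)."""
    return one_sided(a, b) + one_sided(b, a)


# Made-up point sets sampled from a predicted and a ground-truth surface.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred, gt)
```

Lower values mean the two surfaces are closer. Practical evaluations usually sample thousands of points and use a spatial index instead of this brute-force nearest-neighbor search.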

[57] PICO: Reconstructing 3D People In Contact with Objects

Alpár Cseke,Shashank Tripathi,Sai Kumar Dwivedi,Arjun Lakshmipathy,Agniv Chatterjee,Michael J. Black,Dimitrios Tzionas

Main category: cs.CV

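TL;DR: The paper recovers 3D human-object interaction (HOI) from a single color image by building a new dataset, PICO-db, and an optimization method, PICO-fit, addressing depth ambiguity, occlusion, and object diversity.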

Details

Motivation: Existing methods require known object shapes and contacts and handle only limited object classes, so they do not generalize to natural images and novel object categories. Method: (1) Build the PICO-db dataset, annotating dense 3D contact using vision foundation models and a 2-click projection method; (2) propose PICO-fit, which fits 3D body and object meshes through render-and-compare optimization. Result: PICO-fit handles many object categories that existing methods cannot, significantly improving the generalization of HOI understanding. Conclusion: The method offers a scalable solution for HOI understanding in natural scenes; the data and code are publicly available. Abstract: Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at https://pico.is.tue.mpg.de.

[58] Hierarchical and Multimodal Data for Daily Activity Understanding

Ghazal Kaviani,Yavuz Yarici,Seulgi Kim,Mohit Prabhushankar,Ghassan AlRegib,Mashhour Solh,Ameya Patil

Main category: cs.CV

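TL;DR: DARai is a multimodal, hierarchically annotated dataset for studying human activities in real-world settings, with over 200 hours of data from 50 participants in 10 environments recorded by 20 sensors.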

Details

Motivation: To understand the complexity of human activities and to support research on multimodal sensor fusion and hierarchical activity recognition. Method: Build a multimodal dataset of scripted and unscripted recordings with three levels of hierarchical annotation (L1 activities, L2 actions, L3 procedures), and design experiments to validate its value. Result: Experiments show DARai's potential for recognition, temporal localization, and future action anticipation, and expose the limitations of individual sensors. Conclusion: DARai is a rich resource for human activity research that supports multimodal and hierarchical analysis, with further applications left to future work. Abstract: Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. The overlap and unscripted nature of DARai allow counterfactual activities in the dataset. Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai's multi-sensor and counterfactual activity design setup. 
The code, documentation, and dataset are available at the dedicated DARai website: https://alregib.ece.gatech.edu/software-and-datasets/darai-daily-activity-recordings-for-artificial-intelligence-and-machine-learning/

[59] Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields

Zhuo He,Paul Henderson,Nicolas Pugeault

Main category: cs.CV

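TL;DR: This paper proposes a new method based on generative field theory that improves StyleGAN image editing, achieving disentangled control through the channel-wise style latent space S.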

Details

Motivation: Controlling the features of StyleGAN-generated images is difficult, and existing methods operating in W latent space have limited expressivity and require pre-training. Method: Introduce generative field theory and use the channel-wise style latent space S together with the intrinsic structural feature of CNNs to achieve disentangled control of feature synthesis. Result: The proposed method controls the features of StyleGAN-generated images more directly and avoids the pre-training limitation. Conclusion: Generative field theory and the channel-wise style latent space S offer a more flexible and efficient solution for StyleGAN image editing. Abstract: StyleGAN has demonstrated the ability of GANs to synthesize highly-realistic faces of imaginary people from random noise. One limitation of GAN-based image generation is the difficulty of controlling the features of the generated image, due to the strong entanglement of the low-dimensional latent space. Previous work that aimed to control StyleGAN with image or text prompts modulated sampling in W latent space, which is more expressive than Z latent space. However, W space still has restricted expressivity since it does not control the feature synthesis directly; also the feature embedding in W space requires a pre-training process to reconstruct the style signal, limiting its application. This paper introduces the concept of "generative fields" to explain the hierarchical feature synthesis in StyleGAN, inspired by the receptive fields of convolutional neural networks (CNNs). Additionally, we propose a new image editing pipeline for StyleGAN using generative field theory and the channel-wise style latent space S, utilizing the intrinsic structural feature of CNNs to achieve disentangled control of feature synthesis at synthesis time.

[60] DPMambaIR:All-in-One Image Restoration via Degradation-Aware Prompt State Space Model

Zhanwen Liu,Sai Zhou,Yuchao Dai,Yang Wang,Yisheng An,Xiangmo Zhao

Main category: cs.CV

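TL;DR: DPMambaIR is a novel All-in-One image restoration framework that addresses multi-task conflicts and the loss of high-frequency details through fine-grained degradation modeling and efficient global integration.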

Details

Motivation: Traditional methods design a dedicated model for each degradation type, which is costly and complex; existing All-in-One methods lack fine-grained modeling and struggle to balance multi-task conflicts. Method: Combine a Degradation-Aware Prompt State Space Model (DP-SSM) with a High-Frequency Enhancement Block (HEB) to achieve fine-grained modeling and supplement high-frequency information. Result: On a mixed dataset with seven degradation types, DPMambaIR performs best, with a PSNR of 27.69dB and an SSIM of 0.893. Conclusion: DPMambaIR shows potential and superiority as a unified All-in-One image restoration solution. Abstract: All-in-One image restoration aims to address multiple image degradation problems using a single model, significantly reducing training costs and deployment complexity compared to traditional methods that design dedicated models for each degradation type. Existing approaches typically rely on degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM) and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained modeling of complex degradation information and efficient global integration, while mitigating the loss of high-frequency details caused by task competition. Specifically, the DP-SSM utilizes a pre-trained degradation extractor to capture fine-grained degradation features and dynamically incorporates them into the state space modeling process, enhancing the model's adaptability to diverse degradation types. Concurrently, the HEB supplements high-frequency information, effectively addressing the loss of critical details, such as edges and textures, in multi-task image restoration scenarios. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
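
As a reminder of what the reported PSNR figure measures, the sketch below computes PSNR between a reference and a restored image given as flat lists of pixel values. It assumes 8-bit data (peak value 255) and the standard definition 10·log10(MAX²/MSE); the paper's exact evaluation settings (color space, cropping) are not stated in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 2x2 grayscale images flattened to lists.
clean = [10, 20, 30, 40]
noisy = [12, 18, 33, 39]
value_db = psnr(clean, noisy)
```

Higher is better: on this scale a gain of about 3 dB corresponds to halving the mean squared error.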

[61] EgoCHARM: Resource-Efficient Hierarchical Activity Recognition using an Egocentric IMU Sensor

Akhil Padmanabha,Saravanan Govindarajan,Hwanmun Kim,Sergio Ortiz,Rahul Rajan,Doruk Senkal,Sneha Kadetotad

Main category: cs.CV

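TL;DR: The paper proposes EgoCHARM, a resource-efficient machine learning algorithm that recognizes high-level and low-level activities from a head-mounted IMU and outperforms existing methods.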

Details

Motivation: Existing approaches to head-mounted activity recognition either perform poorly or are resource-intensive; EgoCHARM aims to solve this. Method: A hierarchical semi-supervised learning strategy that trains on high-level activity labels and learns generalizable low-level motion embeddings. Result: F1 scores of 0.826 and 0.855 on 9 high-level and 3 low-level activities respectively, with very few model parameters. Conclusion: EgoCHARM demonstrates the potential of head-mounted IMUs for activity recognition and analyzes their opportunities and limitations. Abstract: Human activity recognition (HAR) on smartglasses has various use cases, including health/fitness tracking and input for context-aware AI assistants. However, current approaches for egocentric activity recognition suffer from low performance or are resource-intensive. In this work, we introduce a resource (memory, compute, power, sample) efficient machine learning algorithm, EgoCHARM, for recognizing both high level and low level activities using a single egocentric (head-mounted) Inertial Measurement Unit (IMU). Our hierarchical algorithm employs a semi-supervised learning strategy, requiring primarily high level activity labels for training, to learn generalizable low level motion embeddings that can be effectively utilized for low level activity recognition. We evaluate our method on 9 high level and 3 low level activities achieving 0.826 and 0.855 F1 scores on high level and low level activity recognition respectively, with just 63k high level and 22k low level model parameters, allowing the low level encoder to be deployed directly on current IMU chips with compute. Lastly, we present results and insights from a sensitivity analysis and highlight the opportunities and limitations of activity recognition using egocentric IMUs.

[62] Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu,Yucheng Han,Peng Xing,Fukun Yin,Rui Wang,Wei Cheng,Jiaqi Liao,Yingming Wang,Honghao Fu,Chunrui Han,Guopeng Li,Yuang Peng,Quan Sun,Jingwei Wu,Yan Cai,Zheng Ge,Ranchen Ming,Lei Xia,Xianfang Zeng,Yibo Zhu,Binxing Jiao,Xiangyu Zhang,Gang Yu,Daxin Jiang

Main category: cs.CV

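TL;DR: The paper presents Step1X-Edit, an open-source image editing model that aims to narrow the performance gap with closed-source models such as GPT-4o and Gemini2 Flash and performs strongly on the new GEdit-Bench benchmark.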

Details

Motivation: Although multimodal models have made remarkable progress in image editing, a large gap remains between open-source algorithms and closed-source models; this paper aims to close it. Method: Use a multimodal LLM to process the reference image and the user's editing instruction, extract a latent embedding, and combine it with a diffusion image decoder to generate the target image; build a data generation pipeline to construct a high-quality training dataset. Result: On GEdit-Bench, Step1X-Edit substantially outperforms existing open-source baselines and approaches the performance of leading closed-source models. Conclusion: Step1X-Edit makes a significant contribution to image editing by narrowing the gap between open-source and closed-source models. Abstract: In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

[63] The Fourth Monocular Depth Estimation Challenge

Anton Obukhov,Matteo Poggi,Fabio Tosi,Ripudaman Singh Arora,Jaime Spencer,Chris Russell,Simon Hadfield,Richard Bowden,Shuaihang Wang,Zhenxin Ma,Weijie Chen,Baobei Xu,Fengyu Sun,Di Xie,Jiang Zhu,Mykola Lavreniuk,Haining Guan,Qun Wu,Yupei Zeng,Chao Lu,Huanran Wang,Guangyuan Zhou,Haotian Zhang,Jianxiong Wang,Qiang Rao,Chunjie Wang,Xiao Liu,Zhiqiang Lou,Hualie Jiang,Yihao Chen,Rui Xu,Minglang Tan,Zihan Qin,Yifan Mao,Jiayang Liu,Jialei Xu,Yifan Yang,Wenbo Zhao,Junjun Jiang,Xianming Liu,Mingshuai Zhao,Anlong Ming,Wu Chen,Feng Xue,Mengying Yu,Shida Gao,Xiangfeng Wang,Gbenga Omotara,Ramy Farag,Jacket Demby,Seyed Mohamad Ali Tousi,Guilherme N DeSouza,Tuan-Anh Yang,Minh-Quang Nguyen,Thien-Phuc Tran,Albert Luginov,Muhammad Shahzad

Main category: cs.CV

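TL;DR: The fourth Monocular Depth Estimation Challenge (MDEC) focuses on zero-shot generalization to the SYNS-Patches benchmark, with a revised evaluation protocol and updated baselines; 24 submissions outperformed the baselines, 10 of which included a method report, and the winners raised the 3D F-Score from 22.58% to 23.05%.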

Details

Motivation: The challenge aims to advance zero-shot generalization of monocular depth estimation in complex natural and indoor environments. Method: Revise the evaluation protocol to use least-squares alignment with two degrees of freedom, supporting disparity and affine-invariant predictions, and update the baselines to include Depth Anything v2 and Marigold. Result: 24 submissions outperformed the baselines, 10 of which provided a method description; the winning method raised the 3D F-Score from 22.58% to 23.05%. Conclusion: The challenge highlights the strength of affine-invariant prediction methods and advances monocular depth estimation. Abstract: This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.
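
The two-degree-of-freedom least-squares alignment mentioned above can be illustrated as fitting a scale and a shift that map a prediction onto the ground truth before metrics are computed. The sketch below uses the closed-form solution of ordinary least squares for one variable. It is an illustration under assumptions: the challenge's actual protocol (which space the fit is done in, which pixels are masked) is not described in the abstract, and `align_scale_shift` is a hypothetical name.

```python
def align_scale_shift(pred, target):
    """Find (scale, shift) minimizing sum((scale * p + shift - t) ** 2).

    Assumes pred is not constant (otherwise the scale is undefined).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov_pt / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


# Made-up affine-invariant prediction and a ground truth that differs
# from it by an unknown scale and shift.
pred = [0.1, 0.2, 0.4, 0.8]
target = [2.0 * p + 0.5 for p in pred]
scale, shift = align_scale_shift(pred, target)
aligned = [scale * p + shift for p in pred]
```

After alignment, `aligned` can be compared to `target` with any depth metric, so methods that predict depth only up to an affine transform are evaluated on equal footing.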

[64] Dynamic Camera Poses and Where to Find Them

Chris Rockwell,Joseph Tung,Tsung-Yi Lin,Ming-Yu Liu,David F. Fouhey,Chen-Hsuan Lin

Main category: cs.CV

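TL;DR: DynPose-100K is a large-scale dataset of dynamic Internet videos annotated with camera poses, addressing the difficulty existing methods have with pose estimation on dynamic videos.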

Details

Motivation: Camera pose annotations for dynamic Internet videos are critical for fields such as video generation and simulation, but existing datasets fall short. Method: Filter videos with a combination of task-specific and generalist models, and estimate poses using the latest techniques in point tracking, dynamic masking, and structure-from-motion. Result: The DynPose-100K dataset is large-scale and diverse, and its pose estimation improves over existing methods. Conclusion: The dataset opens new research opportunities for downstream applications. Abstract: Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.

[65] Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Xu Ma,Peize Sun,Haoyu Ma,Hao Tang,Chih-Yao Ma,Jialiang Wang,Kunpeng Li,Xiaoliang Dai,Yujun Shi,Xuan Ju,Yushi Hu,Artsiom Sanakoyeu,Felix Juefei-Xu,Ji Hou,Junjiao Tian,Tao Xu,Tingbo Hou,Yen-Cheng Liu,Zecheng He,Zijian He,Matt Feiszli,Peizhao Zhang,Peter Vajda,Sam Tsai,Yun Fu

Main category: cs.CV

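TL;DR: Token-Shuffle reduces the number of image tokens in the Transformer, improving the efficiency and resolution of autoregressive image synthesis and enabling high-quality generation at 2048x2048 for the first time.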

Details

Motivation: Autoregressive models are inefficient for image synthesis because of the large number of tokens required; Token-Shuffle aims to improve training and inference efficiency as well as resolution. Method: Introduce token-shuffle and token-unshuffle operations that merge and then restore image tokens, reducing the number of input tokens while keeping training and inference efficient. Result: On GenAI-benchmark, the 2.7B model scores 0.77 on hard prompts, outperforming LlamaGen and LDM, and supports generation at 2048x2048 resolution. Conclusion: Token-Shuffle provides a foundational design for efficient high-resolution image generation and shows the potential of autoregressive models for image synthesis. Abstract: Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. 
Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
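
The two operations described in the abstract, merging spatially local tokens along the channel dimension and later restoring the spatial layout, can be sketched as a pure rearrangement, similar in spirit to pixel-unshuffle and pixel-shuffle. This is an illustrative sketch only: any learned layers the actual method applies around the rearrangement are not described in the abstract and are omitted, and the function names and toy grid are hypothetical.

```python
def token_shuffle(grid, s):
    """Merge each s x s window of tokens into one token by concatenating channels.

    grid is an H x W nested list of tokens, each token a list of C channels.
    Assumes H and W are divisible by s. Returns an (H/s) x (W/s) grid whose
    tokens have s * s * C channels.
    """
    h, w = len(grid), len(grid[0])
    merged = []
    for i in range(0, h, s):
        row = []
        for j in range(0, w, s):
            token = []
            for di in range(s):
                for dj in range(s):
                    token.extend(grid[i + di][j + dj])
            row.append(token)
        merged.append(row)
    return merged


def token_unshuffle(merged, s, c):
    """Inverse of token_shuffle: split each merged token back into s x s tokens."""
    h, w = len(merged) * s, len(merged[0]) * s
    grid = [[None] * w for _ in range(h)]
    for i, row in enumerate(merged):
        for j, token in enumerate(row):
            for k in range(s * s):
                di, dj = divmod(k, s)
                grid[i * s + di][j * s + dj] = token[k * c:(k + 1) * c]
    return grid


# Toy 4x4 grid of tokens with 2 channels each (channel values encode position).
grid = [[[r, col] for col in range(4)] for r in range(4)]
merged = token_shuffle(grid, 2)          # 2x2 grid, 8 channels per token
restored = token_unshuffle(merged, 2, 2)  # back to 4x4 grid, 2 channels
```

With a window size of 2, the sequence the Transformer processes shrinks from 16 tokens to 4, which is the source of the efficiency gain the abstract describes.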

[66] LiDPM: Rethinking Point Diffusion for Lidar Scene Completion

Tetiana Martyniuk,Gilles Puy,Alexandre Boulch,Renaud Marlet,Raoul de Charette

Main category: cs.CV

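TL;DR: The paper proposes LiDPM, showing that with a well-chosen starting point, scene-level completion does not need the local diffusion approximations and outperforms existing methods.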

Details

Motivation: To address the challenge of training diffusion models directly on lidar points at the scale of outdoor scenes, in particular the difficulty of generating fine-grained details from white noise. Method: LiDPM uses a well-chosen starting point, avoids the local diffusion approximations, and applies a vanilla DDPM directly. Result: Better scene completion results on the SemanticKITTI dataset. Conclusion: With a well-chosen starting point, LiDPM shows that a vanilla DDPM is effective for scene-level tasks without local diffusion approximations. Abstract: Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. It contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is https://astra-vision.github.io/LiDPM .

cs.GR

[67] ePBR: Extended PBR Materials in Image Synthesis

Yu Guo,Zhiqiang Lao,Xiyun Song,Yubin Zhou,Zongfang Lin,Heather Yu

Main category: cs.GR

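TL;DR: Proposes Extended PBR (ePBR) materials that combine reflection and transmission properties to synthesize transparent materials such as glass and windows.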

Details

Motivation: Existing Physically Based Rendering (PBR) materials struggle with high-specular and transparent surfaces, while learning-based methods lack physical consistency. Method: Extend intrinsic image representations and propose an explicit intrinsic compositing framework that incorporates both reflection and transmission properties. Result: Enables controllable synthesis of transparent materials with precise editing. Conclusion: ePBR materials offer an efficient and controllable solution for synthesizing transparent materials. Abstract: Realistic indoor or outdoor image synthesis is a core challenge in computer vision and graphics. The learning-based approach is easy to use but lacks physical consistency, while traditional Physically Based Rendering (PBR) offers high realism but is computationally expensive. Intrinsic image representation offers a well-balanced trade-off, decomposing images into fundamental components (intrinsic channels) such as geometry, materials, and illumination for controllable synthesis. However, existing PBR materials struggle with complex surface models, particularly high-specular and transparent surfaces. In this work, we extend intrinsic image representations to incorporate both reflection and transmission properties, enabling the synthesis of transparent materials such as glass and windows. We propose an explicit intrinsic compositing framework that provides deterministic, interpretable image synthesis. With the Extended PBR (ePBR) Materials, we can effectively edit the materials with precise controls.

[68] Bolt: Clothing Virtual Characters at Scale

Jonathan Leaf,David Sebastian Minor,Gilles Daviet,Nuttapong Chentanez,Greg Klar,Ed Quigley

Main category: cs.GR

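TL;DR: Bolt automatically fits garments from a source body to new body shapes through three stages (transfer, drape, rig), addressing the complexity of clothing virtual characters.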

Details

Motivation: Clothing virtual characters is time-consuming and usually manual, and the problem becomes especially complex when body shapes vary widely. Method: Bolt uses three stages: (1) transfer the 3D mesh and optimize the 2D sewing pattern for the new body shape; (2) simulate draping to untangle the garments; (3) rig the garments to the new character. Result: The system fits garments fully automatically, with no human intervention, making it suitable for use at scale. Conclusion: Bolt provides an efficient, automated solution for clothing virtual characters in applications such as games and animation. Abstract: Clothing virtual characters is a time-consuming and often manual process. Outfits can be composed of multiple garments, and each garment must be fitted to the unique shape of a character. Since characters can vary widely in size and shape, fitting outfits to many characters is a combinatorially large problem. We present Bolt, a system designed to take outfits originally authored on a source body and fit them to new body shapes via a three-stage transfer, drape, and rig process. First, our new garment transfer method transforms each garment's 3D mesh positions to the new character, then optimizes the garment's 2D sewing pattern while maintaining key features of the original seams and boundaries. Second, our system simulates the transferred garments to progressively drape and untangle each garment in the outfit. Finally, the garments are rigged to the new character. This entire process is automatic, making it feasible to clothe characters at scale with no human intervention. Clothed characters are then ready for immediate use in applications such as gaming, animation, synthetic generation, and more.

[69] CasualHDRSplat: Robust High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos

Shucheng Gong,Lingzhe Zhao,Wenpu Li,Hong Xie,Yin Zhang,Shiyu Zhao,Peidong Liu

Main category: cs.GR

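TL;DR: Proposes CasualHDRSplat, a method for efficiently reconstructing 3D HDR scenes from casually captured videos, removing the reliance of earlier methods on fixed exposures and multi-view images.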

Details

Motivation: Existing methods rely on low dynamic range (LDR) images or on multi-view images captured with fixed exposures, which limits the scene detail captured and complicates data acquisition. Method: A unified differentiable physical imaging model that uses a continuous-time trajectory constraint to jointly optimize exposure time, camera response function, camera poses, and the 3D HDR scene. Result: Experiments show the method outperforms existing methods in robustness and rendering quality. Conclusion: CasualHDRSplat offers a flexible and efficient approach to 3D HDR reconstruction in practical settings. Abstract: Recently, photo-realistic novel view synthesis methods based on multi-view images, such as neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS), have garnered widespread attention due to their superior performance. However, most works rely on low dynamic range (LDR) images, which limits the capture of richer scene details. Some prior works have focused on high dynamic range (HDR) scene reconstruction, but they typically require capturing multi-view sharp images with different exposure times at camera positions that stay fixed during exposure, which is time-consuming and challenging in practice. For more flexible data acquisition, we propose a one-stage method, CasualHDRSplat, to easily and robustly reconstruct the 3D HDR scene from casually captured videos with auto-exposure enabled, even in the presence of severe motion blur and varying unknown exposure time. CasualHDRSplat contains a unified differentiable physical imaging model which first applies a continuous-time trajectory constraint to the imaging process so that we can jointly optimize exposure time, camera response function (CRF), camera poses, and the sharp 3D HDR scene. Extensive experiments demonstrate that our approach outperforms existing methods in terms of robustness and rendering quality. Our source code will be available at https://github.com/WU-CVGL/CasualHDRSplat

cs.CL

[70] Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity

Cong Qi,Hanzhang Fang,Tianxing Hu,Siqi Jiang,Wei Zhi

Main category: cs.CL

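TL;DR: GeneMamba is a foundation model for single-cell transcriptomics built on state space modeling; its Bi-Mamba architecture captures bidirectional gene context with linear-time complexity and outperforms traditional Transformer approaches.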

Details

Motivation: To address the computational challenges of scRNA-seq data, such as high dimensionality, sparsity, and batch effects, while overcoming the quadratic complexity and weak long-range dependency handling of Transformer models. Method: Use the Bi-Mamba architecture with biologically informed objectives (such as pathway-aware contrastive loss and rank-based gene encoding), pretrained on nearly 30 million cells. Result: Strong results on multi-batch integration, cell type annotation, and gene-gene correlation, with high performance, interpretability, and robustness. Conclusion: GeneMamba is a practical and powerful alternative to Transformer methods and advances tools for large-scale single-cell data analysis. Abstract: Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.

[71] Tokenization Matters: Improving Zero-Shot NER for Indic Languages

Priyaranjan Pattnayak,Hitesh Laxmichand Patel,Amit Agarwal

Main category: cs.CL

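TL;DR: Compares BPE, SentencePiece, and character-level tokenization for NER in low-resource Indic languages and finds that SentencePiece performs best in zero-shot cross-lingual settings.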

Details

Motivation: The suitability of BPE for NER in low-resource Indic languages is underexplored; the work looks for a better tokenization method. Method: Systematically compares BPE, SentencePiece, and character-level tokenization strategies, evaluating their linguistic properties and downstream task performance. Result: SentencePiece outperforms BPE in zero-shot cross-lingual settings, especially for morphologically complex languages. Conclusion: SentencePiece is the more effective tokenization strategy for NER in low-resource Indic languages. Abstract: Tokenization is a critical component of Natural Language Processing (NLP), especially for low-resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low-resource Indic languages remains underexplored due to its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and character-level tokenization strategies using IndicBERT for NER tasks in low-resource Indic languages like Assamese, Bengali, Marathi, and Odia, as well as extremely low-resource Indic languages like Santali, Manipuri, and Sindhi. We assess both intrinsic linguistic properties (tokenization efficiency, out-of-vocabulary (OOV) rates, and morphological preservation) and extrinsic downstream performance, including fine-tuning and zero-shot cross-lingual transfer. Our experiments show that SentencePiece consistently performs better than BPE for NER in low-resource Indic languages, particularly in zero-shot cross-lingual settings, as it better preserves entity consistency. While BPE provides the most compact tokenization form, it generalizes poorly: it misclassifies or even fails to recognize entity labels when tested on unseen languages. In contrast, SentencePiece better preserves linguistic structure, which benefits extremely low-resource and morphologically rich Indic languages, such as Santali and Manipuri, yielding superior entity recognition as well as high generalization across scripts, such as Sindhi, which is written in Arabic script. The results point to SentencePiece as the more effective tokenization strategy for NER within multilingual and low-resource Indic NLP applications.

[72] Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

Luca Moroni,Giovanni Puccetti,Pere-Lluis Huguet Cabot,Andrei Stefan Bejgu,Edoardo Barba,Alessio Miaschi,Felice Dell'Orletta,Andrea Esuli,Roberto Navigli

Main category: cs.CL

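TL;DR: Proposes SAVA, a new method for adapting English large language models (LLMs) to Italian; vocabulary substitution substantially reduces token fertility and the number of model parameters.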

Details

Motivation: Although existing LLMs support multiple languages, they are inefficient on non-English languages, with high token fertility and slow inference. Method: Proposes Semantic Alignment Vocabulary Adaptation (SAVA), which uses neural mapping for vocabulary substitution, and adapts Mistral-7b-v0.1 and Llama-3.1-8B. Result: SAVA performs competitively on multiple downstream tasks; token fertility for Mistral-7b-v0.1 drops by 25%, and Llama-3.1-8B loses 1 billion parameters. Conclusion: With vocabulary adaptation and a limited stage of continual training, the adapted models recover their performance on Italian and do well on multiple-choice and generative tasks. Abstract: The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
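
Token fertility is usually understood as the average number of tokens a tokenizer produces per word. The minimal sketch below computes it under that assumption; the fixed-size `chunk_tokenizer` is a hypothetical stand-in for a real subword tokenizer, and the Italian sample text is made up for illustration.

```python
from typing import Callable, List


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: cut each word into fixed-size character chunks."""
    def tokenize(word: str) -> List[str]:
        return [word[i:i + size] for i in range(0, len(word), size)]
    return tokenize


text = "la casa rossa"
fine = fertility(text, chunk_tokenizer(2))    # 6 tokens / 3 words = 2.0
coarse = fertility(text, chunk_tokenizer(5))  # 3 tokens / 3 words = 1.0
relative_reduction = (fine - coarse) / fine   # 0.5
```

Lower fertility means fewer tokens for the same text, which shortens sequences and therefore reduces inference cost.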

[73] Do Words Reflect Beliefs? Evaluating Belief Depth in Large Language Models

Shariar Kabir,Kevin Esterling,Yue Dong

Main category: cs.CL

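TL;DR: Proposes a new framework for evaluating the depth of political beliefs in large language models (LLMs) by analyzing argumentative consistency and uncertainty quantification; it finds that LLM belief stability is topic-specific rather than a uniform ideological stance.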

Details

Motivation: Prior work typically classifies the political leaning of LLMs as simply left or right, without examining whether the responses reflect genuine beliefs or only surface-level alignment with training data. Method: Proposes a new framework that evaluates the belief stability of 12 LLMs on 19 economic policies, testing consistency with supportive and opposing arguments. Result: LLMs show topic-specific belief stability; up to 95% of left-leaning and 89% of right-leaning models' responses remain consistent, and semantic entropy separates surface-level alignment from genuine belief (AUROC=0.78). Conclusion: LLMs do not hold stable, human-like ideologies, so real-world applications need topic-specific reliability assessments. Abstract: Large Language Models (LLMs) are increasingly shaping political discourse, yet their responses often display inconsistency when subjected to scrutiny. While prior research has primarily categorized LLM outputs as left- or right-leaning to assess their political stances, a critical question remains: Do these responses reflect genuine internal beliefs or merely surface-level alignment with training data? To address this, we propose a novel framework for evaluating belief depth by analyzing (1) argumentative consistency and (2) uncertainty quantification. We evaluate 12 LLMs on 19 economic policies from the Political Compass Test, challenging their belief stability with both supportive and opposing arguments. Our analysis reveals that LLMs exhibit topic-specific belief stability rather than a uniform ideological stance. Notably, up to 95% of left-leaning models' responses and 89% of right-leaning models' responses remain consistent under the challenge, enabling semantic entropy to achieve high accuracy (AUROC=0.78), effectively distinguishing surface-level alignment from genuine belief. These findings call into question the assumption that LLMs maintain stable, human-like political ideologies, emphasizing the importance of conducting topic-specific reliability assessments for real-world applications.
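
To make the two quantities in the result concrete, here is a small sketch of how semantic entropy and AUROC can be computed. It assumes that sampled responses have already been grouped into meaning clusters (the clustering step, which is the hard part, is not shown), and the scores fed to `auroc` are invented for illustration; none of this reproduces the paper's pipeline.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the distribution of responses over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count as half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Four sampled answers that all mean the same thing: no uncertainty.
stable = semantic_entropy([0, 0, 0, 0])
# Answers split evenly between two meanings: entropy of log(2).
split = semantic_entropy([0, 0, 1, 1])
# Hypothetical scores for two classes of examples.
score = auroc([0.9, 0.8, 0.4], [0.7, 0.3])  # 5 of 6 pairs ranked correctly
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.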

[74] Agree to Disagree? A Meta-Evaluation of LLM Misgendering

Arjun Subramonian,Vagrant Gautam,Preethi Seshadri,Dietrich Klakow,Kai-Wei Chang,Yizhou Sun

Main category: cs.CL

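TL;DR: Studies the convergent validity of methods for evaluating LLM misgendering, finds substantial disagreement between methods, and offers recommendations for improvement.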

Details

Motivation: Examine whether existing methods for evaluating LLM misgendering have convergent validity, that is, whether their results agree. Method: A systematic meta-evaluation that transforms three datasets to support both probability-based and generation-based evaluation and automatically evaluates 6 models. Result: The methods disagree at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances, and automatic evaluations differ fundamentally from human evaluations. Conclusion: Future evaluations need better methods, and the common assumption in LLM evaluation that different methods agree is called into question. Abstract: Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.
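
An instance-level conflict rate like the one reported above can be read as the share of evaluation instances on which two methods reach different verdicts. The sketch below assumes each method yields one boolean verdict per instance (True meaning the model misgendered); the verdict lists are invented, and the paper's exact matching procedure may differ.

```python
from typing import Sequence


def disagreement_rate(method_a: Sequence[bool], method_b: Sequence[bool]) -> float:
    """Fraction of instances on which two evaluation methods give different verdicts."""
    if len(method_a) != len(method_b):
        raise ValueError("both methods must score the same instances")
    if not method_a:
        raise ValueError("no instances to compare")
    conflicts = sum(a != b for a, b in zip(method_a, method_b))
    return conflicts / len(method_a)


# Hypothetical verdicts for five instances from each evaluation style.
probability_based = [True, False, False, True, False]
generation_based = [True, True, False, True, False]
rate = disagreement_rate(probability_based, generation_based)  # 1 conflict in 5 = 0.2
```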

[75] How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study

Rendi Chevi,Kentaro Inui,Thamar Solorio,Alham Fikri Aji

Main category: cs.CL

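TL;DR: The study finds that an LLM's language style (for example, how authoritative, certain, or well-articulated it sounds) significantly influences user preferences, but the effect varies with user population and individual traits.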

Details

Motivation: Explore how an LLM's language style influences user preferences and whether the influence depends on users' individual traits. Method: Exploratory and experimental user studies analyzing the relationship between language style and user preference. Result: Language style does influence user preferences, but how it does so varies across user populations and individual traits. Conclusion: Future work needs more diverse and larger samples to analyze the relationship among language style, individual traits, and preferences more fully. Abstract: What makes an interaction with an LLM preferable to the user? While it is intuitive to assume that the accuracy of information in the LLM's responses would be one of the influential variables, recent studies have found that inaccurate LLM responses can still be preferred when they are perceived to be more authoritative, certain, well-articulated, or simply verbose. These variables interestingly fall under the broader category of language style, implying that the style of the LLM's responses might meaningfully influence users' preferences. This hypothesized dynamic could have double-edged consequences: enhancing the overall user experience while simultaneously increasing users' susceptibility to risks such as LLM misinformation or hallucinations. In this short paper, we present our preliminary studies exploring this subject. Through a series of exploratory and experimental user studies, we found that the LLM's language style does indeed influence users' preferences, but how and which language styles influence the preference varied across different user populations and, more interestingly, was moderated by the users' own individual traits. As a preliminary work, the findings in our studies should be interpreted with caution, particularly given the limitations in our samples, which still need wider demographic diversity and larger sample sizes. Our future directions will first aim to address these limitations, which would enable a more comprehensive joint effect analysis between the language style, individual traits, and preferences, and further investigate the potential causal relationship between and beyond these variables.

[76] Co-CoT: A Prompt-Based Framework for Collaborative Chain-of-Thought Reasoning

Seunghyun Yoo

Main category: cs.CL

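TL;DR: Proposes an Interactive Chain-of-Thought (Interactive CoT) framework that improves AI explainability and responsible use through a transparent, modular, and user-editable reasoning process.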

Details

Motivation: The flood of short-form content and the rapid adoption of AI reduce opportunities for deep thinking, weakening users' critical thinking and their understanding of AI outputs. Method: Designs an interactive chain-of-thought framework that decomposes reasoning into blocks that can be inspected, modified, and re-executed, and integrates a lightweight edit-adaptation mechanism. Result: The framework increases users' cognitive engagement and supports diverse cognitive styles and intentions while ensuring ethical transparency and privacy protection. Conclusion: The framework provides design principles and an architecture for promoting critical engagement, responsible interaction, and inclusive adaptation in AI systems. Abstract: Due to the proliferation of short-form content and the rapid adoption of AI, opportunities for deep, reflective thinking have significantly diminished, undermining users' critical thinking and reducing engagement with the reasoning behind AI-generated outputs. To address this issue, we propose an Interactive Chain-of-Thought (CoT) Framework that enhances human-centered explainability and responsible AI usage by making the model's inference process transparent, modular, and user-editable. The framework decomposes reasoning into clearly defined blocks that users can inspect, modify, and re-execute, encouraging active cognitive engagement rather than passive consumption. It further integrates a lightweight edit-adaptation mechanism inspired by preference learning, allowing the system to align with diverse cognitive styles and user intentions. Ethical transparency is ensured through explicit metadata disclosure, built-in bias checkpoint functionality, and privacy-preserving safeguards. This work outlines the design principles and architecture necessary to promote critical engagement, responsible interaction, and inclusive adaptation in AI systems aimed at addressing complex societal challenges.

[77] The Rise of Small Language Models in Healthcare: A Comprehensive Survey

Muskan Garg,Shaina Raza,Shebuti Rayana,Xingyi Liu,Sunghwan Sohn

Main category: cs.CL

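TL;DR: Surveys small language models (SLMs) in healthcare, proposes a taxonomic framework, and shows that SLMs perform efficiently in resource-constrained environments.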

Details

Motivation: As large language models (LLMs) advance in medical applications, data privacy and resource constraints become more pressing; SLMs offer a scalable and clinically viable solution for next-generation healthcare informatics. Method: Uses a taxonomic framework to analyze SLMs along three dimensions (NLP tasks, stakeholder roles, and the continuum of care) and discusses model building, optimization, and compression techniques. Result: Compiles experimental results for SLMs on healthcare NLP tasks, highlighting their transformative potential. Conclusion: The survey gives healthcare professionals a comprehensive overview and resources to support future research and development of SLMs in healthcare. Abstract: Despite substantial progress in healthcare applications driven by large language models (LLMs), concerns around data privacy and limited resources are growing; small language models (SLMs) offer a scalable and clinically viable solution for efficient performance in resource-constrained environments for next-generation healthcare informatics. Our comprehensive survey presents a taxonomic framework to identify and categorize them for healthcare professionals and informaticians. The timeline of healthcare SLM contributions establishes a foundational framework for analyzing models across three dimensions: NLP tasks, stakeholder roles, and the continuum of care. We present a taxonomic framework to identify the architectural foundations for building models from scratch; adapting SLMs to clinical precision through prompting, instruction fine-tuning, and reasoning; and accessibility and sustainability through compression techniques. Our primary objective is to offer a comprehensive survey for healthcare professionals, introducing recent innovations in model optimization and equipping them with curated resources to support future research and development in the field. Aiming to showcase the groundbreaking advancements in SLMs for healthcare, we present a comprehensive compilation of experimental results across widely studied NLP tasks in healthcare to highlight the transformative potential of SLMs in healthcare. The updated repository is available on GitHub.

[78] Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control

Hannah Cyberey,David Evans

Main category: cs.CL

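TL;DR: Studies the "censorship" mechanism in large language models (LLMs), proposes a method for detecting and controlling the level of censorship in model outputs, and reveals an additional dimension, "thought suppression".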

Details

Motivation: Understand how LLMs implement "censorship" through tuning that makes them refuse harmful requests and produce responses aligned with their controllers' preferences. Method: Applies representation engineering to open-weights safety-tuned models, finds a refusal-compliance vector that detects and controls the level of censorship, and analyzes "thought suppression" in reasoning LLMs. Result: Finds a vector that suppresses the model's reasoning process; applying negative multiples of it removes censorship. Conclusion: The study reveals the multidimensional mechanism of censorship in LLMs and provides tools for controlling it. Abstract: Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector.
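
The abstract does not say how the refusal-compliance vector is computed. A common technique in representation engineering, used here purely as an assumed stand-in, is the difference of mean activations between two sets of prompts; the vector can then score a hidden state by projection (detection) or be added to it with a signed multiplier (control). The two-dimensional activations below are made-up numbers, and a real setup would read and modify hidden states inside a model.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a set of activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def difference_of_means(refusal_acts: Sequence[Sequence[float]],
                        compliance_acts: Sequence[Sequence[float]]) -> Vector:
    """Assumed steering direction: mean refusal activation minus mean compliance activation."""
    refusal_mean = mean_vector(refusal_acts)
    compliance_mean = mean_vector(compliance_acts)
    return [r - c for r, c in zip(refusal_mean, compliance_mean)]


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Detection score: length of the hidden state's component along the direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Control: add alpha times the direction; a negative alpha pushes away from refusal."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical 2-D activations collected on refused and complied prompts.
refusal_acts = [[3.0, 1.0], [5.0, 1.0]]
compliance_acts = [[1.0, 1.0], [1.0, 1.0]]
direction = difference_of_means(refusal_acts, compliance_acts)  # [3.0, 0.0]

hidden = [4.0, 1.0]
before = projection(hidden, direction)          # 4.0
steered = steer(hidden, direction, alpha=-1.0)  # [1.0, 1.0]
after = projection(steered, direction)          # 1.0
```

In this toy case a multiplier of -1 moves the hidden state from the refusal mean to the compliance mean, which mirrors the idea of applying negative multiples of the vector.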

[79] MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation

Chanhee Park,Hyeonseok Moon,Chanjun Park,Heuiseok Lim

Main category: cs.CL

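TL;DR: MIRAGE is a question answering dataset designed for evaluating RAG systems, with 7,560 instances and 37,800 retrieval entries, and it introduces new metrics for measuring RAG adaptability.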

Details

Motivation: Current RAG evaluation lacks detailed, component-specific benchmarks, which limits in-depth analysis of system performance. Method: Presents the MIRAGE dataset of curated instances and retrieval entries, and designs new evaluation metrics (noise vulnerability, context acceptability, and others). Result: Experiments reveal the optimal alignment of model pairs in RAG systems and their internal dynamics. Conclusion: The MIRAGE dataset and evaluation code are publicly available and support diverse research needs. Abstract: Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at https://github.com/nlpai-lab/MIRAGE.

[80] Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Minju Seo,Jinheon Baek,Seongyun Lee,Sung Ju Hwang

Main category: cs.CL

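TL;DR: PaperCoder is a multi-agent LLM framework that turns machine learning papers into functional code repositories through three stages (planning, analysis, and generation), and experiments show that it produces implementations efficiently and with high quality.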

Details

Motivation: Code for machine learning research is often missing, which makes reproduction and follow-up work difficult. PaperCoder aims to address this by exploiting the strengths of LLMs in understanding scientific documents and generating high-quality code. Method: PaperCoder has three stages: planning (designing the system architecture and configuration files), analysis (interpreting implementation details), and generation (producing modular code). Each stage is carried out by specially designed collaborating agents. Result: Experiments show that PaperCoder generates high-quality implementations faithful to the paper and substantially outperforms baselines on the PaperBench benchmark. Conclusion: With its multi-agent LLM framework, PaperCoder effectively addresses missing code for machine learning papers and shows potential for automated code generation. Abstract: Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.

[81] A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation

Yangxinyu Xie,Bowen Jiang,Tanwi Mallick,Joshua David Bergerson,John K. Hutchison,Duane R. Verner,Jordan Branham,M. Ross Alexander,Robert B. Ross,Yan Feng,Leslie-Anne Levy,Weijie Su,Camillo J. Taylor

Main category: cs.CL

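TL;DR: Proposes a retrieval-augmented generation (RAG)-based multi-agent LLM system for analysis and decision support on natural hazards and extreme weather events, validated with WildfireGPT as an example.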

Details

Motivation: Large language models (LLMs) lack context specificity when providing general information, and fall short especially in areas requiring specialized knowledge such as natural hazards. Method: Uses a RAG framework to integrate natural hazard data, observational datasets, and scientific literature, building the multi-agent LLM system WildfireGPT to deliver user-centered, tailored risk analysis. Result: Across 10 expert-led case studies, WildfireGPT significantly outperforms existing LLM-based decision support solutions. Conclusion: RAG-enhanced multi-agent LLM systems can improve the accuracy and contextual relevance of decision support for natural hazards. Abstract: Large language models (LLMs) are a transformational capability at the frontier of artificial intelligence and machine learning that can support decision-makers in addressing pressing societal challenges such as extreme natural hazard events. As generalized models, LLMs often struggle to provide context-specific information, particularly in areas requiring specialized knowledge. In this work, we propose a retrieval-augmented generation (RAG)-based multi-agent LLM system to support analysis and decision-making in the context of natural hazards and extreme weather events. As a proof of concept, we present WildfireGPT, a specialized system focused on wildfire hazards. The architecture employs a user-centered, multi-agent design to deliver tailored risk insights across diverse stakeholder groups. By integrating natural hazard and extreme weather projection data, observational datasets, and scientific literature through a RAG framework, the system ensures both the accuracy and contextual relevance of the information it provides. Evaluation across ten expert-led case studies demonstrates that WildfireGPT significantly outperforms existing LLM-based solutions for decision support.

[82] Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?

Kaidong Feng,Zhu Sun,Jie Yang,Hui Fang,Xinghua Qu,Wenyuan Liu

Main category: cs.CL

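TL;DR: Examines knowledge distillation (KD) for bundle generation with large language models (LLMs), aiming to cut computational cost while preserving performance.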

Details

Motivation: Deploying large-scale LLMs incurs high computational cost, and knowledge distillation offers an efficient solution. Method: Proposes a comprehensive KD framework that progressively extracts knowledge, captures varying quantities of knowledge, and exploits complementary LLM adaptation techniques. Result: Experiments show that knowledge format, quantity, and utilization method jointly shape bundle generation performance. Conclusion: Knowledge distillation has significant potential for efficient yet effective LLM-based bundle generation. Abstract: LLMs are increasingly explored for bundle generation, thanks to their reasoning capabilities and knowledge. However, deploying large-scale LLMs introduces significant efficiency challenges, primarily high computational costs during fine-tuning and inference due to their massive parameterization. Knowledge distillation (KD) offers a promising solution, transferring expertise from large teacher models to compact student models. This study systematically investigates knowledge distillation approaches for bundle generation, aiming to minimize computational demands while preserving performance. We explore three critical research questions: (1) how does the format of KD impact bundle generation performance? (2) to what extent does the quantity of distilled knowledge influence performance? and (3) how do different ways of utilizing the distilled knowledge affect performance? We propose a comprehensive KD framework that (i) progressively extracts knowledge (patterns, rules, deep thoughts); (ii) captures varying quantities of distilled knowledge through different strategies; and (iii) exploits complementary LLM adaptation techniques (in-context learning, supervised fine-tuning, combination) to leverage distilled knowledge in small student models for domain-specific adaptation and enhanced efficiency. Extensive experiments provide valuable insights into how knowledge format, quantity, and utilization methodologies collectively shape LLM-based bundle generation performance, exhibiting KD's significant potential for more efficient yet effective LLM-based bundle generation.

[83] Crisp: Cognitive Restructuring of Negative Thoughts through Multi-turn Supportive Dialogues

Jinfeng Zhou,Yuxuan Chen,Jianing Yin,Yongkang Huang,Yihan Shi,Xikun Zhang,Libiao Peng,Rongsheng Zhang,Tangjie Lv,Zhipeng Hu,Hongning Wang,Minlie Huang

Main category: cs.CL

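TL;DR: CRDial is a new framework that performs cognitive restructuring (CR) through multi-turn dialogues, addressing the shortcomings of existing methods; it yields Crisp, a high-quality bilingual dataset, and Crispers, dialogue models trained on Crisp.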

Details

Motivation: Clinician shortages and mental health stigma motivate human-LLM interactive psychotherapy tools, but existing methods do not effectively support the psychotherapeutic process of CR. Method: Proposes the CRDial framework, which designs multi-turn dialogue stages (identifying and restructuring negative thoughts), integrates supportive conversation strategies, and adopts a multi-channel loop mechanism for iterative CR. Result: Produces the Crisp dataset and the Crispers dialogue models, which perform best across several evaluations. Conclusion: CRDial and Crispers provide effective tools for cognitive restructuring and fill a gap left by existing methods. Abstract: Cognitive Restructuring (CR) is a psychotherapeutic process aimed at identifying and restructuring an individual's negative thoughts, arising from mental health challenges, into more helpful and positive ones via multi-turn dialogues. Clinician shortages and stigma motivate the development of human-LLM interactive psychotherapy for CR. Yet, existing efforts implement CR via simple text rewriting, fixed-pattern dialogues, or a one-shot CR workflow, failing to align with the psychotherapeutic process for effective CR. To address this gap, we propose CRDial, a novel framework for CR, which creates multi-turn dialogues with specifically designed identification and restructuring stages of negative thoughts, integrates sentence-level supportive conversation strategies, and adopts a multi-channel loop mechanism to enable iterative CR. With CRDial, we distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from an LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and 14B scales. Extensive human studies show the superiority of Crispers in pointwise, pairwise, and intervention evaluations.

[84] Low-Resource Neural Machine Translation Using Recurrent Neural Networks and Transfer Learning: A Case Study on English-to-Igbo

Ocheme Anthony Ekle,Biswarup Das

Main category: cs.CL

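TL;DR: Develops RNN-based and Transformer-based English-to-Igbo translation models; adding transfer learning significantly improves translation for this low-resource language.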

Details

Motivation: Address translation for Igbo, a low-resource African language, and fill a gap in existing technology. Method: Uses RNNs (LSTM and GRU) with attention mechanisms, plus transfer learning from MarianNMT pretrained models. Result: The RNN models come close to existing benchmarks; transfer learning yields a gain of +4.83 BLEU, reaching an estimated translation accuracy of 70%. Conclusion: Combining RNNs with transfer learning effectively improves translation for low-resource languages. Abstract: In this study, we develop Neural Machine Translation (NMT) and Transformer-based transfer learning models for English-to-Igbo translation. Igbo is a low-resource African language spoken by over 40 million people across Nigeria and West Africa. Our models are trained on a curated and benchmarked dataset compiled from Bible corpora, local news, Wikipedia articles, and Common Crawl, all verified by native language experts. We leverage Recurrent Neural Network (RNN) architectures, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), enhanced with attention mechanisms to improve translation accuracy. To further enhance performance, we apply transfer learning using MarianNMT pre-trained models within the SimpleTransformers framework. Our RNN-based system achieves competitive results, closely matching existing English-Igbo benchmarks. With transfer learning, we observe a performance gain of +4.83 BLEU points, reaching an estimated translation accuracy of 70%. These findings highlight the effectiveness of combining RNNs with transfer learning to address the performance gap in low-resource language translation tasks.

[85] JurisCTC: Enhancing Legal Judgment Prediction via Cross-Domain Transfer and Contrastive Learning

Zhaolu Kang,Hongtian Cai,Xiangyang Ji,Jinzhe Li,Nanfei Gu

Main category: cs.CL

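TL;DR: JurisCTC is a new model that uses unsupervised domain adaptation (UDA) to improve the accuracy of legal judgment prediction (LJP), in particular by transferring knowledge between the civil and criminal law domains.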

Details

Motivation: UDA has rarely been applied to the legal domain, legal texts are complex, and annotated data are scarce; JurisCTC aims to address these problems. Method: JurisCTC uses contrastive learning to distinguish samples from different domains and facilitates knowledge transfer between legal domains. Result: JurisCTC performs well on LJP, with peak accuracies of 76.59% and 78.83%. Conclusion: JurisCTC shows clear advantages for knowledge transfer across legal domains and offers an effective approach to processing complex legal texts. Abstract: In recent years, Unsupervised Domain Adaptation (UDA) has gained significant attention in the field of Natural Language Processing (NLP) owing to its ability to enhance model generalization across diverse domains. However, its application for knowledge transfer between distinct legal domains remains largely unexplored. To address the challenges posed by lengthy and complex legal texts and the limited availability of large-scale annotated datasets, we propose JurisCTC, a novel model designed to improve the accuracy of Legal Judgment Prediction (LJP) tasks. Unlike existing approaches, JurisCTC facilitates effective knowledge transfer across various legal domains and employs contrastive learning to distinguish samples from different domains. Specifically, for the LJP task, we enable knowledge transfer between civil and criminal law domains. Compared to other models and specific large language models (LLMs), JurisCTC demonstrates notable advancements, achieving peak accuracies of 76.59% and 78.83%, respectively.

[86] Evaluating and Mitigating Bias in AI-Based Medical Text Generation

Xiuying Chen,Tairan Wang,Juexiao Zhou,Zirui Song,Xin Gao,Xiangliang Zhang

Main category: cs.CL

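TL;DR: Examines fairness in medical text generation and proposes a selective optimization algorithm that reduces bias, substantially narrowing performance gaps between groups.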

Details

Motivation: AI systems in medical applications may reflect and amplify human bias, degrading service quality for historically under-served populations, and fairness in text generation remains understudied. Method: Proposes a selective optimization algorithm that considers both word-level accuracy and pathology accuracy while keeping the process differentiable for effective training. Result: Validated across multiple backbones, datasets, and modalities, the algorithm reduces performance disparities between groups by more than 30%, while text generation accuracy typically changes by no more than 2%. Conclusion: The algorithm improves the fairness of medical text generation and mitigates bias from deep learning models; the code is open source. Abstract: Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe significant performance discrepancies across different races, sexes, and age groups, including intersectional groups, various model scales, and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes those underperforming groups to reduce bias. The selection rules take into account not only word-level accuracy but also the pathology accuracy with respect to the target reference, while ensuring that the entire process remains fully differentiable for effective model training. Our evaluations across multiple backbones, datasets, and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance. Specifically, the disparities among various groups across different metrics were diminished by more than 30% with our algorithm, while the relative change in text generation accuracy was typically within 2%. By reducing the bias generated by deep learning models, our proposed approach can potentially alleviate concerns about the fairness and reliability of text generation diagnosis in the medical domain. Our code is publicly available to facilitate further research at https://github.com/iriscxy/GenFair.

[87] CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality

Junyan Zhang,Shuliang Liu,Aiwei Liu,Yubo Gao,Jungang Li,Xiaojie Gu,Xuming Hu

Main category: cs.CL

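TL;DR: CoheMark is an advanced sentence-level watermarking technique that exploits cohesive relationships between sentences to improve logical fluency while keeping text quality high and watermark detection strong.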

Details

Motivation: Existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes, which limits the availability of suitable sentences and lowers the quality of generated content. Method: CoheMark selects sentences with trained fuzzy c-means clustering and applies specific next-sentence selection criteria. Result: Experiments show that CoheMark achieves strong watermark strength while preserving text quality. Conclusion: CoheMark balances text quality and watermark detection well. Abstract: Watermarking technology is a method used to trace the usage of content generated by large language models. Sentence-level watermarking aids in preserving the semantic integrity within individual sentences while maintaining greater robustness. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next-sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.
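
For readers unfamiliar with fuzzy c-means, the sketch below shows the textbook membership rule: a point belongs to every cluster to a degree that falls off with its distance to that cluster's center, with fuzzifier m > 1. How CoheMark turns memberships into sentence selection is not described in the abstract, so this only illustrates the clustering building block; the 2-D point and centers are hypothetical stand-ins for sentence embeddings.

```python
import math
from typing import List, Sequence


def fcm_memberships(point: Sequence[float],
                    centers: Sequence[Sequence[float]],
                    m: float = 2.0) -> List[float]:
    """Standard fuzzy c-means memberships of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the Euclidean
    distance from the point to center i. Memberships sum to 1.
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    dists = [math.dist(point, center) for center in centers]
    # A point sitting exactly on one or more centers belongs fully to them.
    on_center = [i for i, d in enumerate(dists) if d == 0.0]
    if on_center:
        return [1.0 / len(on_center) if i in on_center else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


# Hypothetical embedding at distance 1 from the first center and 3 from the second.
u = fcm_memberships((0.0, 0.0), [(1.0, 0.0), (0.0, 3.0)])  # approximately [0.9, 0.1]
```

Unlike hard k-means assignment, these soft memberships let a sentence count partly toward several clusters.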

[88] FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

Yulia Otmakhova,Hung Thinh Truong,Rahmad Mahendra,Zenan Zhai,Rongxin Zhu,Daniel Beck,Jey Han Lau

Main category: cs.CL

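TL;DR: FLUKE is a task-agnostic framework that evaluates model robustness through systematic minimal variations of test data, covering linguistic levels from orthography to dialect and style.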

Details

Motivation: Evaluate model robustness under different linguistic variations, exposing task dependence and model brittleness. Method: Uses large language models (LLMs) with human validation to generate controlled linguistic variations, and evaluates fine-tuned models and LLMs on four NLP tasks. Result: The impact of linguistic variations depends on the task; LLMs are more robust overall but still brittle; all models are broadly vulnerable to negation modifications. Conclusion: Systematic robustness testing is essential for understanding model behavior. Abstract: We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels, from orthography to dialect and style varieties, and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across four diverse NLP tasks, and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) while LLMs have better overall robustness compared to fine-tuned models, they still exhibit significant brittleness to certain linguistic variations; (3) all models show substantial vulnerability to negation modifications across most tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.

[89] Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection

Zihan Wang,Lu Yuan,Zhengxuan Zhang,Qing Zhao

Main category: cs.CL

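TL;DR: The paper proposes the Dual-Aspect Empathy Framework (DAE), which combines cognitive and emotional empathy to analyze misinformation from both the creator and reader perspectives, and shows experimentally that it outperforms existing methods.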

Details

Motivation: Traditional misinformation detection methods overlook the role of human empathy in propagation; DAE addresses this gap. Method: DAE integrates cognitive and emotional empathy, uses Large Language Models (LLMs) to simulate reader responses, and introduces an empathy-aware filtering mechanism. Result: Experiments show that DAE outperforms existing methods on benchmark datasets. Conclusion: DAE offers a new paradigm for multimodal misinformation detection. Abstract: In the digital era, social media has become a major conduit for information dissemination, yet it also facilitates the rapid spread of misinformation. Traditional misinformation detection methods primarily focus on surface-level features, overlooking the crucial roles of human empathy in the propagation process. To address this gap, we propose the Dual-Aspect Empathy Framework (DAE), which integrates cognitive and emotional empathy to analyze misinformation from both the creator and reader perspectives. By examining creators' cognitive strategies and emotional appeals, as well as simulating readers' cognitive judgments and emotional responses using Large Language Models (LLMs), DAE offers a more comprehensive and human-centric approach to misinformation detection. Moreover, we introduce an empathy-aware filtering mechanism to enhance response authenticity and diversity. Experimental results on benchmark datasets demonstrate that DAE outperforms existing methods, providing a novel paradigm for multimodal misinformation detection.

[90] M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction

Chengguang Gan,Sunbowen Lee,Zhixi Cai,Yanbin Wei,Lei Zheng,Yunhao Liang,Shiwen Ni,Tatsunori Mori

Main category: cs.CL

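TL;DR: The paper extends the Mutual Reinforcement Effect (MRE) to multimodal information extraction for the first time, introducing the Multimodal Mutual Reinforcement Effect (M-MRE) task and a corresponding dataset. The proposed Prompt Format Adapter (PFA) is used to verify that MRE holds in multimodal settings.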

Details

Motivation: Explore whether MRE applies to multimodal domains, filling the research gap in visual and multimodal settings. Method: Introduces the M-MRE task and dataset, and designs a PFA compatible with Large Vision-Language Models. Result: Experiments show that MRE is also observed in multimodal tasks, supporting mutual gains across tasks. Conclusion: MRE generalizes to the multimodal domain, offering a new approach to mutual reinforcement across multimodal tasks. Abstract: Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.

[91] PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare

Jose G. Moreno,Jesus Lovon,M'Rick Robin-Charlet,Christine Damase-Michel,Lynda Tamine

Main category: cs.CL

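TL;DR: The PatientDx framework uses model merging to improve LLM performance on health-predictive tasks without fine-tuning on patient data, avoiding data privacy problems.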

Details

Motivation: Address the reliance of LLM fine-tuning on large amounts of sensitive data and the resulting privacy concerns, particularly in healthcare. Method: Builds on model merging and optimizes a building-block merging strategy, using a pivotal model adapted to numerical reasoning; hyperparameters are tuned on examples without training the LLM on that data. Result: On the mortality tasks of the MIMIC-IV dataset, AUROC improves by up to 7% over the initial models, with less risk of data leakage than fine-tuned models. Conclusion: PatientDx protects data privacy while maintaining performance, making it suitable for sensitive domains. Abstract: Fine-tuning of Large Language Models (LLMs) has become the default practice for improving model performance on a given task. However, performance improvement comes at the cost of training on vast amounts of annotated data which could be sensitive, leading to significant data privacy concerns. In particular, the healthcare domain is one of the most sensitive domains exposed to data privacy issues. In this paper, we present PatientDx, a framework of model merging that allows the design of effective LLMs for health-predictive tasks without requiring fine-tuning or adaptation on patient data. Our proposal is based on recently proposed techniques known as merging of LLMs and aims to optimize a building block merging strategy. PatientDx uses a pivotal model adapted to numerical reasoning and tunes hyperparameters on examples based on a performance metric but without training of the LLM on these data. Experiments using the mortality tasks of the MIMIC-IV dataset show improvements up to 7% in terms of AUROC when compared to initial models. Additionally, we confirm that when compared to fine-tuned models, our proposal is less prone to data leak problems without hurting performance. Finally, we qualitatively show the capabilities of our proposal through a case study. Our best model is publicly available at https://huggingface.co/Jgmorenof/mistral_merged_0_4.

[92] LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams

Yongxuan Wu,Runyu Chen,Peiyu Liu,Hongjin Qian

Main category: cs.CL

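TL;DR: The paper builds the first redundancy-rich long-text dataset derived from live streams, evaluates existing methods on long-context understanding, and proposes a new baseline.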

Details

Motivation: Existing long-text datasets fail to reflect the complexity and redundancy of real dialogue, which limits how well models transfer to practical scenarios. Method: Builds a long-text dataset from live streams, designs three task categories (retrieval-dependent, reasoning-dependent, and hybrid), and evaluates existing LLMs and specialized methods. Result: Existing methods perform poorly on redundant inputs, and no single method is best across all tasks; the proposed baseline handles redundancy better. Conclusion: The study exposes limitations of current methods, points to directions for improving long-context understanding, and provides a foundation for developing real-world e-commerce systems. Abstract: Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios. We construct tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. We then evaluate both popular LLMs and specialized methods to assess their ability to understand long contexts in these tasks. Our results show that current methods exhibit strong task-specific preferences and perform poorly on highly redundant inputs, with no single method consistently outperforming others. We propose a new baseline that better handles redundancy in spoken text and achieves strong performance across tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding. Finally, our benchmark fills a gap in evaluating long-context spoken language understanding and provides a practical foundation for developing real-world e-commerce systems. The code and benchmark are available at https://github.com/Yarayx/livelongbench.

[93] PicPersona-TOD : A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona

Jihyun Lee,Yejin Jeon,Seungyeon Seo,Gary Geunbae Lee

Main category: cs.CL

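TL;DR: The paper introduces the PicPersona-TOD dataset and the Pictor model, which use user images to personalize dialogue responses and improve user experience.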

Details

Motivation: Existing task-oriented dialogue systems produce monotonous responses that lack personalization and do not adapt to user attributes. Method: Uses user images as part of the persona and builds the PicPersona-TOD dataset with first impressions, dialogue policy-guided prompting, and external knowledge to reduce hallucinations; also develops the Pictor model. Result: Human evaluations confirm that personalized responses improve user experience, and Pictor performs robustly on unseen domains. Conclusion: PicPersona-TOD and Pictor offer a more personalized and effective approach to task-oriented dialogue systems. Abstract: Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains (https://github.com/JihyunLee1/PicPersona).

[94] Creating Targeted, Interpretable Topic Models with LLM-Generated Text Augmentation

Anna Lieb,Maneesh Arora,Eni Mustafaraj

Main category: cs.CL

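TL;DR: The paper examines how LLM-generated text augmentation can make topic modeling more useful and interpretable, and evaluates GPT-4-augmented topic modeling on a political science case study.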

Details

Motivation: Topic models used in social science research are limited in interpretability and practicality; the paper explores whether LLM-generated text augmentation can address this. Method: Augments topic modeling with GPT-4-generated text and evaluates the result on a political science case study. Result: Topic modeling with GPT-4 augmentations produces highly interpretable categories suited to domain-specific research questions. Conclusion: LLM-generated text augmentation makes topic modeling more useful and interpretable while requiring minimal human guidance. Abstract: Unsupervised machine learning techniques, such as topic modeling and clustering, are often used to identify latent patterns in unstructured text data in fields such as political science and sociology. These methods overcome common concerns about reproducibility and costliness involved in the labor-intensive process of human qualitative analysis. However, two major limitations of topic models are their interpretability and their practicality for answering targeted, domain-specific social science research questions. In this work, we investigate opportunities for using LLM-generated text augmentation to improve the usefulness of topic modeling output. We use a political science case study to evaluate our results in a domain-specific application, and find that topic modeling using GPT-4 augmentations creates highly interpretable categories that can be used to investigate domain-specific research questions with minimal human guidance.

[95] Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation

Xin Yi,Shunfan Zheng,Linlin Wang,Xiaoling Wang,Liang He

Main category: cs.CL

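TL;DR: The paper proposes CDG-KD, a unified framework for bidirectional attacks (watermark scrubbing and spoofing) under unauthorized knowledge distillation, and demonstrates its effectiveness.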

Details

Motivation: Study the robustness and unforgeability of watermarks under unauthorized knowledge distillation, which prior work has left largely unexplored. Method: Uses contrastive decoding and bidirectional distillation to perform watermark scrubbing and watermark spoofing, respectively. Result: Experiments show that CDG-KD attacks watermarks effectively while preserving model performance. Conclusion: The findings underline the importance of developing watermarking schemes that are robust and unforgeable. Abstract: Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore the critical need for developing watermarking schemes that are robust and unforgeable.

[96] HalluLens: LLM Hallucination Benchmark

Yejin Bang,Ziwei Ji,Alan Schelten,Anthony Hartshorn,Tara Fowler,Cheng Zhang,Nicola Cancedda,Pascale Fung

Main category: cs.CL

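TL;DR: The paper presents a comprehensive hallucination benchmark that addresses LLM hallucination through a clear taxonomy and dynamic test set generation, promoting consistency in research.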

Details

Motivation: Address hallucination, where LLM output is inconsistent with user input or training data, in order to improve user trust and the adoption of generative AI systems. Method: Introduces new extrinsic and existing intrinsic evaluation tasks, establishes a clear hallucination taxonomy, and generates test sets dynamically to prevent data leakage. Result: Presents a unified hallucination benchmark, analyzes the limitations of existing benchmarks, and separates hallucination from factuality evaluation. Conclusion: With a clear taxonomy and dynamic test sets, the work provides a standardized framework for LLM hallucination research. Abstract: Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination." These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from "factuality," proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations, to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against such leakage. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations, (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation by leakage, (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.

[97] When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Rei Higuchi,Ryotaro Kawata,Naoki Nishikawa,Kazusato Oko,Shoichiro Yamaguchi,Sosuke Kobayashi,Seiya Tokui,Kohei Hayashi,Daisuke Okanohara,Taiji Suzuki

Main category: cs.CL

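TL;DR: The study examines how prepending metadata to pre-training data affects language model performance, and finds that the effect depends on whether latent semantics can be inferred from the downstream task's prompt.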

Details

Motivation: Understand how adding metadata during pre-training affects model performance, and in particular why it helps on some downstream tasks but hurts on others. Method: Analyzes model behavior on artificial data, such as data generated by probabilistic context-free grammars, to study the effect of metadata. Result: Metadata improves performance when the context is long enough to infer the latent semantics, but degrades it when the context lacks the necessary information. Conclusion: The usefulness of metadata depends on whether the downstream prompt allows the latent semantics to be inferred, so it should be applied with the task in mind. Abstract: The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task's prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve the model's performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.

[98] DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training

Xiaoyu Tian,Sitong Zhao,Haotian Wang,Shuaiting Chen,Yiping Peng,Yunjie Ji,Han Zhao,Xiangang Li

Main category: cs.CL

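TL;DR: By building a large-scale, difficulty-graded reasoning dataset and refining training data selection for base models, the paper substantially improves model reasoning capability.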

Details

Motivation: Address the academic community's lack of in-depth understanding of base model training processes and data quality. Method: Builds a large-scale dataset of approximately 3.34 million unique queries and about 40 million distilled responses, selects the most valuable data using pass rate and Coefficient of Variation (CV), and adjusts the learning rate for reasoning-focused training. Result: Achieves a 79.2% pass rate on the AIME2024 mathematical reasoning benchmark, surpassing most distilled models and approaching state-of-the-art performance. Conclusion: The datasets and methods are publicly released to support rapid progress in open-source long-reasoning LLMs. Abstract: Although large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks, the academic community still lacks an in-depth understanding of base model training processes and data quality. To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty levels and about 40 million distilled responses generated by multiple models over several passes. Leveraging pass rate and Coefficient of Variation (CV), we precisely select the most valuable training data to enhance reasoning capability. Notably, we observe a training pattern shift, indicating that reasoning-focused training based on base models requires higher learning rates for effective training. Using this carefully selected data, we significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark. This result surpasses most current distilled models and closely approaches state-of-the-art performance. We provide detailed descriptions of our data processing, difficulty assessment, and training methodology, and have publicly released all datasets and methods to promote rapid progress in open-source long-reasoning LLMs. The dataset is available at: https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M
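To make the two selection statistics concrete, here is a minimal sketch of how pass rate and Coefficient of Variation (CV) could be computed per query. The abstract does not say how the two are combined, so the filter rule, the thresholds, and the sample data below are purely hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0
    return statistics.pstdev(values) / mean

def select_queries(pass_rates_by_query, max_pass_rate=0.99, min_cv=0.3):
    """Hypothetical filter: keep queries that are not trivially solved
    and on which the models disagree noticeably."""
    selected = []
    for query, rates in pass_rates_by_query.items():
        mean_rate = statistics.fmean(rates)
        if mean_rate <= max_pass_rate and coefficient_of_variation(rates) >= min_cv:
            selected.append(query)
    return selected

# Made-up per-model pass rates for three queries.
pass_rates = {
    "q1": [1.0, 1.0, 1.0],  # every model always succeeds
    "q2": [0.0, 0.5, 1.0],  # models disagree strongly
    "q3": [0.0, 0.0, 0.0],  # every model always fails
}
print(select_queries(pass_rates))
```

In this toy setup only the query on which the models disagree survives the filter; a real pipeline would need its own criteria.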

[99] RAGAT-Mind: A Multi-Granular Modeling Approach for Rumor Detection Based on MindSpore

Zhenkai Qin,Guifang Yang,Dongze Wu

Main category: cs.CL

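TL;DR: RAGAT-Mind is a multi-granular Chinese rumor detection model built on the MindSpore framework that combines several deep learning techniques and performs strongly in experiments.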

Details

Motivation: False information is widespread on social media, making rumor detection an important challenge in natural language processing. Method: The model integrates TextCNN, bidirectional GRU, Multi-Head Self-Attention, and Bidirectional Graph Convolutional Networks (BiGCN) to extract multi-granular features. Result: On the Weibo1-Rumor dataset, the model reaches 99.2% accuracy and a macro-F1 score of 0.9919. Conclusion: By combining hierarchical linguistic features with graph-based semantic structures, RAGAT-Mind shows strong generalization and practical value. Abstract: As false information continues to proliferate across social media platforms, effective rumor detection has emerged as a pressing challenge in natural language processing. This paper proposes RAGAT-Mind, a multi-granular modeling approach for Chinese rumor detection, built upon the MindSpore deep learning framework. The model integrates TextCNN for local semantic extraction, bidirectional GRU for sequential context learning, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs. Experiments on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior classification performance, attaining 99.2% accuracy and a macro-F1 score of 0.9919. The results validate the effectiveness of combining hierarchical linguistic features with graph-based semantic structures. Furthermore, the model exhibits strong generalization and interpretability, highlighting its practical value for real-world rumor detection applications.

[100] Towards a comprehensive taxonomy of online abusive language informed by machine leaning

Samaneh Hosseini Moghaddam,Kelly Lyons,Cheryl Regehr,Vivek Goel,Kaitlyn Regehr

Main category: cs.CL

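TL;DR: The paper proposes a hierarchical, faceted taxonomy for classifying abusive language in online text, integrating the classification systems of 18 multi-label datasets into 5 categories and 17 dimensions.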

Details

Motivation: The spread of abusive language online poses significant risks to the health and wellbeing of individuals and communities, so effective methods are needed to identify and mitigate harmful content. Method: Develops the taxonomy with a systematic method, integrating the classification systems of 18 existing multi-label datasets into a hierarchical, faceted framework. Result: The taxonomy comprises 5 categories and 17 dimensions, covering the context, target, intensity, directness, and theme of abuse. Conclusion: The taxonomy gives researchers, policy makers, and platform owners a shared understanding that supports collaboration and progress in online abuse detection and mitigation. Abstract: The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.

[101] Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics

Zena Al-Khalili,Nick Howell,Dietrich Klakow

Main category: cs.CL

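TL;DR: The paper analyzes how well the programs that code-assisted LLMs generate for math reasoning tasks are grounded in math rules, finding that grounding depends on model capability and problem difficulty, and that open-source models do worse.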

Details

Motivation: Existing evaluations look only at execution correctness and do not analyze the generated programs in depth; this paper fills that gap. Method: Manually and automatically evaluates the programs generated by five LLMs on two math datasets, focusing on how they apply math rules. Result: The degree of grounding in math rules varies with model capability and problem difficulty; closed-source models do better, while open-source models do worse. Conclusion: Evaluation needs to go beyond execution accuracy to understand the capabilities and limits of code-assisted LLMs in the math domain. Abstract: Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs' generated programs in response to math reasoning tasks. Our evaluation focuses on the extent to which LLMs ground their programs to math rules, and how that affects their end performance. For this purpose, we assess the generations of five different LLMs, on two different math datasets, both manually and automatically. Our results reveal that the distribution of grounding depends on LLMs' capabilities and the difficulty of math problems. Furthermore, mathematical grounding is more effective for closed-source models, while open-source models fail to employ math rules in their solutions correctly. On MATH500, the percentage of grounded programs decreased to half, while the ungrounded generations doubled in comparison to ASDiv grade-school problems. Our work highlights the need for in-depth evaluation beyond execution accuracy metrics, toward a better understanding of code-assisted LLMs' capabilities and limits in the math domain.

[102] Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction

Yuanchang Ye,Weiyan Wen

Main category: cs.CL

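TL;DR: Proposes a Split Conformal Prediction (SCP) framework to mitigate hallucination in Large Vision-Language Models (LVLMs) on Visual Question Answering (VQA) tasks, quantifying uncertainty through dynamic threshold calibration and cross-modal consistency verification.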

Details

Motivation: LVLMs perform well at multi-modal reasoning, but their outputs often contain hallucinated content delivered with high confidence, which is a risk in safety-critical applications. Method: Applies the SCP framework: data is partitioned into calibration and test sets, nonconformity scores are computed, and prediction sets with statistical guarantees are constructed; prediction set sizes adjust dynamically, and no prior distribution assumptions are needed. Result: On the ScienceQA and MMMU benchmarks, the SCP framework meets its theoretical guarantees at all $\alpha$ values and is stable across calibration-to-test split ratios. Conclusion: The work offers a scalable approach to hallucination detection and uncertainty-aware decision-making in multi-modal AI systems, bridging theoretical reliability and practical applicability. Abstract: This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often exhibit hallucinated content with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous control of marginal coverage to ensure empirical error rates remain strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all $\alpha$ values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
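The calibrate-then-predict procedure described above can be sketched with textbook split conformal prediction. This is a minimal illustration, not the paper's implementation: the nonconformity score (one minus the probability assigned to an answer option), the calibration scores, and the test probabilities are all assumptions, and the cross-modal consistency check is not modeled.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, there are too few calibration points for the
    requested alpha, so the threshold is infinite (every option is kept).
    """
    n = len(cal_scores)
    # The small epsilon guards against floating-point round-up before ceil.
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]

def prediction_set(option_probs, threshold):
    """Keep every answer option whose score 1 - p does not exceed the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= threshold]

# Hypothetical calibration scores: 1 - p(correct answer) on held-out questions.
cal_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6]
q_hat = conformal_threshold(cal_scores, alpha=0.1)

# Hypothetical option probabilities for one test question.
test_probs = {"A": 0.45, "B": 0.42, "C": 0.08, "D": 0.05}
print(q_hat, prediction_set(test_probs, q_hat))
```

A smaller alpha raises the threshold and so enlarges the prediction set, matching the inverse relationship between set size and alpha noted in the abstract.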

[103] Energy Considerations of Large Language Model Inference and Efficiency Optimizations

Jared Fernandez,Clara Na,Vashisth Tiwari,Yonatan Bisk,Sasha Luccioni,Emma Strubell

Main category: cs.CL

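TL;DR: The paper analyzes the energy implications of inference optimizations for large language models (LLMs), proposes a workload modeling approach, and shows that the effect of optimizations varies across workloads, with energy savings of up to 73%.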

Details

Motivation: As LLMs grow in size and adoption, their computational and environmental costs rise; prior work mostly targets latency in idealized settings and overlooks how real-world workloads shape energy use. Method: Models workloads with a binning strategy over input-output token distributions and batch size variations, and analyzes how software frameworks, decoding strategies, GPU architectures, and other factors affect energy efficiency. Result: The effectiveness of inference optimizations depends heavily on workload characteristics, software stack, and hardware; real-world energy consumption is far higher than theoretical estimates; with proper optimization, energy use can be reduced by up to 73%. Conclusion: The study provides a foundation for sustainable LLM deployment and guidance for energy-efficient design of future AI infrastructure. Abstract: As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.

[104] Ensemble Bayesian Inference: Leveraging Small Language Models to Achieve LLM-level Accuracy in Profile Matching Tasks

Haru-Tada Sato,Fuka Matsuzaki,Jun-ichiro Takahashi

Main category: cs.CL

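TL;DR: The study proposes Ensemble Bayesian Inference (EBI), which uses Bayesian estimation to combine predictions from multiple small language models (SLMs) so that the ensemble exceeds individual models and reaches accuracy comparable to large language models (LLMs).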

Details

Motivation: Explore how ensembles of small language models can match the accuracy of proprietary large language models while requiring fewer computational resources. Method: Proposes Ensemble Bayesian Inference (EBI), which combines the judgments of multiple SLMs through Bayesian estimation, and validates it on tasks in Japanese and English. Result: Experiments show that EBI is effective across languages and that adding weaker models can even improve ensemble performance. Conclusion: EBI suggests new ways to build high-performance AI systems with limited resources and to make effective use of individually weaker models. Abstract: This study explores the potential of small language model (SLM) ensembles to achieve accuracy comparable to proprietary large language models (LLMs). We propose Ensemble Bayesian Inference (EBI), a novel approach that applies Bayesian estimation to combine judgments from multiple SLMs, allowing them to exceed the performance limitations of individual models. Our experiments on diverse tasks (aptitude assessments and consumer profile analysis in both Japanese and English) demonstrate EBI's effectiveness. Notably, we analyze cases where incorporating models with negative Lift values into ensembles improves overall performance, and we examine the method's efficacy across different languages. These findings suggest new possibilities for constructing high-performance AI systems with limited computational resources and for effectively utilizing models with individually lower performance. Building on existing research on LLM performance evaluation, ensemble methods, and open-source LLM utilization, we discuss the novelty and significance of our approach.
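The abstract does not give the EBI update rule, so the sketch below shows only a generic naive-Bayes combination of model votes, as one plausible reading of "Bayesian estimation to combine judgments". The labels, the per-model likelihood tables, and the conditional-independence assumption are all hypothetical.

```python
def bayes_combine(prior, likelihoods, votes):
    """Posterior over labels given one vote per model.

    prior:       dict label -> P(label)
    likelihoods: one dict per model, mapping (vote, label) -> P(vote | label)
    votes:       one vote per model, in the same order as likelihoods
    Models are assumed conditionally independent given the true label.
    """
    posterior = dict(prior)
    for table, vote in zip(likelihoods, votes):
        for label in posterior:
            posterior[label] *= table[(vote, label)]
    total = sum(posterior.values())
    return {label: p / total for label, p in posterior.items()}

prior = {"match": 0.5, "no_match": 0.5}
# Model A is fairly reliable; model B tends to vote against the true label.
model_a = {("match", "match"): 0.8, ("no_match", "match"): 0.2,
           ("match", "no_match"): 0.3, ("no_match", "no_match"): 0.7}
model_b = {("match", "match"): 0.4, ("no_match", "match"): 0.6,
           ("match", "no_match"): 0.6, ("no_match", "no_match"): 0.4}

posterior = bayes_combine(prior, [model_a, model_b], ["match", "no_match"])
print(posterior)
```

Here model B's dissenting vote actually raises the posterior for "match", because the model is known to lean the wrong way. This is one intuition for how an individually weak model can still help an ensemble, though it is not necessarily the mechanism behind the paper's negative-Lift result.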

[105] Safety in Large Reasoning Models: A Survey

Cheng Wang,Yue Liu,Baolong Li,Duzhen Zhang,Zhongzhi Li,Junfeng Fang

Main category: cs.CL

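TL;DR: This survey reviews the safety risks, attacks, and defense strategies of Large Reasoning Models (LRMs), organizing them into a taxonomy that gives future research a clear framework.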

Details

Motivation: As LRMs show strong reasoning ability in tasks such as mathematics and coding, their vulnerabilities and safety risks have become a major challenge for real-world use and need to be surveyed systematically. Method: Conducts a comprehensive survey and categorization of LRM safety risks, attacks, and defenses. Result: Presents a detailed taxonomy that lays out the current safety landscape of LRMs. Conclusion: The survey gives structured guidance for improving the security and reliability of LRMs and for future research. Abstract: Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.

[106] Multilingual Performance Biases of Large Language Models in Education

Vansh Gupta,Sankalan Pal Chowdhury,Vilém Zouhar,Donya Rooein,Mrinmaya Sachan

Main category: cs.CL

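TL;DR: The study evaluates large language models (LLMs) on educational tasks in six non-English languages, finds that performance tracks how well each language is represented in training data, and recommends verifying target-language performance before deployment.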

Details

Motivation: Current LLMs are primarily English-centric; the study checks whether their use in non-English educational settings is warranted. Method: Evaluates popular LLMs on four educational tasks (identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations) in six languages in addition to English. Result: Performance roughly tracks how well each language is represented in training data; lower-resource languages perform worse, and the drop from English is often significant. Conclusion: Practitioners should verify that an LLM performs well in the target language for their educational task before deployment. Abstract: Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.

[107] Conversational Assistants to support Heart Failure Patients: comparing a Neurosymbolic Architecture with ChatGPT

Anuja Tayal,Devika Salunke,Barbara Di Eugenio,Paula Allen-Meares,Eulalia Puig Abril,Olga Garcia,Carolyn Dickens,Andrew Boyd

Main category: cs.CL

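TL;DR: The study compares two conversational assistants (one neurosymbolic, one based on ChatGPT) that help heart failure patients ask about salt content in food. The neurosymbolic system is more accurate, completes more tasks, and is less verbose; the ChatGPT-based system makes fewer speech errors and needs fewer clarifications; patients show no preference.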

Details

Motivation: As Large Language Models spread, conversational assistants are increasingly used in healthcare, and evaluations with real users are needed to compare traditional architectures with generative AI. Method: A within-group user study comparing an in-house system with a neurosymbolic architecture against a ChatGPT-based version. Result: The in-house system is more accurate, completes more tasks, and is less verbose; the ChatGPT-based version makes fewer speech errors and requires fewer clarifications; patients show no preference. Conclusion: Each architecture has strengths and weaknesses, so the choice in practice depends on the requirements. Abstract: Conversational assistants are becoming more and more popular, including in healthcare, partly because of the availability and capabilities of Large Language Models. There is a need for controlled, probing evaluations with real stakeholders which can highlight advantages and disadvantages of more traditional architectures and those based on generative AI. We present a within-group user study to compare two versions of a conversational assistant that allows heart failure patients to ask about salt content in food. One version of the system was developed in-house with a neurosymbolic architecture, and one is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks and is less verbose than the one based on ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.

[108] The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Piotr Nawrot,Robert Li,Renjie Huang,Sebastian Ruder,Kelly Marchisio,Edoardo M. Ponti

Main category: cs.CL

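TL;DR: Sparse attention is a promising way to extend the long-context capabilities of Transformer LLMs, but its efficiency-accuracy trade-offs and scaling behavior have not been studied systematically. This paper evaluates sparse attention across model scales, sequence lengths, and sparsity levels, and proposes new scaling laws tailored to sparse attention.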

Details

Motivation: To examine the viability of sparse attention in Transformer LLMs, its efficiency-accuracy trade-offs, and its scaling behavior, which prior work has left largely unexplored. Method: Training-free sparse attention methods are compared across model scales, sequence lengths, and sparsity levels on a diverse set of long-sequence tasks. Result: 1) For very long sequences, larger and highly sparse models are preferable to smaller dense ones; 2) the sparsity attainable without losing accuracy is higher during decoding than during prefilling, and correlates with model size during decoding; 3) no single strategy works best across tasks and phases; 4) new scaling laws tailored to sparse attention are introduced and validated. Conclusion: Sparse attention is a key tool for processing longer sequences with Transformer LLMs, but its trade-offs need careful evaluation in performance-sensitive applications. Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) an isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.

cs.RO [Back]

[109] BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring

Asier Bikandi,Muhammad Shaheer,Hriday Bavle,Jayan Jevanesan,Holger Voos,Jose Luis Sanchez-Lopez

Main category: cs.RO

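TL;DR: Proposes a BIM-aware drift correction method that aligns planes detected in the real environment with planes in the BIM model and optimizes the transformation between the SLAM and BIM frames, substantially reducing drift error in construction monitoring.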

Details

Motivation: Construction sites are challenging for traditional tracking methods: featureless surfaces, dynamic changes, and accumulated drift misalign digital models with the physical world and degrade AR applications. Method: BIM serves as prior structural knowledge; planes detected in the real environment are matched to planes in the BIM model, and an optimization computes the transformation between the SLAM and BIM coordinate frames to reduce drift. Result: Real-world experiments show substantially lower drift-induced error, with average reductions of 52.24% in angular deviation and 60.8% in distance error of the matched walls relative to the user's initial manual alignment. Conclusion: BIM-aware drift correction improves long-term localization and AR visualization accuracy in construction monitoring. Abstract: Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align "as-built" detected planes from the real-world environment with "as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.

physics.plasm-ph [Back]

[110] Plasma State Monitoring and Disruption Characterization using Multimodal VAEs

Yoeri Poels,Alessandro Pau,Christian Donner,Giulio Romanelli,Olivier Sauter,Cristina Venturini,Vlado Menkovski,the TCV team,the WPTE team

Main category: physics.plasm-ph

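TL;DR: Proposes a data-driven method based on Variational Autoencoders (VAEs) that learns an interpretable representation of the plasma state for characterizing and distinguishing plasma disruptions in tokamaks.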

Details

Motivation: Plasma disruptions are one of the key challenges for future tokamak devices, yet they are not fully understood, and data-driven predictors offer limited interpretability. Method: The VAE framework is extended with continuous projections of plasma trajectories, a multimodal structure that separates operating regimes, and separation with respect to disruptive regimes. Result: On a dataset of approximately 1600 TCV discharges, the method identifies distinct operating regimes and their relation to disruptions. Conclusion: The method identifies, in an interpretable way, operating regimes with varying proximity to disruptions and supports downstream analyses. Abstract: When a plasma disrupts in a tokamak, significant heat and electromagnetic loads are deposited onto the surrounding device components. These forces scale with plasma current and magnetic field strength, making disruptions one of the key challenges for future devices. Unfortunately, disruptions are not fully understood, with many different underlying causes that are difficult to anticipate. Data-driven models have shown success in predicting them, but they only provide limited interpretability. On the other hand, large-scale statistical analyses have been a great asset to understanding disruptive patterns. In this paper, we leverage data-driven methods to find an interpretable representation of the plasma state for disruption characterization. Specifically, we use a latent variable model to represent diagnostic measurements as a low-dimensional, latent representation. We build upon the Variational Autoencoder (VAE) framework, and extend it for (1) continuous projections of plasma trajectories; (2) a multimodal structure to separate operating regimes; and (3) separation with respect to disruptive regimes. Subsequently, we can identify continuous indicators for the disruption rate and the disruptivity based on statistical properties of measurement data. The proposed method is demonstrated using a dataset of approximately 1600 TCV discharges, selecting for flat-top disruptions or regular terminations. We evaluate the method with respect to (1) the identified disruption risk and its correlation with other plasma properties; (2) the ability to distinguish different types of disruptions; and (3) downstream analyses. For the latter, we conduct a demonstrative study on identifying parameters connected to disruptions using counterfactual-like analysis. Overall, the method can adequately identify distinct operating regimes characterized by varying proximity to disruptions in an interpretable manner.

cs.SI [Back]

[111] S2Vec: Self-Supervised Geospatial Embeddings

Shushman Choudhury,Elad Aharoni,Chandrakumari Suvarna,Iveel Tsogsuren,Abdul Rahman Kreidieh,Chun-Ta Lu,Neha Arora

Main category: cs.SI

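TL;DR: S2Vec is a self-supervised framework for learning general-purpose geospatial embeddings. It partitions areas with the S2 Geometry library, rasterizes feature vectors within cells as images, and applies masked autoencoding; it performs competitively on several socioeconomic prediction tasks.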

Details

Motivation: Scalable, general-purpose representations of the built environment are crucial for geospatial AI applications. Method: The S2 Geometry library partitions large areas into discrete S2 cells; built-environment feature vectors within cells are rasterized as images, and masked autoencoding on these images produces the embeddings. Result: On three large-scale socioeconomic prediction tasks, S2Vec is competitive with state-of-the-art image-based embeddings, and combining it with image-based embeddings often improves performance further. Conclusion: S2Vec learns effective general-purpose geospatial representations and complements other data modalities in geospatial AI. Abstract: Scalable general-purpose representations of the built environment are crucial for geospatial artificial intelligence applications. This paper introduces S2Vec, a novel self-supervised framework for learning such geospatial embeddings. S2Vec uses the S2 Geometry library to partition large areas into discrete S2 cells, rasterizes built environment feature vectors within cells as images, and applies masked autoencoding on these rasterized images to encode the feature vectors. This approach yields task-agnostic embeddings that capture local feature characteristics and broader spatial relationships. We evaluate S2Vec on three large-scale socioeconomic prediction tasks, showing its competitive performance against state-of-the-art image-based embeddings. We also explore the benefits of combining S2Vec embeddings with image-based embeddings downstream, showing that such multimodal fusion can often improve performance. Our results highlight how S2Vec can learn effective general-purpose geospatial representations and how it can complement other data modalities in geospatial artificial intelligence.

cs.CY [Back]

[112] Seeing The Words: Evaluating AI-generated Biblical Art

Hidde Makimei,Shuai Wang,Willem van Peursen

Main category: cs.CY

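TL;DR: The paper examines how well AI can generate images from biblical text, provides a dataset of over 7K generated images, evaluates them with multiple neural network-based tools, and analyzes them from religious and aesthetic perspectives.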

Details

Motivation: To examine whether AI can generate images from biblical text that are accurate with respect to the corresponding biblical contexts and backgrounds, given the lack of systematic evaluation. Method: A large dataset of over 7K images is generated using biblical text as prompts and evaluated on various aspects with multiple neural network-based tools. Result: The paper provides an assessment of accuracy and some analysis from the perspectives of religion and aesthetics. Conclusion: The paper discusses the use of the generated images and reflects on the performance of the AI generators. Abstract: The past years witnessed a significant amount of Artificial Intelligence (AI) tools that can generate images from texts. This triggers the discussion of whether AI can generate accurate images using text from the Bible with respect to the corresponding biblical contexts and backgrounds. Despite some existing attempts at a small scale, little work has been done to systematically evaluate these generated images. In this work, we provide a large dataset of over 7K images using biblical text as prompts. These images were evaluated with multiple neural network-based tools on various aspects. We provide an assessment of accuracy and some analysis from the perspective of religion and aesthetics. Finally, we discuss the use of the generated images and reflect on the performance of the AI generators.

eess.IV [Back]

[113] Anatomy-constrained modelling of image-derived input functions in dynamic PET using multi-organ segmentation

Valentin Langer,Kartikay Tehlan,Thomas Wendler

Main category: eess.IV

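TL;DR: Proposes a multi-organ segmentation-based approach that integrates image-derived input functions from several vascular regions to improve kinetic modelling in dynamic PET.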

Details

Motivation: Traditional approaches obtain the input function from the aorta only, neglecting anatomical variations and complex vascular contributions, which limits modelling accuracy. Method: High-resolution CT segmentations of the liver, lungs, kidneys, and bladder are used to integrate input functions from the aorta, portal vein, pulmonary artery, and ureters. Result: On dynamic PET data from nine patients, the mean squared error decreases by 13.39% for the liver and 10.42% for the lungs. Conclusion: Multiple input functions could improve anatomical modelling and help bring tracer kinetic modelling into clinical routine. Abstract: Accurate kinetic analysis of [$^{18}$F]FDG distribution in dynamic positron emission tomography (PET) requires anatomically constrained modelling of image-derived input functions (IDIFs). Traditionally, IDIFs are obtained from the aorta, neglecting anatomical variations and complex vascular contributions. This study proposes a multi-organ segmentation-based approach that integrates IDIFs from the aorta, portal vein, pulmonary artery, and ureters. Using high-resolution CT segmentations of the liver, lungs, kidneys, and bladder, we incorporate organ-specific blood supply sources to improve kinetic modelling. Our method was evaluated on dynamic [$^{18}$F]FDG PET data from nine patients, resulting in a mean squared error (MSE) reduction of $13.39\%$ for the liver and $10.42\%$ for the lungs. These initial results highlight the potential of multiple IDIFs in improving anatomical modelling and fully leveraging dynamic PET imaging. This approach could facilitate the integration of tracer kinetic modelling into clinical routine.

[114] Physiological neural representation for personalised tracer kinetic parameter estimation from dynamic PET

Kartikay Tehlan,Thomas Wendler

Main category: eess.IV

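TL;DR: Proposes a personalized kinetic parameter estimation method for dynamic PET based on implicit neural representations (INRs), addressing the heavy computation of conventional methods and the large data requirements of deep neural networks.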

Details

Motivation: Conventional voxel-wise kinetic parameter estimation for PET is computationally intensive and limited by spatial resolution, while deep neural networks (DNNs) require large training datasets and significant computational resources. Method: INRs learn continuous functions and are combined with anatomical priors from a 3D CT foundation model to enable efficient, high-resolution parametric imaging. Result: On a dynamic PET/CT dataset, the method shows higher spatial resolution, lower mean-squared error, and better anatomical consistency than state-of-the-art DNNs. Conclusion: INRs show potential for personalized, data-efficient tracer kinetic modelling, with applications in tumour characterization, segmentation, and prognostic assessment. Abstract: Dynamic positron emission tomography (PET) with [$^{18}$F]FDG enables non-invasive quantification of glucose metabolism through kinetic analysis, often modelled by the two-tissue compartment model (TCKM). However, voxel-wise kinetic parameter estimation using conventional methods is computationally intensive and limited by spatial resolution. Deep neural networks (DNNs) offer an alternative but require large training datasets and significant computational resources. To address these limitations, we propose a physiological neural representation based on implicit neural representations (INRs) for personalized kinetic parameter estimation. INRs, which learn continuous functions, allow for efficient, high-resolution parametric imaging with reduced data requirements. Our method also integrates anatomical priors from a 3D CT foundation model to enhance robustness and precision in kinetic modelling. We evaluate our approach on an [$^{18}$F]FDG dynamic PET/CT dataset and compare it to state-of-the-art DNNs. Results demonstrate superior spatial resolution, lower mean-squared error, and improved anatomical consistency, particularly in tumour and highly vascularized regions. Our findings highlight the potential of INRs for personalized, data-efficient tracer kinetic modelling, enabling applications in tumour characterization, segmentation, and prognostic assessment.
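For readers unfamiliar with the two-tissue compartment model mentioned in the abstract, a minimal forward simulation may help. This sketch assumes the textbook form of the model (dC1/dt = K1·Cp − (k2 + k3)·C1 + k4·C2 and dC2/dt = k3·C1 − k4·C2); the abstract does not state which variant the authors use. The function name `simulate_two_tissue`, the constant input function, and all rate constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulate_two_tissue(t, cp, K1, k2, k3, k4=0.0):
    """Forward-Euler integration of the standard two-tissue compartment model.

    t  : sample times (minutes), increasing
    cp : plasma input function sampled at t (hypothetical here)
    Returns the free (c1) and bound/metabolized (c2) tissue concentrations.
    """
    c1 = np.zeros_like(t, dtype=float)
    c2 = np.zeros_like(t, dtype=float)
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        dc1 = K1 * cp[i - 1] - (k2 + k3) * c1[i - 1] + k4 * c2[i - 1]
        dc2 = k3 * c1[i - 1] - k4 * c2[i - 1]
        c1[i] = c1[i - 1] + dt * dc1
        c2[i] = c2[i - 1] + dt * dc2
    return c1, c2

# Illustrative setup: constant unit input, irreversible trapping (k4 = 0).
t = np.linspace(0.0, 60.0, 6001)   # 60 minutes, step 0.01
cp = np.ones_like(t)
c1, c2 = simulate_two_tissue(t, cp, K1=0.1, k2=0.2, k3=0.05)
tissue = c1 + c2                   # what a PET voxel would see, ignoring blood volume
print(f"C1 at 60 min: {c1[-1]:.3f}, C2 at 60 min: {c2[-1]:.3f}")
```

Parameter estimation, whether voxel-wise fitting, a DNN, or an INR, amounts to inverting this forward model: finding K1 to k4 such that the simulated tissue curve matches the measured time-activity curve.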

[115] A Spatially-Aware Multiple Instance Learning Framework for Digital Pathology

Hassan Keshvarikhojasteh,Mihail Tifrea,Sibylle Hess,Josien P. W. Pluim,Mitko Veta

Main category: eess.IV

TL;DR: GABMIL improves on ABMIL by explicitly modeling dependencies between instances while preserving computational efficiency.

Details

Motivation: Conventional MIL methods such as ABMIL disregard the spatial interactions among patches that are crucial to pathological diagnosis, while TransMIL adds spatial context at the cost of high computational complexity. Method: Interaction-aware representations are integrated into the ABMIL framework, yielding GABMIL, which explicitly captures inter-instance dependencies. Result: On breast and lung cancer subtyping, GABMIL improves over ABMIL by up to 7 percentage points in AUPRC and 5 percentage points in Kappa score. Conclusion: Explicitly modeling patch interactions within MIL frameworks is important for performance and need not add significant computational cost. Abstract: Multiple instance learning (MIL) is a promising approach for weakly supervised classification in pathology using whole slide images (WSIs). However, conventional MIL methods such as Attention-Based Deep Multiple Instance Learning (ABMIL) typically disregard spatial interactions among patches that are crucial to pathological diagnosis. Recent advancements, such as Transformer based MIL (TransMIL), have incorporated spatial context and inter-patch relationships. However, it remains unclear whether explicitly modeling patch relationships yields similar performance gains in ABMIL, which relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs Transformer-based layers, introducing a fundamental architectural shift at the cost of substantially increased computational complexity. In this work, we enhance the ABMIL framework by integrating interaction-aware representations to address this question. Our proposed model, Global ABMIL (GABMIL), explicitly captures inter-instance dependencies while preserving computational efficiency. Experimental results on two publicly available datasets for tumor subtyping in breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage point improvement in AUPRC and a 5 percentage point increase in the Kappa score over ABMIL, with minimal or no additional computational overhead. These findings underscore the importance of incorporating patch interactions within MIL frameworks.
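To make concrete why plain ABMIL ignores spatial arrangement, the following is a minimal sketch of attention-based MIL pooling in its commonly cited (non-gated) form: a weight a_k = softmax(wᵀ tanh(Vᵀh_k)) per patch and a bag embedding z = Σ a_k h_k. It does not implement GABMIL's interaction module, whose details the abstract does not give. The function name `abmil_pool`, the shapes, and the random weights are illustrative assumptions.

```python
import numpy as np

def abmil_pool(H, V, w):
    """Attention-based MIL pooling over a bag of patch embeddings.

    H : (n_patches, d) patch embeddings
    V : (d, hidden) and w : (hidden,) are the attention MLP weights
    Returns the bag embedding z (d,) and attention weights a (n_patches,).
    """
    scores = np.tanh(H @ V) @ w      # one scalar score per patch
    scores = scores - scores.max()   # numerical stability for the softmax
    a = np.exp(scores)
    a = a / a.sum()
    return a @ H, a

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))          # 6 hypothetical patches, 4-dim features
V = rng.normal(size=(4, 3))
w = rng.normal(size=3)

z, a = abmil_pool(H, V, w)
z_shuffled, _ = abmil_pool(H[::-1], V, w)  # same patches, reversed order
print(np.allclose(z, z_shuffled))
```

Each patch is scored independently of the others, so reordering the patches leaves the bag embedding unchanged. That permutation invariance is the property the abstract refers to when it says ABMIL disregards interactions among patches.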

[116] Beyond Labels: Zero-Shot Diabetic Foot Ulcer Wound Segmentation with Self-attention Diffusion Models and the Potential for Text-Guided Customization

Abderrachid Hamrani,Daniela Leizaola,Renato Sousa,Jose P. Ponce,Stanley Mathis,David G. Armstrong,Anuradha Godavarty

Main category: eess.IV

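TL;DR: ADZUS is a text-guided diffusion model that segments diabetic foot ulcer wounds in a zero-shot manner, without labeled training data, and outperforms conventional methods.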

Details

Motivation: Precise assessment of diabetic foot ulcers is essential for improving patient outcomes; conventional deep learning methods depend on extensive annotation, and ADZUS offers a more flexible alternative through zero-shot learning. Method: ADZUS uses a text-guided diffusion model that adapts the segmentation dynamically based on descriptive prompts, with no labeled training data. Result: On the chronic wound dataset, ADZUS reaches an IoU of 86.68% and a precision of 94.69%, outperforming FUSegNet; on a DFU dataset it achieves a median DSC of 75%, well above FUSegNet's 45%. Conclusion: ADZUS offers a scalable, adaptable AI solution for medical imaging, though the computational cost of diffusion-based inference and the potential need for fine-tuning remain areas for improvement. Abstract: Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare, requiring precise and efficient wound assessment to enhance patient outcomes. This study introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel text-guided diffusion model that performs wound segmentation without relying on labeled training data. Unlike conventional deep learning models, which require extensive annotation, ADZUS leverages zero-shot learning to dynamically adapt segmentation based on descriptive prompts, offering enhanced flexibility and adaptability in clinical applications. Experimental evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art segmentation models, achieving an IoU of 86.68% and the highest precision of 94.69% on the chronic wound dataset, outperforming supervised approaches such as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its robustness, with ADZUS achieving a median DSC of 75%, significantly surpassing FUSegNet's 45%. The model's text-guided segmentation capability enables real-time customization of segmentation outputs, allowing targeted analysis of wound characteristics based on clinical descriptions. Despite its competitive performance, the computational cost of diffusion-based inference and the need for potential fine-tuning remain areas for future improvement. ADZUS represents a transformative step in wound segmentation, providing a scalable, efficient, and adaptable AI-driven solution for medical imaging.
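The abstract reports both IoU and DSC, which are related but not interchangeable. As a reminder of how the two are computed on binary masks, here is a small sketch using the standard definitions; the function name `iou_and_dice` and the toy masks are illustrative and unrelated to the paper's data.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Intersection-over-Union and Dice similarity for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention assumed here: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

# Toy 2x3 masks: 2 overlapping pixels, 3 predicted, 3 in the reference.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. The figures in the abstract come from different datasets (and the DSC is a median), so they cannot be converted into one another this way.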

cs.LG [Back]

[117] (Im)possibility of Automated Hallucination Detection in Large Language Models

Amin Karbasi,Omar Montasser,John Sous,Grigoris Velegkas

Main category: cs.LG

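TL;DR: The paper examines whether hallucinations of large language models (LLMs) can be detected automatically. It introduces a theoretical framework and proves that detection is fundamentally impossible for most language collections when the detector is trained only on correct examples, but becomes possible once expert-labeled negative examples are added.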

Details

Motivation: To study the feasibility of automatically detecting LLM hallucinations and provide theoretical grounding for reliable LLM deployment. Method: Building on the Gold-Angluin framework, the paper establishes an equivalence between hallucination detection and language identification, and analyzes how the training data (correct examples only versus expert-labeled positive and negative examples) affects what a detector can achieve. Result: With correct examples only, hallucination detection is impossible for most language collections; with negative examples added, detection becomes possible for all countable language collections. Conclusion: Expert-labeled negative examples are essential for hallucination detection, which lends theoretical support to feedback-based methods such as RLHF. Abstract: Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to language generation by Kleinberg and Mullainathan, we investigate whether an algorithm, trained on examples drawn from an unknown target language $K$ (selected from a countable collection) and given access to an LLM, can reliably determine whether the LLM's outputs are correct or constitute hallucinations. First, we establish an equivalence between hallucination detection and the classical task of language identification. We prove that any hallucination detection method can be converted into a language identification method, and conversely, algorithms solving language identification can be adapted for hallucination detection. Given the inherent difficulty of language identification, this implies that hallucination detection is fundamentally impossible for most language collections if the detector is trained using only correct examples from the target language. Second, we show that the use of expert-labeled feedback, i.e., training the detector with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements), dramatically changes this conclusion. Under this enriched training regime, automated hallucination detection becomes possible for all countable language collections. These results highlight the essential role of expert-labeled examples in training hallucination detectors and provide theoretical support for feedback-based methods, such as reinforcement learning with human feedback (RLHF), which have proven critical for reliable LLM deployment.

[118] HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models

Jun Zhang,Jue Wang,Huan Li,Lidan Shou,Ke Chen,Gang Chen,Qin Xie,Guiming Xie,Xuejian Gong

Main category: cs.LG

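TL;DR: HMI is a multi-tenant inference system based on hierarchical knowledge management, designed to serve tenants with distinct PLMs resource-efficiently. Through hierarchical knowledge management and system optimizations, HMI serves up to 10,000 hPLMs on a single GPU with negligible loss in accuracy.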

Details

Motivation: The high computational demands of pretrained language models (PLMs) make them hard to serve efficiently in multi-tenant environments; HMI targets this problem. Method: 1. PLM knowledge is categorized as general, domain-specific, or task-specific, and hierarchical PLMs (hPLMs) are constructed to reduce GPU memory usage per tenant. 2. Domain-specific knowledge is managed through frequency-based construction and updating of domain-specific knowledge trees, and task-specific knowledge through parameter swapping. 3. System optimizations include hierarchical knowledge prefetching and parallel implementations with batched matrix multiplications. Result: Experiments show that HMI serves up to 10,000 hPLMs (hBERTs and hGPTs) on a single GPU with only a negligible loss in accuracy. Conclusion: Hierarchical knowledge management and system optimizations substantially improve resource efficiency and inference throughput for PLMs in multi-tenant environments. Abstract: The significant computational demands of pretrained language models (PLMs), which often require dedicated hardware, present a substantial challenge in serving them efficiently, especially in multi-tenant environments. To address this, we introduce HMI, a Hierarchical knowledge management-based Multi-tenant Inference system, designed to manage tenants with distinct PLMs resource-efficiently. Our approach is three-fold: Firstly, we categorize PLM knowledge into general, domain-specific, and task-specific. Leveraging insights on knowledge acquisition across different model layers, we construct hierarchical PLMs (hPLMs) by extracting and storing knowledge at different levels, significantly reducing GPU memory usage per tenant. Secondly, we establish hierarchical knowledge management for hPLMs generated by various tenants in HMI. We manage domain-specific knowledge with acceptable storage increases by constructing and updating domain-specific knowledge trees based on frequency. We manage task-specific knowledge within limited GPU memory through parameter swapping. Finally, we propose system optimizations to enhance resource utilization and inference throughput. These include fine-grained pipelining via hierarchical knowledge prefetching to overlap CPU and I/O operations with GPU computations, and optimizing parallel implementations with batched matrix multiplications. Our experimental results demonstrate that the proposed HMI can efficiently serve up to 10,000 hPLMs (hBERTs and hGPTs) on a single GPU, with only a negligible compromise in accuracy.

[119] Unsupervised Time-Series Signal Analysis with Autoencoders and Vision Transformers: A Review of Architectures and Applications

Hossein Ahmadi,Sajjad Emdadi Mahdimahalleh,Arman Farahat,Banafsheh Saffari

Main category: cs.LG

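TL;DR: This review surveys autoencoders and vision transformers for unsupervised signal analysis, covering architectures, applications, and emerging trends, and identifies challenges in interpretability, scalability, and domain generalization.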

Details

Motivation: The rapid growth of unlabeled time-series data in wireless communications, radar, biomedical engineering, and the Internet of Things has driven advances in unsupervised learning. Method: The review examines how autoencoders and vision transformers enable feature extraction, anomaly detection, and classification, and considers the strengths of hybrid architectures and self-supervised learning. Result: The review shows the potential of these models across signal types such as electrocardiograms, radar waveforms, and IoT sensor data, and identifies limitations of current techniques. Conclusion: The work offers a roadmap for developing robust, adaptive models for signal intelligence by bridging methodological innovation and practical application. Abstract: The rapid growth of unlabeled time-series data in domains such as wireless communications, radar, biomedical engineering, and the Internet of Things (IoT) has driven advancements in unsupervised learning. This review synthesizes recent progress in applying autoencoders and vision transformers for unsupervised signal analysis, focusing on their architectures, applications, and emerging trends. We explore how these models enable feature extraction, anomaly detection, and classification across diverse signal types, including electrocardiograms, radar waveforms, and IoT sensor data. The review highlights the strengths of hybrid architectures and self-supervised learning, while identifying challenges in interpretability, scalability, and domain generalization. By bridging methodological innovations and practical applications, this work offers a roadmap for developing robust, adaptive models for signal intelligence.

[120] OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection

Alberto Fernández-Hernández,Jose I. Mestre,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí

Main category: cs.LG

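TL;DR: OUI is a new tool for monitoring DNN training dynamics and tuning regularization hyperparameters; it indicates overfitting or underfitting without requiring validation data.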

Details

Motivation: Conventional hyperparameter tuning relies on validation data and is slow; OUI aims to enable faster and more precise hyperparameter selection. Method: OUI is validated experimentally on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet, and ResNet-34 with ImageNet-1K. Result: Keeping OUI within a prescribed interval correlates strongly with improved generalization and validation scores, and OUI converges significantly faster than loss or accuracy, so optimal weight decay values can be identified early in training. Conclusion: OUI is an efficient indicator for selecting regularization hyperparameters and improving model performance. Abstract: We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at https://github.com/AlbertoFdezHdez/OUI.

[121] Group Downsampling with Equivariant Anti-aliasing

Md Ashiqur Rahman,Raymond A. Yeh

Main category: cs.LG

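TL;DR: The paper studies how to generalize the uniform downsampling layer to group-equivariant architectures such as G-CNNs, and proposes a downsampling method on finite groups with anti-aliasing.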

Details

Motivation: Downsampling layers are key building blocks of CNN architectures, but existing methods do not readily carry over to group-equivariant architectures; this paper addresses that gap. Method: An algorithm selects a suitable subgroup for a given finite group and downsampling rate, and a notion of bandlimited-ness with a matching anti-aliasing procedure is developed, generalizing downsampling from classical sampling theory. Result: In image classification experiments, the method improves accuracy, better preserves equivariance, and reduces model size. Conclusion: The method generalizes the downsampling operation to group-equivariant architectures and is useful in practice. Abstract: Downsampling layers are crucial building blocks in CNN architectures, which help to increase the receptive field for learning high-level features and reduce the amount of memory/computation in the model. In this work, we study the generalization of the uniform downsampling layer for group equivariant architectures, e.g., G-CNNs. That is, we aim to downsample signals (feature maps) on general finite groups with anti-aliasing. This involves the following: (a) Given a finite group and a downsampling rate, we present an algorithm to form a suitable choice of subgroup. (b) Given a group and a subgroup, we study the notion of bandlimited-ness and propose how to perform anti-aliasing. Notably, our method generalizes the notion of downsampling based on classical sampling theory. When the signal is on a cyclic group, i.e., periodic, our method recovers the standard downsampling of an ideal low-pass filter followed by a subsampling operation. Finally, we conducted experiments on image classification tasks demonstrating that the proposed downsampling operation improves accuracy, better preserves equivariance, and reduces model size when incorporated into G-equivariant networks.
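The abstract notes that on a cyclic group the method reduces to an ideal low-pass filter followed by subsampling. The sketch below illustrates only that classical special case for a periodic 1-D signal; it does not implement the paper's subgroup-selection algorithm or its anti-aliasing for general finite groups. The function name `downsample_cyclic` and the test signals are illustrative assumptions.

```python
import numpy as np

def downsample_cyclic(x, rate, anti_alias=True):
    """Downsample a periodic signal on Z_n to Z_(n/rate).

    With anti_alias=True, frequencies that cannot be represented on the
    coarser grid are zeroed (ideal low-pass) before keeping every rate-th sample.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n % rate != 0:
        raise ValueError("signal length must be divisible by the rate")
    if anti_alias:
        spectrum = np.fft.fft(x)
        freqs = np.fft.fftfreq(n, d=1.0 / n)       # integer frequencies
        keep = np.abs(freqs) < (n // rate) / 2     # below the new Nyquist limit
        spectrum[~keep] = 0.0
        x = np.fft.ifft(spectrum).real
    return x[::rate]

n = 16
t = np.arange(n)
low = np.cos(2 * np.pi * 1 * t / n)    # representable after downsampling by 2
high = np.cos(2 * np.pi * 7 * t / n)   # above the new Nyquist limit

aliased = downsample_cyclic(high, 2, anti_alias=False)
filtered = downsample_cyclic(high, 2)
print(np.allclose(aliased, low[::2]), np.allclose(filtered, 0.0))
```

Without the filter, the frequency-7 component is indistinguishable from the frequency-1 component after subsampling, which is exactly the aliasing the low-pass step removes. Because the filter is a circular convolution, shifting the input by `rate` samples shifts the output by one sample, which is the equivariance property the paper aims to preserve in the general group setting.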

[122] Class-Conditional Distribution Balancing for Group Robust Classification

Miaoyun Zhao,Qiang Zhang,Chenrong Li

Main category: cs.LG

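TL;DR: The paper proposes a robust learning method that needs neither bias annotations nor bias predictions; it reweights samples to balance class-conditional distributions and remove spurious correlations.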

Details

Motivation: Spurious correlations lead models to make predictions for the wrong reasons; existing methods rely on expensive bias annotations or large pretrained models and are hard to apply in resource-limited domains. Method: To reduce the mutual information between spurious factors and label information, a sample reweighting strategy balances the class-conditional distributions and automatically highlights minority groups and classes. Result: Experiments show strong performance that rivals methods relying on bias supervision. Conclusion: The method is simple and effective, requires no extra annotations or data, suits resource-limited settings, and improves model robustness. Abstract: Spurious correlations that lead models to correct predictions for the wrong reasons pose a critical challenge for robust real-world generalization. Existing research attributes this issue to group imbalance and addresses it by maximizing group-balanced or worst-group accuracy, which heavily relies on expensive bias annotations. A compromise approach involves predicting bias information using extensively pretrained foundation models, which requires large-scale data and becomes impractical for resource-limited rare domains. To address these challenges, we offer a novel perspective by reframing the spurious correlations as imbalances or mismatches in class-conditional distributions, and propose a simple yet effective robust learning method that eliminates the need for both bias annotations and predictions. With the goal of reducing the mutual information between spurious factors and label information, our method leverages a sample reweighting strategy to achieve class-conditional distribution balancing, which automatically highlights minority groups and classes, effectively dismantling spurious correlations and producing a debiased data distribution for classification. Extensive experiments and analysis demonstrate that our approach consistently delivers state-of-the-art performance, rivaling methods that rely on bias supervision.

[123] The effects of Hessian eigenvalue spectral density type on the applicability of Hessian analysis to generalization capability assessment of neural networks

Nikita Gabdullin

Main category: cs.LG

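TL;DR: The paper studies types of Hessian eigenvalue spectral density (HESD) in neural networks and their effect on generalization assessment, proposes conditions for determining the HESD type, and unifies them into a single HESD analysis methodology.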

Details

Motivation: To investigate how HESD behavior affects the assessment of neural network generalization, and how external gradient manipulation affects the HESD type. Method: Experiments analyze HESD types across optimizers, datasets, and preprocessing procedures, and criteria for determining the type are combined into a unified analysis methodology. Result: HESD is found to be either mainly positive (MP-HESD) or mainly negative (MN-HESD), and quasi-singular HESD is shown to occur and to influence the methodology. Conclusion: The paper proposes a unified HESD analysis methodology and discusses how HESD changes during training and how this challenges conventional assumptions. Abstract: Hessians of neural networks (NNs) contain essential information about the curvature of NN loss landscapes which can be used to estimate NN generalization capabilities. We have previously proposed generalization criteria that rely on the observation that Hessian eigenvalue spectral density (HESD) behaves similarly for a wide class of NNs. This paper further studies their applicability by investigating factors that can result in different types of HESD. We conduct a wide range of experiments showing that HESD mainly has positive eigenvalues (MP-HESD) for NN training and fine-tuning with various optimizers on different datasets with different preprocessing and augmentation procedures. We also show that mainly negative HESD (MN-HESD) is a consequence of external gradient manipulation, indicating that the previously proposed Hessian analysis methodology cannot be applied in such cases. We also propose criteria and corresponding conditions to determine HESD type and estimate NN generalization potential. These HESD types and previously proposed generalization criteria are combined into a unified HESD analysis methodology. Finally, we discuss how HESD changes during training, and show the occurrence of quasi-singular (QS) HESD and its influence on the proposed methodology and on the conventional assumptions about the relation between Hessian eigenvalues and NN loss landscape curvature.

[124] Aerial Image Classification in Scarce and Unconstrained Environments via Conformal Prediction

Farhad Pourkamali-Anaraki

Main category: cs.LG

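TL;DR: This paper presents a comprehensive empirical analysis of conformal prediction on a challenging real-world aerial image dataset, examining its effectiveness in data-scarce, highly variable settings.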

Details

Motivation: The study evaluates how conformal prediction methods perform outside standard benchmarks (for example, in data-scarce, highly variable real-world settings) and explores how pretrained models and calibration techniques affect predictive performance.

Method: Pretrained models (MobileNet, DenseNet, ResNet) are fine-tuned and combined with two calibration pipelines (with and without temperature scaling); performance is assessed by coverage and prediction set size.

Result: Even with small samples and simple nonconformity scores, conformal prediction yields useful prediction sets; temperature scaling, although commonly used, does not necessarily shrink the prediction sets.

Conclusion: Future research should examine the impact of noisy labels on conformal prediction and explore model compression techniques for resource-constrained environments.

Abstract: This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.
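As a rough illustration of the evaluation described in this entry, the sketch below implements split conformal prediction with the simple nonconformity score 1 - p(true label), an optional temperature applied to the logits, and the two metrics named in the abstract (empirical coverage and average prediction set size). The function names, the toy logits, and the choice of score are assumptions made for illustration; the abstract does not specify which scores or settings the authors used.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_threshold(cal_probs, cal_labels, alpha):
    """Return the conformal quantile of the scores 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every label.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs, q_hat):
    """Keep every label whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]


def evaluate(test_probs, test_labels, q_hat):
    """Return (empirical coverage, average prediction set size)."""
    sets = [prediction_set(p, q_hat) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    return coverage, avg_size


# Toy logits for a 3-class problem (made-up numbers, not from the paper).
cal_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.4, 2.2], [1.0, 0.9, 0.2]]
cal_labels = [0, 1, 2, 1]
test_logits = [[1.8, 0.3, 0.2], [0.5, 0.6, 0.4]]
test_labels = [0, 2]

# Two parallel pipelines: without (T=1.0) and with (T=2.0) temperature scaling.
for temperature in (1.0, 2.0):
    cal_probs = [softmax(z, temperature) for z in cal_logits]
    test_probs = [softmax(z, temperature) for z in test_logits]
    q_hat = calibrate_threshold(cal_probs, cal_labels, alpha=0.25)
    coverage, avg_size = evaluate(test_probs, test_labels, q_hat)
    print(f"T={temperature}: coverage={coverage:.2f}, avg set size={avg_size:.2f}")
```

In practice the temperature would be fitted on held-out data rather than fixed by hand; the fixed value here only shows where it enters the pipeline.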

cs.MM [Back]

[125] Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness

Yusheng Zhao,Junyu Luo,Xiao Luo,Weizhi Zhang,Zhiping Xiao,Wei Ju,Philip S. Yu,Ming Zhang

Main category: cs.MM

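TL;DR: This paper evaluates the audio-visual capabilities of multi-modal large language models (MLLMs) along multiple dimensions, finding that they perform well on zero-shot and few-shot tasks but depend heavily on the vision modality and are vulnerable to adversarial samples.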

Details

Motivation: Comprehensive evaluations of the audio-visual capabilities of MLLMs are lacking, especially in diverse scenarios such as distribution shifts and adversarial attacks.

Method: MLLMs are evaluated experimentally along four key dimensions: effectiveness, efficiency, generalizability, and robustness.

Result: MLLMs perform well on zero-shot and few-shot tasks, but they depend heavily on the vision modality and are vulnerable to adversarial samples.

Conclusion: The findings characterize the audio-visual capabilities of MLLMs and offer guidance for future improvement and research.

Abstract: Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.

cs.SE [Back]

[126] SCALAR: A Part-of-speech Tagger for Identifiers

Christian D. Newman,Brandon Scholten,Sophia Testa,Joshua A. C. Behler,Syreen Banabilah,Michael L. Collard,Michael J. Decker,Mohamed Wiem Mkaouer,Marcos Zampieri,Eman Abdullah AlOmar,Reem Alsuhaibani,Anthony Peruma,Jonathan I. Maletic

Main category: cs.SE

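TL;DR: SCALAR is a tool that maps source code identifier names to their corresponding part-of-speech tag sequences, using a trained model to improve tagging quality.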

Details

Motivation: Developers use natural language with a unique structure, and existing part-of-speech taggers perform poorly on source code identifiers, so a specialized tool is needed.

Method: A model is trained with scikit-learn's GradientBoostingClassifier on a manually curated dataset of identifier names and their grammar patterns.

Result: SCALAR annotates identifiers better than both the previous version of the tagger and a modern general-purpose part-of-speech tagger.

Conclusion: SCALAR provides a more accurate solution for part-of-speech tagging of source code identifiers; the code is open source.

Abstract: The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.

q-bio.QM [Back]

[127] Automating tumor-infiltrating lymphocyte assessment in breast cancer histopathology images using QuPath: a transparent and accessible machine learning pipeline

Masoud Tafavvoghi,Lars Ailo Bongo,André Berli Delgado,Nikita Shvetsov,Anders Sildnes,Line Moi,Lill-Tove Rasmussen Busund,Kajsa Møllersen

Main category: q-bio.QM

TL;DR: The authors built an end-to-end tumor-infiltrating lymphocyte (TIL) assessment pipeline in QuPath that automates the analysis using existing tools.

Details

Motivation: To explore whether easily accessible tools can perform complex tasks, and to provide a practical solution for TIL assessment in H&E-stained whole-slide images of breast cancer.

Method: (1) Train a pixel classifier to segment tumor and stroma; (2) detect cells with a pre-trained StarDist model and train a binary classifier to distinguish TILs from other cells; (3) compute the TIL density and categorize it.

Result: Against pathologist-assigned scores, the pipeline achieves a Cohen's kappa of 0.71, supporting its validity.

Conclusion: Existing software can offer a practical solution for TIL assessment in H&E-stained breast cancer slides.

Abstract: In this study, we built an end-to-end tumor-infiltrating lymphocytes (TILs) assessment pipeline within QuPath, demonstrating the potential of easily accessible tools to perform complex tasks in a fully automatic fashion. First, we trained a pixel classifier to segment tumor, tumor-associated stroma, and other tissue compartments in breast cancer H&E-stained whole-slide images (WSI) to isolate tumor-associated stroma for subsequent analysis. Next, we applied a pre-trained StarDist deep learning model in QuPath for cell detection and used the extracted cell features to train a binary classifier distinguishing TILs from other cells. To evaluate our TILs assessment pipeline, we calculated the TIL density in each WSI and categorized them as low, medium, or high TIL levels. Our pipeline was evaluated against pathologist-assigned TIL scores, achieving a Cohen's kappa of 0.71 on the external test set, corroborating previous research findings. These results confirm that existing software can offer a practical solution for the assessment of TILs in H&E-stained WSIs of breast cancer.
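The last evaluation step in this entry (binning TIL density into low, medium, or high and comparing against pathologist scores with Cohen's kappa) can be sketched as follows. The cut-off parameters `low_cut` and `high_cut`, the function names, and the example labels are hypothetical, and the abstract does not say whether a weighted kappa was used, so this sketch assumes the plain unweighted statistic.

```python
from collections import Counter


def til_category(density, low_cut, high_cut):
    """Bin a TIL density into low/medium/high using hypothetical cut-offs."""
    if density < low_cut:
        return "low"
    if density < high_cut:
        return "medium"
    return "high"


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up example: pipeline categories versus pathologist categories.
pipeline = ["low", "low", "medium", "high"]
pathologist = ["low", "medium", "medium", "high"]
print(f"kappa = {cohens_kappa(pipeline, pathologist):.3f}")
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it is preferred over plain accuracy when the categories are unevenly distributed.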

q-bio.NC [Back]

[128] Can deep neural networks learn biological vision?

Drew Linsley,Pinyuan Feng,Thomas Serre

Main category: q-bio.NC

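TL;DR: The early trend of DNNs aligning with primate neural responses has reversed, and modern DNNs rely on different visual features. Future models of biological vision need to break from AI and adopt training methods closer to human vision.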

Details

Motivation: To examine why the alignment between DNNs and biological vision has reversed, and to argue that future models must stay closer to the mechanisms of human vision.

Method: Analyzes the differences between DNNs and primate neural responses and proposes directions for improvement.

Result: Modern DNNs rely on different visual features and diverge from biological vision.

Conclusion: Future models of biological vision should use training data and methods closer to those that shape human vision.

Abstract: Deep neural networks (DNNs) once showed increasing alignment with primate neural responses as they improved on computer vision benchmarks. This trend raised the exciting possibility that better models of biological vision would come as a byproduct of the deep learning revolution in artificial intelligence. However, the trend has reversed over recent years as DNNs have scaled to human or superhuman recognition accuracy, a divergence that may stem from modern DNNs learning to rely on different visual features than primates to solve tasks. Where will better computational models of biological vision come from? We propose that vision science must break from artificial intelligence to develop algorithms that are designed with biological visual systems in mind instead of internet data benchmarks. We predict that the next generation of deep learning models of biological vision will be trained with data diets, training routines, and objectives that are closer to those that shape human vision than those that are in use today.

cs.AI [Back]

[129] A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions

Emre Can Acikgoz,Cheng Qian,Hongru Wang,Vardhan Dongre,Xiusi Chen,Heng Ji,Dilek Hakkani-Tür,Gokhan Tur

Main category: cs.AI

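TL;DR: This survey reviews the current state, challenges, and future directions of conversational agents based on large language models (LLMs), proposes a taxonomy, and identifies key research gaps.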

Details

Motivation: To examine the capabilities, limitations, and paths forward of LLM-driven conversational agents, and to guide the development of more scalable systems that approach human-level intelligence.

Method: Organizes agent capabilities into three dimensions (reasoning, monitoring, and control), proposes a new taxonomy, and systematically analyzes existing work.

Result: Identifies research gaps, including realistic evaluation, long-term multi-turn reasoning, and self-evolution capabilities, and outlines future research directions.

Conclusion: The work provides a structured foundation for research on conversational agents and aims to advance progress toward Artificial General Intelligence (AGI).

Abstract: Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence. To that end, we systematically analyze LLM-driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning - logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor - encompassing self-awareness and user interaction monitoring, and (iii) Control - focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long-term multi-turn reasoning skills, self-evolution capabilities, collaborative and multi-agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI). We maintain a curated repository of papers at: https://github.com/emrecanacikgoz/awesome-conversational-agents.

[130] AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion Models

Mohammad Zarei,Melanie A Jutras,Eliana Evans,Mike Tan,Omid Aaramoon

Main category: cs.AI

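TL;DR: The paper proposes a new approach that uses generative and explainable AI techniques to address rare failure modes (RFMs) in autonomous vehicles, generating diverse environment images and natural language descriptions to improve system robustness and reliability.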

Details

Motivation: Autonomous vehicles (AVs) rely on AI to detect objects but struggle to identify rare failure modes (RFMs), a problem known as the "long-tail challenge". This paper aims to address that problem.

Method: Object segmentation masks are extracted and inverted to produce environmental masks, which are fed together with text prompts into a custom diffusion model; a Stable Diffusion inpainting model guided by adversarial noise optimization generates diverse images that expose vulnerabilities in AI systems.

Result: The method generates images containing RFMs together with natural language descriptions, helping developers and policymakers improve the safety and reliability of AV systems.

Conclusion: The approach improves the ability of autonomous driving systems to identify rare failure modes and offers a new direction for the robustness and reliability of AI systems.

Abstract: Autonomous Vehicles (AVs) rely on artificial intelligence (AI) to accurately detect objects and interpret their surroundings. However, even when trained using millions of miles of real-world data, AVs are often unable to detect rare failure modes (RFMs). The problem of RFMs is commonly referred to as the "long-tail challenge", due to the distribution of data including many instances that are very rarely seen. In this paper, we present a novel approach that utilizes advanced generative and explainable AI techniques to aid in understanding RFMs. Our methods can be used to enhance the robustness and reliability of AVs when combined with both downstream model training and testing. We extract segmentation masks for objects of interest (e.g., cars) and invert them to create environmental masks. These masks, combined with carefully crafted text prompts, are fed into a custom diffusion model. We leverage the Stable Diffusion inpainting model guided by adversarial noise optimization to generate images containing diverse environments designed to evade object detection models and expose vulnerabilities in AI systems. Finally, we produce natural language descriptions of the generated RFMs that can guide developers and policymakers to improve the safety and reliability of AV systems.
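The mask-inversion step mentioned in this entry can be illustrated with a minimal sketch. It assumes binary masks stored as nested lists, with 1 marking object pixels in the input, and it assumes the common inpainting convention that pixels set to 1 in the mask passed to the model are the ones regenerated; neither detail is confirmed by the abstract, and the function names are hypothetical.

```python
def invert_mask(object_mask):
    """Turn an object mask (1 = object pixel) into an environment mask (1 = background pixel)."""
    return [[1 - pixel for pixel in row] for row in object_mask]


def masked_fraction(mask):
    """Fraction of pixels set to 1, i.e. the share of the image the mask selects."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Made-up 3x4 segmentation mask with a small "car" region in the middle.
car_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
environment_mask = invert_mask(car_mask)
print(f"environment covers {masked_fraction(environment_mask):.0%} of the image")
```

Under the stated convention, passing the environment mask to an inpainting model would keep the object fixed and let the model vary only its surroundings, which matches the entry's description of generating diverse environments around objects of interest.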

Table of Contents

cs.CV [Back]

[1] Dense Air Pollution Estimation from Sparse in-situ Measurements and Satellite Data

[2] DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

[3] PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation

[4] Distilling semantically aware orders for autoregressive image generation

[5] Scene-Aware Location Modeling for Data Augmentation in Automotive Object Detection

[6] Transferring Spatial Filters via Tangent Space Alignment in Motor Imagery BCIs

[7] Latent Video Dataset Distillation

[8] A Comprehensive Review on RNA Subcellular Localization Prediction

[9] PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition

[10] A Genealogy of Multi-Sensor Foundation Models in Remote Sensing

[11] We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

[12] Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

[13] MCAF: Efficient Agent-based Video Understanding Framework through Multimodal Coarse-to-Fine Attention Focusing

[14] Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion

[15] Visual and textual prompts for enhancing emotion recognition in video

[16] Range Image-Based Implicit Neural Compression for LiDAR Point Clouds

[17] Scene Perceived Image Perceptual Score (SPIPS): combining global and local perception for image quality assessment

[18] DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks

[19] Precision Neural Network Quantization via Learnable Adaptive Modules

[20] Towards Generalized and Training-Free Text-Guided Semantic Manipulation

[21] EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy

[22] Advanced Segmentation of Diabetic Retinopathy Lesions Using DeepLabv3+

[23] DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model

[24] TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

[25] DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition

[26] I-INR: Iterative Implicit Neural Representations

[27] TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

[28] Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

[29] SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting

[30] Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models

[31] S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception

[32] StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies

[33] 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models

[34] Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

[35] Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding

[36] FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding

[37] Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks

[38] Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled yet Hard-to-Learn Samples in Noisy Data

[39] RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

[40] Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation

[41] Towards One-Stage End-to-End Table Structure Recognition with Parallel Regression for Diverse Scenarios

[42] ESDiff: Encoding Strategy-inspired Diffusion Model with Few-shot Learning for Color Image Inpainting

[43] Text-to-Image Alignment in Denoising-Based Models through Step Selection

[44] An Explainable Nature-Inspired Framework for Monkeypox Diagnosis: Xception Features Combined with NGBoost and African Vultures Optimization Algorithm

[45] When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field Rendering

[46] A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task

[47] Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior

[48] Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images

[49] Tamper-evident Image using JPEG Fixed Points

[50] RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network

[51] STCL:Curriculum learning Strategies for deep learning image steganography models

[52] Enhancing CNNs robustness to occlusions with bioinspired filters for border completion

[53] Improving Open-World Object Localization by Discovering Background

[54] A Guide to Structureless Visual Localization

[55] CLIPSE -- a minimalistic CLIP-based image search engine for research

[56] DiMeR: Disentangled Mesh Reconstruction Model

[57] PICO: Reconstructing 3D People In Contact with Objects

[58] Hierarchical and Multimodal Data for Daily Activity Understanding

[59] Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields

[60] DPMambaIR:All-in-One Image Restoration via Degradation-Aware Prompt State Space Model

[61] EgoCHARM: Resource-Efficient Hierarchical Activity Recognition using an Egocentric IMU Sensor

[62] Step1X-Edit: A Practical Framework for General Image Editing

[63] The Fourth Monocular Depth Estimation Challenge

[64] Dynamic Camera Poses and Where to Find Them

[65] Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

[66] LiDPM: Rethinking Point Diffusion for Lidar Scene Completion

cs.GR [Back]

[67] ePBR: Extended PBR Materials in Image Synthesis

[68] Bolt: Clothing Virtual Characters at Scale

[69] CasualHDRSplat: Robust High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos

cs.CL [Back]

[70] Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity

[71] Tokenization Matters: Improving Zero-Shot NER for Indic Languages

[72] Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

[73] Do Words Reflect Beliefs? Evaluating Belief Depth in Large Language Models

[74] Agree to Disagree? A Meta-Evaluation of LLM Misgendering

[75] How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study

[76] Co-CoT: A Prompt-Based Framework for Collaborative Chain-of-Thought Reasoning