cs.CV [Total: 97]
cs.GR [Total: 7]
cs.CL [Total: 79]
cs.AI [Total: 6]
eess.AS [Total: 1]
cs.DB [Total: 1]
cs.CR [Total: 1]
q-bio.BM [Total: 1]
eess.IV [Total: 7]
q-bio.NC [Total: 1]
cs.HC [Total: 2]
cs.RO [Total: 1]
math.NA [Total: 1]
cs.LG [Total: 18]
cs.CY [Total: 1]
cs.SD [Total: 1]
q-fin.ST [Total: 1]
cs.IR [Total: 1]

cs.CV

[1] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li,Kaixin Xiong,Xiangyu Guo,Fang Li,Sixu Yan,Gangwei Xu,Lijun Zhou,Long Chen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Xinggang Wang

Main category: cs.CV

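TL;DR: ReCogDrive is an autonomous driving system that combines a vision-language model (VLM) with a diffusion planner. Its three-stage training targets the domain gap, the language-action dimensionality mismatch, and the averaging behavior of imitation learning, and it sets a new state of the art on the NAVSIM benchmark.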

Details

Motivation: End-to-end autonomous driving degrades in rare and long-tail scenarios, and existing VLM-based methods suffer from a domain gap, a dimensionality mismatch between discrete language and continuous actions, and the limits of imitation learning.

Method: Three-stage training: (1) train the VLM on driving question-answering data; (2) train a diffusion planner by imitation learning; (3) fine-tune the planner with reinforcement learning in the NAVSIM non-reactive simulator.

Result: A PDMS of 89.6 on the NAVSIM benchmark, 5.6 PDMS above the previous vision-only SOTA.

Conclusion: Combining a VLM with a diffusion planner substantially improves driving performance in complex scenarios.

Abstract: Although end-to-end autonomous driving has made remarkable progress, its performance degrades significantly in rare and long-tail scenarios. Recent approaches attempt to address this challenge by leveraging the rich world knowledge of Vision-Language Models (VLMs), but these methods suffer from several limitations: (1) a significant domain gap between the pre-training data of VLMs and real-world driving data, (2) a dimensionality mismatch between the discrete language space and the continuous action space, and (3) imitation learning tends to capture the average behavior present in the dataset, which may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an autonomous driving system that integrates VLMs with a diffusion planner and adopts a three-stage training paradigm. In the first stage, we use a large-scale driving question-answering dataset to train the VLMs, mitigating the domain discrepancy between generic content and real-world driving scenarios. In the second stage, we employ a diffusion-based planner to perform imitation learning, mapping representations from the latent language space to continuous driving actions. Finally, we fine-tune the diffusion planner using reinforcement learning with the NAVSIM non-reactive simulator, enabling the model to generate safer, more human-like driving trajectories. We evaluate our approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 89.6 and setting a new state-of-the-art that surpasses the previous vision-only SOTA by 5.6 PDMS.

[2] CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

Aniket Rege,Zinnia Nie,Mahesh Ramesh,Unmesh Raskar,Zhuoran Yu,Aditya Kusupati,Yong Jae Lee,Ramya Korlakai Vinayak

Main category: cs.CV

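TL;DR: The paper introduces CuRe, a benchmark and scoring suite for evaluating the cultural representativeness of text-to-image (T2I) systems, aimed at quantifying their bias against Global South cultures.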

Details

Motivation: T2I systems are trained on data skewed toward American and European cultures and underrepresent the cultures of the Global South; a method is needed to quantify this bias.

Method: CuRe uses the marginal utility of attribute specification as a proxy for human judgment, and builds a hierarchical cultural dataset from the Wikimedia knowledge graph with 300 cultural artifacts across 32 subcategories and six cultural axes.

Result: CuRe scorers correlate strongly with human judgments across a range of image encoders, vision-language models, and T2I systems.

Conclusion: CuRe offers an effective tool for evaluating and improving the cultural diversity of T2I systems; the dataset and code are open-sourced.

Abstract: Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy built from the crowdsourced Wikimedia knowledge graph, with 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes (food, art, fashion, architecture, celebrations, and people). Our dataset's categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing their response to increasing the informativeness of text conditioning, enabling fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0, and DALL-E 3. The code and dataset are open-sourced and available at https://aniketrege.github.io/cure/.

[3] IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation

Oishee Bintey Hoque,Abhijin Adiga,Aniruddha Adiga,Siddharth Chaudhary,Madhav V. Marathe,S. S. Ravi,Kirti Rajagopalan,Amanda Wilson,Samarth Swarup

Main category: cs.CV

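TL;DR: IGraSS is a framework that combines semantic segmentation with graph-based refinement to repair incomplete ground truth and map canal networks accurately.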

Details

Motivation: Semantic segmentation models depend on well-annotated data, and incomplete ground truth hurts their performance. Infrastructure networks such as canals have graph-level properties (for example, reachability) that can be used to refine the ground truth.

Method: IGraSS pairs a semantic segmentation module that takes RGB and additional modalities (NDWI, DEM) with a graph-based ground-truth refinement module, and iterates between the two.

Result: IGraSS reduces unreachable canal segments from about 18% to 3%, and training on the refined ground truth significantly improves canal identification.

Conclusion: IGraSS is a general framework for refining noisy ground truth and mapping infrastructure networks from remote sensing imagery.

Abstract: Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State-of-the-art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well-annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph-level properties such as reachability to a source (like canals) or connectivity (roads) that can be leveraged to improve existing ground truth. This paper develops a novel iterative framework IGraSS, combining a semantic segmentation module, which incorporates RGB and additional modalities (NDWI, DEM), with a graph-based ground-truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire data viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph-theoretic constraint to complete road networks.
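
The reachability property named in the abstract can be illustrated with a minimal sketch. It assumes the canal ground truth has already been turned into an undirected graph of junctions; the function name `unreachable_fraction` and the toy node labels are hypothetical and do not come from the paper, whose actual refinement procedure is not described here.

```python
from collections import deque

def unreachable_fraction(nodes, edges, sources):
    """Fraction of canal nodes that cannot be reached from any water source."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search starting from every source at once.
    seen = {s for s in sources if s in adjacency}
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return (len(nodes) - len(seen)) / len(nodes) if nodes else 0.0

# Toy network (illustrative only): segment c-d is cut off from the source.
nodes = ["src", "a", "b", "c", "d"]
edges = [("src", "a"), ("a", "b"), ("c", "d")]
before = unreachable_fraction(nodes, edges, ["src"])
# Adding the missing link b-c, as a refinement step might, restores reachability.
after = unreachable_fraction(nodes, edges + [("b", "c")], ["src"])
```

A metric of this kind is one plausible way to read the reported drop from around 18% to 3% unreachable segments, though the paper may define segments and sources differently.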

[4] Spectral Domain Neural Reconstruction for Passband FMCW Radars

Harshvardhan Takawale,Nirupam Roy

Main category: cs.CV

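TL;DR: SpINRv2 is a neural framework for high-fidelity volumetric reconstruction with FMCW radar. It improves on its predecessor SpINR by handling the phase aliasing and sub-bin ambiguity that arise at high start frequencies.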

Details

Motivation: At high start frequencies, FMCW radar reconstruction faces phase aliasing and sub-bin ambiguity, which existing methods handle poorly.

Method: A fully differentiable frequency-domain forward model is paired with an implicit neural representation (INR) for continuous volumetric modeling, with added sparsity and smoothness regularization.

Result: SpINRv2 significantly outperforms classical and learning-based baselines, especially in high-frequency regimes, setting a new benchmark for neural radar-based 3D imaging.

Conclusion: Frequency-domain modeling and regularization let SpINRv2 address the challenges of high-frequency radar reconstruction.

Abstract: We present SpINRv2, a neural framework for high-fidelity volumetric reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar. Extending our prior work (SpINR), this version introduces enhancements that allow accurate learning under high start frequencies, where phase aliasing and sub-bin ambiguity become prominent. Our core contribution is a fully differentiable frequency-domain forward model that captures the complex radar response using closed-form synthesis, paired with an implicit neural representation (INR) for continuous volumetric scene modeling. Unlike time-domain baselines, SpINRv2 directly supervises the complex frequency spectrum, preserving spectral fidelity while drastically reducing computational overhead. Additionally, we introduce sparsity and smoothness regularization to disambiguate sub-bin ambiguities that arise at fine range resolutions. Experimental results show that SpINRv2 significantly outperforms both classical and learning-based baselines, especially under high-frequency regimes, establishing a new benchmark for neural radar-based 3D imaging.

[5] Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion Models in a Vision-Language-Action Framework

Huixin Zhan,Jason H. Moore

Main category: cs.CV

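TL;DR: The paper models surgeon-specific operating style in robotic surgery with a discrete diffusion framework inside a vision-language-action (VLA) pipeline. Surgeon identity is encoded through natural-language prompts rather than explicit labels, and the resulting privacy risk is quantified.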

Details

Motivation: Current AI systems often ignore surgeons' individual operating styles, even though this personalization signal matters for surgical modeling.

Method: A discrete diffusion framework combined with a VLA pipeline treats gesture prediction as a structured sequence denoising task, conditioned on endoscopic video, surgical intent language, and a privacy-aware embedding of surgeon identity and skill.

Result: On the JIGSAWS dataset, the method accurately reconstructs gesture sequences and learns a distinct motion fingerprint for each surgeon. More expressive embeddings improve task performance but also increase the risk of identity leakage.

Conclusion: Personalized embeddings improve performance at the cost of higher privacy risk, so surgical modeling needs to balance personalization against privacy.

Abstract: Surgeons exhibit distinct operating styles due to differences in training, experience, and motor behavior - yet current AI systems often ignore this personalization signal. We propose a novel approach to model fine-grained, surgeon-specific fingerprinting in robotic surgery using a discrete diffusion framework integrated with a vision-language-action (VLA) pipeline. Our method formulates gesture prediction as a structured sequence denoising task, conditioned on multimodal inputs including endoscopic video, surgical intent language, and a privacy-aware embedding of surgeon identity and skill. Personalized surgeon fingerprinting is encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings demonstrate that while personalized embeddings improve performance, they also increase vulnerability to identity leakage, revealing the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.

[6] Open World Scene Graph Generation using Vision Language Models

Amartya Dutta,Kazi Sajeed Mehrab,Medha Sawhney,Abhilash Neog,Mridul Khurana,Sepideh Fatemi,Aanish Pradhan,M. Maruf,Ismini Lourentzou,Arka Daw,Anuj Karpatne

Main category: cs.CV

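TL;DR: The paper proposes a training-free framework for open-world scene graph generation that uses pretrained vision-language models to produce scene graphs directly, in a zero-shot setting.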

Details

Motivation: Existing methods rely on dataset-specific supervision or fine-tuning, which limits their use in open-world settings. The paper aims to use the general knowledge of pretrained models to generate scene graphs with no additional learning.

Method: Multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy recast scene graph generation as a zero-shot structured-reasoning problem.

Result: Experiments on Visual Genome, Open Images V6, and PSG show that pretrained models can perform relational understanding without task-level training.

Conclusion: Pretrained vision-language models have zero-shot relational understanding ability, which gives open-world scene graph generation an efficient, model-agnostic solution.

Abstract: Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. Most methods depend on dataset-specific supervision to learn the variety of interactions, restricting their usefulness in open-world settings, involving novel objects and/or relations. Even methods that leverage large Vision Language Models (VLMs) typically require benchmark-specific fine-tuning. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning. Casting SGG as a zero-shot structured-reasoning problem, our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets. To assess this setting, we formalize an Open-World evaluation protocol that measures performance when no SGG-specific data have been observed either in terms of objects or relations. Experiments on Visual Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate the capacity of pretrained VLMs to perform relational understanding without task-level training.

[7] Generative Learning of Differentiable Object Models for Compositional Interpretation of Complex Scenes

Antoni Nowinowski,Krzysztof Krawiec

Main category: cs.CV

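TL;DR: The paper extends the DVP architecture to handle multi-object scenes and improves training with loss functions defined in the latent space.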

Details

Motivation: The original DVP cannot handle multiple objects, and training with an image-space reconstruction loss is difficult.

Method: Extend the DVP architecture, add latent-space loss functions, and design a new benchmark dataset.

Result: The extended DVP outperforms the baselines (MONet and LIVE) in reconstruction quality and in decomposing overlapping objects.

Conclusion: Latent-space losses significantly ease training, but differentiable rendering in autoencoders still has limitations.

Abstract: This study builds on the architecture of the Disentangler of Visual Priors (DVP), a type of autoencoder that learns to interpret scenes by decomposing the perceived objects into independent visual aspects of shape, size, orientation, and color appearance. These aspects are expressed as latent parameters which control a differentiable renderer that performs image reconstruction, so that the model can be trained end-to-end with gradient using reconstruction loss. In this study, we extend the original DVP so that it can handle multiple objects in a scene. We also exploit the interpretability of its latent by using the decoder to sample additional training examples and devising alternative training modes that rely on loss functions defined not only in the image space, but also in the latent space. This significantly facilitates training, which is otherwise challenging due to the presence of extensive plateaus in the image-space reconstruction loss. To examine the performance of this approach, we propose a new benchmark featuring multiple 2D objects, which subsumes the previously proposed Multi-dSprites dataset while being more parameterizable. We compare the DVP extended in these ways with two baselines (MONet and LIVE) and demonstrate its superiority in terms of reconstruction quality and capacity to decompose overlapping objects. We also analyze the gradients induced by the considered loss functions, explain how they impact the efficacy of training, and discuss the limitations of differentiable rendering in autoencoders and the ways in which they can be addressed.

[8] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

Mateusz Michalkiewicz,Anekha Sokhal,Tadeusz Michalkiewicz,Piotr Pawlikowski,Mahsa Baktashmotlagh,Varun Jampani,Guha Balakrishnan

Main category: cs.CV

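TL;DR: GIQ is a comprehensive benchmark for evaluating the geometric reasoning of vision and vision-language foundation models, and it exposes significant gaps in current models' geometric understanding.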

Details

Motivation: Monocular 3D reconstruction methods and vision-language models score well on standard benchmarks, but how well they truly understand geometric properties remains unclear.

Method: GIQ contains synthetic and real images of 224 diverse polyhedra and runs systematic experiments on monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification.

Result: Current models perform poorly at reconstructing basic geometric forms, at tasks requiring detailed geometric differentiation (such as mental rotation), and at classifying complex polyhedra.

Conclusion: GIQ provides a structured platform for future geometry-aware representation learning and targets critical gaps in geometric intelligence.

Abstract: Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images of 224 diverse polyhedra - including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes - covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ is publicly available, providing a structured platform to highlight and address critical gaps in geometric intelligence, facilitating future progress in robust, geometry-aware representation learning.

[9] A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

Andrew Z. Wang,Songwei Ge,Tero Karras,Ming-Yu Liu,Yogesh Balaji

Main category: cs.CV

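TL;DR: The study examines modern decoder-only LLMs as text encoders for text-to-image diffusion models and finds that layer-normalized averaging of embeddings across all layers beats the usual last-layer embedding, with clear gains in alignment on complex prompts.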

Details

Motivation: Many text-to-image models still use the somewhat outdated T5 and CLIP as text encoders; the study evaluates modern LLMs as replacements.

Method: Build a standardized training and evaluation pipeline, train 27 models with 12 text encoders, and study the effect of embedding extraction method, LLM variant, and model size.

Result: Layer-normalized averaging across all layers significantly outperforms last-layer embeddings, and most LLMs with this conditioning outperform the baseline T5 model, with better visio-linguistic reasoning.

Conclusion: Modern LLMs are stronger text encoders for text-to-image generation, especially when embeddings are averaged across layers.

Abstract: Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. Most LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.
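
One way to picture "layer-normalized averaging across all layers" is the sketch below. It is an interpretation, not the authors' code: it assumes each token's hidden vector is normalized over its feature dimension without learned scale or bias, and that the normalized vectors are then averaged with equal weight per layer. The names `layer_norm` and `layer_normalized_average` are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def layer_normalized_average(hidden_states):
    """hidden_states[layer][token] is a feature vector.

    Returns one averaged embedding per token, with every layer
    normalized first so that no layer dominates through scale alone.
    """
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    out = [[0.0] * dim for _ in range(num_tokens)]
    for layer in hidden_states:
        for t, vec in enumerate(layer):
            normed = layer_norm(vec)
            for d in range(dim):
                out[t][d] += normed[d] / num_layers
    return out

# Two layers, one token; the second layer has 100x larger activations.
states = [[[1.0, 2.0, 3.0]], [[100.0, 200.0, 300.0]]]
averaged = layer_normalized_average(states)
```

In this toy input both layers normalize to nearly the same vector, so the average is not swamped by the larger-magnitude layer, which is the usual motivation for normalizing before pooling across layers.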

[10] Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation

Ioannis Iakovidis,Zahra Kalantari,Amir Hossein Payberah,Fernando Jaramillo,Francisco Pena Escobar

Main category: cs.CV

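TL;DR: The paper proposes a self-supervised method that combines deep clustering and negative sampling to segment radar satellite images into water and land without manual annotations, and uses an ensemble to improve performance.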

Details

Motivation: Manually annotated data for wetland monitoring is slow and expensive to produce.

Method: Self-supervised training with deep clustering and negative sampling, plus an ensemble version of the model to reduce variance.

Result: The ensemble of self-supervised models improves Intersection over Union on the test set by 0.02 over a single fully supervised model with the same architecture.

Conclusion: Self-supervised methods show promise for wetland monitoring and reduce the dependence on annotated data.

Abstract: In recent years the wide availability of high-resolution radar satellite images along with the advancement of computer vision models have enabled the remote monitoring of the surface area of wetlands. However, these models require large amounts of manually annotated satellite images, which are slow and expensive to produce. To overcome this problem, self-supervised training methods have been deployed to train models without using annotated data. In this paper we use a combination of deep clustering and negative sampling to train a model to segment radar satellite images into areas that separate water from land without the use of any manual annotations. Furthermore, we implement an ensemble version of the model to reduce variance and improve performance. Compared to a single fully-supervised model using the same architecture, our ensemble of self-supervised models achieves a 0.02 improvement in the Intersection Over Union metric over our test dataset.
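
For readers unfamiliar with the metric, here is a small sketch of Intersection over Union on binary water masks, together with one simple way to ensemble several models. The abstract does not say how the ensemble members are combined, so averaging per-pixel probabilities is an assumption, and the toy probabilities are invented purely for illustration.

```python
def iou(pred, target):
    """Intersection over Union for two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return intersection / union if union else 1.0

def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-pixel water probabilities over models, then threshold.

    Assumption: mean-probability voting; the paper may combine members differently.
    """
    n = len(prob_maps)
    return [int(sum(pixel) / n >= threshold) for pixel in zip(*prob_maps)]

# Toy 4-pixel example (illustrative values only); 1 = water, 0 = land.
target = [1, 1, 0, 0]
prob_maps = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.7, 0.1, 0.3],
    [0.7, 0.6, 0.3, 0.2],
]
single = [int(p >= 0.5) for p in prob_maps[0]]
combined = ensemble_mask(prob_maps)
single_iou = iou(single, target)
ensemble_iou = iou(combined, target)
```

Here the first model alone makes two pixel errors that the other members outvote, which is the variance-reduction effect the abstract attributes to ensembling.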

[11] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence

Octave Mariotti,Zhipeng Du,Yash Bhalgat,Oisin Mac Aodha,Hakan Bilen

Main category: cs.CV

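TL;DR: The paper learns dense semantic correspondence by lifting 2D keypoints into a canonical 3D space with monocular depth estimation, addressing the poor generalization of supervised methods beyond sparsely annotated keypoints.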

Details

Motivation: Supervised semantic correspondence methods are limited to sparsely annotated keypoints, generalize poorly, and do not learn dense correspondence effectively.

Method: 2D keypoints are lifted into 3D using monocular depth estimation to build a continuous canonical manifold, with no explicit 3D supervision or camera annotations.

Result: The method significantly outperforms supervised baselines on unseen keypoints, and unsupervised baselines outperform supervised ones when generalizing across datasets.

Conclusion: Learning dense correspondence markedly improves generalization in semantic correspondence, and the results also highlight the potential of unsupervised methods.

Abstract: Semantic correspondence (SC) aims to establish semantically meaningful matches across different instances of an object category. We illustrate how recent supervised SC methods remain limited in their ability to generalize beyond sparsely annotated training keypoints, effectively acting as keypoint detectors. To address this, we propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations. Additionally, we introduce SPair-U, an extension of SPair-71k with novel keypoint annotations, to better assess generalization. Experiments not only demonstrate that our model significantly outperforms supervised baselines on unseen keypoints, highlighting its effectiveness in learning robust correspondences, but that unsupervised baselines outperform supervised counterparts when generalized across different datasets.

[12] A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

Vishaal Udandarao,Mehdi Cherti,Shyamgopal Karthik,Jenia Jitsev,Samuel Albanie,Matthias Bethge

Main category: cs.CV

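TL;DR: The paper analyzes 17 benchmarks used to measure the compositional understanding of vision-language models (VLMs), exposes biases in their design, and offers recommendations for improvement.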

Details

Motivation: Assess whether existing benchmarks actually measure the compositional understanding of VLMs, and identify problems in their design.

Method: Examine benchmark design choices (such as data source and construction of negatives) and compare simple blind heuristics against CLIP models.

Result: The benchmarks show a distribution asymmetry between positives and negatives, so simple heuristics perform on par with CLIP models and the benchmarks fail to measure compositional understanding effectively.

Conclusion: The paper gives recommendations for benchmark design that reduce bias and improve robustness.

Abstract: We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuristics (e.g. token-length, log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. We demonstrate that the underlying factor is a distribution asymmetry between positive and negative images/captions, induced by the benchmark construction procedures. To mitigate these issues, we provide a few key recommendations for constructing more robust vision-language compositional understanding benchmarks, that would be less prone to such simple attacks.
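
A "blind" heuristic in the sense of the abstract never looks at the image. The sketch below scores a token-length heuristic on caption pairs; the function name, the tie-handling rule, and the toy captions are all made up for illustration, and the direction of the bias (whether positives tend to be shorter or longer) is an arbitrary choice here, not a finding from the paper.

```python
def length_heuristic_accuracy(pairs, prefer_longer=False):
    """Accuracy of picking the positive caption by whitespace token count alone.

    pairs: list of (positive_caption, negative_caption).
    Ties count as 0.5, the expected score of a coin flip.
    """
    correct = 0.0
    for positive, negative in pairs:
        pos_len = len(positive.split())
        neg_len = len(negative.split())
        if pos_len == neg_len:
            correct += 0.5
        elif (pos_len > neg_len) == prefer_longer:
            correct += 1.0
    return correct / len(pairs)

# Invented caption pairs, not taken from any benchmark.
toy_pairs = [
    ("a dog on a couch", "a dog on top of a big couch"),
    ("two cats sleeping", "two cats are not sleeping at all"),
    ("a red car", "a blue car"),
    ("a man riding a horse on the beach", "a horse riding a man"),
]
accuracy = length_heuristic_accuracy(toy_pairs)
```

If such a heuristic scores well above 0.5 on a real benchmark, the positives and negatives differ in a way unrelated to the image, which is the distribution asymmetry the abstract describes.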

[13] Highly Compressed Tokenizer Can Generate Without Training

L. Lao Beyer,T. Li,X. Chen,S. Karaman,K. He

Main category: cs.CV

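TL;DR: 1D image tokenizers represent images as highly compressed one-dimensional token sequences, and heuristic manipulation of those tokens is enough to edit and generate images.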

Details

Motivation: Explore what 1D image tokenizers can do for image editing and generation, given their highly compressed latent space.

Method: Build an image generation pipeline from gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity.

Result: The approach handles inpainting and text-guided image editing, and generates diverse, realistic samples without training any generative model.

Conclusion: With simple token manipulation and test-time optimization, 1D image tokenizers offer an efficient and flexible route to image editing and generation.

Abstract: Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.

[14] Seeing Voices: Generating A-Roll Video from Audio with Mirage

Aditi Sundararaman,Amogh Adishesha,Andrew Jaegle,Dan Bigioi,Hyoung-Kyu Song,Jon Kyl,Justin Mao,Kevin Lan,Mojtaba Komeili,ShahRukh Athar,Sheila Babayan,Stanislau Beliasau,William Buchwalter

Main category: cs.CV

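TL;DR: Mirage is an audio-to-video foundation model that generates realistic, expressive video from an audio input, and is particularly strong at video of people talking.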

Details

Motivation: Current video generation either ignores audio and produces silent image sequences, or is confined to narrow applications such as re-dubbing. Mirage aims to close this gap and integrate audio and video coherently.

Method: A unified method for training self-attention-based audio-to-video models, either from scratch or from existing weights, without audio-specific architectures or loss components.

Result: Mirage's videos have better subjective quality than those of other methods, especially for people talking.

Conclusion: Mirage provides a general, high-quality approach to audio-to-video generation, with a clear advantage on talking-person video.

Abstract: From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video's audio track) with what we see (the video's image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or given existing weights. This methodology allows Mirage to retain generality as an approach to audio-to-video generation while producing outputs of superior subjective quality to methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see paper and comments for links).

[15] SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

Nhat Thanh Tran,Fanghui Xue,Shuai Zhang,Jiancheng Lyu,Yunling Zheng,Yingyong Qi,Jack Xin

Main category: cs.CV

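TL;DR: The paper proposes SEMA, an attention mechanism that addresses the quadratic cost of full attention and the lack of focus in linear attention, and performs well on image classification.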

Details

Motivation: Vanilla full attention has quadratic computational complexity, and its linear variants cannot focus; both problems limit attention in computer vision tasks.

Method: Give a mathematical definition of generalized attention, then design SEMA on top of it, using token localization to avoid dispersion and arithmetic averaging to capture the global aspect of attention.

Result: On ImageNet-1k, SEMA outperforms linear attention and recent vision Mamba models on image classification, particularly at larger image scales.

Conclusion: SEMA is a scalable and efficient attention mechanism that offers a new option for computer vision tasks.

Abstract: Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and recent development of Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA) which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. We support our approach on ImageNet-1k where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger scales of images at similar model parameter sizes.
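
The dispersion property can be seen numerically for ordinary softmax attention. The sketch below is an illustration under one specific assumption, namely that query-key scores stay bounded (here in [-1, 1], generated by a sine as a deterministic stand-in); the paper's exact conditions and its generalized attention definition are not reproduced. With scores in [-B, B], every weight is at most e^{2B}/n, so the largest weight shrinks as the number of keys n grows.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(num_keys):
    """Largest softmax weight when scores are bounded in [-1, 1].

    The sine is only a stand-in for bounded query-key scores.
    """
    scores = [math.sin(i) for i in range(num_keys)]
    return max(softmax(scores))

# The largest weight falls toward zero as keys are added.
weights_by_size = {n: max_attention_weight(n) for n in (10, 1000, 100000)}
```

Restricting each query to a local window of keys keeps n small, which is one reading of why token localization helps attention stay focused.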

[16] OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal

Kangning Yang,Ling Ouyang,Huiming Sun,Jie Cai,Lan Fu,Jiaming Ding,Chiu Man Ho,Zibo Meng

Main category: cs.CV

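TL;DR: The paper proposes a new way to collect reflection datasets and releases OpenRR-1k, a high-quality and diverse dataset for making reflection removal more robust.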

Details

Motivation: Reflection removal is held back by the lack of high-quality in-the-wild datasets, so a convenient, low-cost, and scalable collection method is needed.

Method: A new dataset collection paradigm that keeps the data high-quality, aligned, and diverse; OpenRR-1k is collected with it.

Result: OpenRR-1k contains 1,000 high-quality image pairs, and experiments show it improves the robustness of reflection removal methods.

Conclusion: OpenRR-1k gives reflection removal a practical, high-quality data resource that can be extended further.

Abstract: Reflection removal technology plays a crucial role in photography and computer vision applications. However, existing techniques are hindered by the lack of high-quality in-the-wild datasets. In this paper, we propose a novel paradigm for collecting reflection datasets from a fresh perspective. Our approach is convenient, cost-effective, and scalable, while ensuring that the collected data pairs are of high quality, perfectly aligned, and represent natural and diverse scenarios. Following this paradigm, we collect a Real-world, Diverse, and Pixel-aligned dataset (named OpenRR-1k dataset), which contains 1,000 high-quality transmission-reflection image pairs collected in the wild. Through the analysis of several reflection removal methods and benchmark evaluation experiments on our dataset, we demonstrate its effectiveness in improving robustness in challenging real-world environments. Our dataset is available at https://github.com/caijie0620/OpenRR-1k.

[17] Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive Gating

Guandong Li,Mengxia Ye

Main category: cs.CV

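TL;DR: STNet is a network architecture whose Spatial-Spectral Transformer module is designed to reduce overfitting and improve generalization in hyperspectral image classification.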

Details

Motivation: Hyperspectral image classification faces high-dimensional data, sparsely distributed ground objects, and spectral redundancy, which lead to overfitting and limited generalization.

Method: STNet uses a Spatial-Spectral Transformer module that decouples spatial and spectral attention and applies two gating mechanisms for feature extraction and fusion.

Result: STNet outperforms mainstream methods on the IN, UP, and KSC datasets.

Conclusion: STNet improves representation capability without increasing network depth or width, and reduces overfitting risk in small-sample and high-noise scenarios.

Abstract: Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more effectively extract and fuse spatial context with fine spectral information in hyperspectral image (HSI) classification, this paper proposes a novel network architecture called STNet. The core advantage of STNet stems from the dual innovative design of its Spatial-Spectral Transformer module: first, the fundamental explicit decoupling of spatial and spectral attention ensures targeted capture of key information in HSI; second, two functionally distinct gating mechanisms perform intelligent regulation at both the fusion level of attention flows (adaptive attention fusion gating) and the internal level of feature transformation (GFFN). This characteristic demonstrates superior feature extraction and fusion capabilities compared to traditional convolutional neural networks, while reducing overfitting risks in small-sample and high-noise scenarios. STNet enhances model representation capability without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.

[18] Locating Tennis Ball Impact on the Racket in Real Time Using an Event Camera

Yuto Kase,Kai Ishibe,Ryoma Yasuda,Yudai Washida,Sakiko Hashimoto

Main category: cs.CV

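TL;DR: The paper proposes a method for locating the ball's impact point on a tennis racket in real time with an event camera, avoiding the heavy memory use of high-speed cameras and the time cost of manual digitization.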

Details

Motivation: In racket sports such as tennis, locating the impact point is important for analyzing player and equipment characteristics, but existing methods consume a lot of memory and are prone to human error.

Method: An event camera efficiently captures brightness changes; conventional computer vision techniques are combined with an original event-based measure (PATS) to identify the impact point in three steps.

Result: Measurement error is within the permissible range for assessing tennis players' performance, and computation time is short enough for real-time use.

Conclusion: The method offers an efficient, low-memory solution for monitoring player performance in real time.

Abstract: In racket sports, such as tennis, locating the ball's position at impact is important in clarifying player and equipment characteristics, thereby aiding in personalized equipment design. High-speed cameras are used to measure the impact location; however, their excessive memory consumption limits prolonged scene capture, and manual digitization for position detection is time-consuming and prone to human error. These limitations make it difficult to effectively capture the entire playing scene, hindering the ability to analyze the player's performance. We propose a method for locating the tennis ball impact on the racket in real time using an event camera. Event cameras efficiently measure brightness changes (called 'events') with microsecond accuracy under high-speed motion while using lower memory consumption. These cameras enable users to continuously monitor their performance over extended periods. Our method consists of three identification steps: time range of swing, timing at impact, and contours of ball and racket. Conventional computer vision techniques are utilized along with an original event-based processing to detect the timing at impact (PATS: the amount of polarity asymmetry in time symmetry). The results of the experiments were within the permissible range for measuring tennis players' performance. Moreover, the computation time was sufficiently short for real-time applications.

[19] How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models

Huixuan Zhang,Junzhe Zhang,Xiaojun Wan

Main category: cs.CV

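TL;DR: Proposes Step AG, a general adaptive guidance strategy that restricts classifier-free guidance to the first several denoising steps, significantly improving generation efficiency while preserving image quality and text alignment.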

Details

Motivation: Classifier-free guidance is inefficient in text-to-vision diffusion models: it requires twice as many model forward steps as unconditional generation, which is costly. Existing adaptive guidance methods lack analysis and empirical support and do not generalize. Method: Proposes the Step AG strategy, which restricts classifier-free guidance to the first several denoising steps to reduce computational cost. Result: Experiments show that Step AG performs well in image quality and text alignment, achieves an average speedup of 20% to 30%, and applies to different models and settings. Conclusion: Step AG is a simple, general adaptive guidance strategy that significantly improves generation efficiency and applies to a variety of diffusion models. Abstract: With the rapid development of text-to-vision generation diffusion models, classifier-free guidance has emerged as the most prevalent method for conditioning. However, this approach inherently requires twice as many steps for model forwarding compared to unconditional generation, resulting in significantly higher costs. While a previous study has introduced the concept of adaptive guidance, it lacks solid analysis and empirical results, making the previous method inapplicable to general diffusion models. In this work, we present another perspective on applying adaptive guidance and propose Step AG, which is a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment, and the results indicate that restricting classifier-free guidance to the first several denoising steps is sufficient for generating high-quality, well-conditioned images, achieving an average speedup of 20% to 30%. Such improvement is consistent across different settings such as inference steps, and various models including video generation models, highlighting the superiority of our method.
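
The idea of restricting guidance to the early denoising steps can be sketched as follows. This is a minimal toy, not the authors' implementation: the function names, the update rule `x - step_size * eps`, the toy noise predictors, and the choice of 15 guided steps are all assumptions made for illustration. Only the standard classifier-free guidance combination `eps_u + w * (eps_c - eps_u)` and the "guide only the first several steps" rule come from the abstract and common practice.

```python
def sample(num_steps, guided_steps, guidance_scale, cond_eps, uncond_eps,
           x_init=1.0, step_size=0.1):
    """Toy denoising loop that applies classifier-free guidance (CFG)
    only during the first `guided_steps` steps.

    Returns the final sample and the number of model forward passes used.
    """
    x = x_init
    forward_passes = 0
    for t in range(num_steps):
        eps_c = cond_eps(x, t)            # conditional prediction (always needed)
        forward_passes += 1
        if t < guided_steps:
            eps_u = uncond_eps(x, t)      # extra unconditional pass, early steps only
            forward_passes += 1
            eps = eps_u + guidance_scale * (eps_c - eps_u)
        else:
            eps = eps_c                   # later steps: conditional prediction alone
        x = x - step_size * eps           # hypothetical update rule for the toy
    return x, forward_passes


# Hypothetical stand-ins for a real noise-prediction network.
def toy_cond_eps(x, t):
    return x - 1.0


def toy_uncond_eps(x, t):
    return x


_, full_cost = sample(50, 50, 7.5, toy_cond_eps, toy_uncond_eps)
_, step_ag_cost = sample(50, 15, 7.5, toy_cond_eps, toy_uncond_eps)
print(full_cost, step_ag_cost)  # 100 65
```

The forward-pass count is the quantity of interest here: full guidance costs two passes per step, while guiding only the first `guided_steps` steps costs `num_steps + guided_steps` passes in total. The actual speedup reported in the abstract depends on the models and settings evaluated there.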

[20] MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding

Shivang Chopra,Lingchao Mao,Gabriela Sanchez-Rodriguez,Andrew J Feola,Jing Li,Zsolt Kira

Main category: cs.CV

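TL;DR: MedMoE is a vision-language processing framework that dynamically adapts to different medical imaging modalities; using a Mixture-of-Experts module and multi-scale feature extraction, it improves cross-modal alignment and retrieval performance.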

Details

Motivation: Existing medical vision-language frameworks apply a uniform local feature extraction strategy and overlook modality-specific demands, so diagnostic information is under-extracted. Method: MedMoE extracts multi-scale image features with a Swin Transformer and routes them dynamically, via a conditional MoE module, to modality-specific expert branches, enabling spatially adaptive attention. Result: Experiments show that MedMoE improves alignment and retrieval performance on a range of medical benchmarks. Conclusion: MedMoE demonstrates the importance of modality-specialized visual representations in clinical vision-language systems. Abstract: Different medical imaging modalities capture diagnostic information at varying spatial resolutions, from coarse global patterns to fine-grained localized structures. However, most existing vision-language frameworks in the medical domain apply a uniform strategy for local feature extraction, overlooking the modality-specific demands. In this work, we present MedMoE, a modular and extensible vision-language processing framework that dynamically adapts visual representation based on the diagnostic context. MedMoE incorporates a Mixture-of-Experts (MoE) module conditioned on the report type, which routes multi-scale image features through specialized expert branches trained to capture modality-specific visual semantics. These experts operate over feature pyramids derived from a Swin Transformer backbone, enabling spatially adaptive attention to clinically relevant regions. This framework produces localized visual representations aligned with textual descriptions, without requiring modality-specific supervision at inference. Empirical results on diverse medical benchmarks demonstrate that MedMoE improves alignment and retrieval performance across imaging modalities, underscoring the value of modality-specialized visual representations in clinical vision-language systems.

[21] Image Demoiréing Using Dual Camera Fusion on Mobile Phones

Yanting Mei,Zhilu Zhang,Xiaohe Wu,Wangmeng Zuo

Main category: cs.CV

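TL;DR: Proposes Dual Camera fusion for Image Demoiréing (DCID), which uses the ultra-wide-angle image to help remove moiré from the wide-angle image and outperforms existing methods.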

Details

Motivation: Modern smartphones are usually equipped with both lenses, and the ultra-wide-angle image can provide normal colors and textures when moiré appears in the wide-angle image. Method: A lightweight ultra-wide-angle image encoder is integrated into an existing demoiréing network, together with a fast two-stage image alignment scheme. Result: On a real-world dataset of about 9,000 samples, the method outperforms existing methods. Conclusion: DCID removes moiré effectively through dual camera fusion; code and dataset are open-sourced. Abstract: When shooting electronic screens, moiré patterns usually appear in captured images, which seriously affects the image quality. Existing image demoiréing methods face great challenges in removing large and heavy moiré. To address the issue, we propose to utilize Dual Camera fusion for Image Demoiréing (DCID), i.e., using the ultra-wide-angle (UW) image to assist the moiré removal of the wide-angle (W) image. This is inspired by two motivations: (1) modern smartphones are commonly equipped with both lenses, (2) the UW image generally can provide normal colors and textures when moiré exists in the W image, mainly due to their different focal lengths. In particular, we propose an efficient DCID method, where a lightweight UW image encoder is integrated into an existing demoiréing network and a fast two-stage image alignment scheme is presented. Moreover, we construct a large-scale real-world dataset with diverse mobile phones and monitors, containing about 9,000 samples. Experiments on the dataset show our method performs better than state-of-the-art methods. Code and dataset are available at https://github.com/Mrduckk/DCID.

[22] SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Woohyeon Park,Woojin Kim,Jaeik Kim,Jaeyoung Do

Main category: cs.CV

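TL;DR: SECOND is a new method that addresses object hallucination in vision-language models through selective and contrastive decoding, using multi-scale visual information to improve performance.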

Details

Motivation: The performance of existing vision-language models is limited by object hallucination, calling for more precise visual understanding. Method: Proposes SECOND, which progressively integrates multi-scale visual information through selective and contrastive decoding to reduce hallucination. Result: SECOND significantly reduces hallucination and performs strongly on multiple benchmarks. Conclusion: Multi-scale visual information has considerable potential, and SECOND outperforms existing techniques. Abstract: Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By contrasting this visual information iteratively, SECOND significantly reduces perceptual hallucinations and achieves strong results across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.

[23] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation

Taiqin Chen,Zikun Zhou,Zheng Fang,Wenzhen Zou,Kanjun Liu,Ke Chen,Yongbing Zhang,Yaowei Wang

Main category: cs.CV

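TL;DR: Proposes RadioDUN, a radio map estimation method based on sparse signal recovery and a physical propagation model, with a dynamic reweighting module and a shadowing loss to improve performance.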

Details

Motivation: Existing deep learning methods are hard to integrate with the physical characteristics of radio maps, and building a dense radio map from sparse samples is challenging. Method: Radio map estimation is cast as a sparse signal recovery problem and decomposed into optimization sub-problems using a physical propagation model; the RadioDUN network and a dynamic reweighting module (DRM) are proposed, and a shadowing loss is introduced. Result: Experiments show that RadioDUN outperforms existing methods. Conclusion: By combining physical characteristics with adaptive learning, RadioDUN effectively improves radio map estimation. Abstract: The radio map represents the spatial distribution of spectrum resources within a region, supporting efficient resource allocation and interference mitigation. However, it is difficult to construct a dense radio map, as only a limited number of samples can be measured in practical scenarios. While existing works have used deep learning to estimate dense radio maps from sparse samples, they are hard to integrate with the physical characteristics of the radio map. To address this challenge, we cast radio map estimation as a sparse signal recovery problem. A physical propagation model is further incorporated to decompose the problem into multiple factor optimization sub-problems, thereby reducing recovery complexity. Inspired by existing compressive sensing methods, we propose the Radio Deep Unfolding Network (RadioDUN) to unfold the optimization process, achieving adaptive parameter adjustment and prior fitting in a learnable manner. To account for the radio propagation characteristics, we develop a dynamic reweighting module (DRM) to adaptively model the importance of each factor for the radio map. Inspired by the shadowing factor in the physical propagation model, we integrate obstacle-related factors to express the obstacle-induced stochastic signal decay. The shadowing loss is further designed to constrain the factor prediction and act as a supplementary supervised objective, which enhances the performance of RadioDUN. Extensive experiments have been conducted to demonstrate that the proposed method outperforms the state-of-the-art methods. Our code will be made publicly available upon publication.
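
For readers unfamiliar with deep unfolding, the sketch below shows the kind of iterative sparse-recovery solver that such networks typically unroll into layers. It is plain ISTA (iterative shrinkage-thresholding) for the objective 0.5 * ||A x - y||^2 + lam * ||x||_1, written in pure Python. It is not RadioDUN itself: the abstract does not specify the exact optimization, and the names `ista`, `lam`, and `step` are assumptions for illustration. In an unfolded network, quantities like `step` and `lam` would usually become learnable per-layer parameters.

```python
def soft_threshold(values, tau):
    """Shrink each entry toward zero by tau (proximal operator of the L1 norm)."""
    return [max(abs(v) - tau, 0.0) * (1.0 if v > 0 else -1.0) for v in values]


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def ista(A, y, lam, step, num_iters):
    """Run a fixed number of ISTA iterations.

    Each iteration is a gradient step on the data term followed by soft
    thresholding; a deep unfolding network would treat each iteration as
    one layer.
    """
    x = [0.0] * len(A[0])
    A_t = transpose(A)
    for _ in range(num_iters):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec(A_t, residual)
        x = soft_threshold([xi - step * g for xi, g in zip(x, grad)], step * lam)
    return x


# Tiny example: with an identity measurement matrix, ISTA reduces to
# soft-thresholding the observations, so small entries are set to zero.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [2.0, 0.05, -1.0]
x_hat = ista(A, y, lam=0.1, step=1.0, num_iters=10)
```

In this toy case the small middle observation is suppressed to zero while the two large entries are kept (slightly shrunk), which is the sparsity-promoting behavior that makes this family of solvers a natural fit for recovery from sparse samples.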

[24] Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring

Mingjie Xu,Andrew Estornell,Hongzheng Yang,Yuzhi Zhao,Zhaowei Zhu,Qi Xuan,Jiaheng Wei

Main category: cs.CV

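TL;DR: SCALE is a new data selection pipeline that improves data quality for vision-language models through cross-modality assessment, addressing noisy image-text alignment and ambiguous text.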

Details

Motivation: The effectiveness of current vision-language models depends on high-quality datasets, but noisy image-text alignment and ambiguous text harm model accuracy and robustness. Method: SCALE uses a cross-modality assessment framework that assigns each data entry to a task, generates task-relevant captions, and evaluates alignment, clarity, task rarity, text coherence, and image clarity. Result: The study finds that current unimodal quality assessment methods underestimate samples that are important for specific tasks, and that appropriately generated image captions effectively convert multimodal tasks into a unified text modality. Conclusion: SCALE provides an efficient data selection solution for vision-language models, improving performance and robustness. Abstract: The application of visual instruction tuning and other post-training techniques has significantly enhanced the capabilities of Large Language Models (LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with more comprehensive visual language datasets. However, the effectiveness of VLMs is highly dependent on large-scale, high-quality datasets that ensure precise recognition and accurate reasoning. Two key challenges hinder progress: (1) noisy alignments between images and the corresponding text, which lead to misinterpretation, and (2) ambiguous or misleading text, which obscures visual content. To address these challenges, we propose SCALE (Single modality data quality and Cross modality Alignment Evaluation), a novel quality-driven data selection pipeline for VLM instruction tuning datasets. Specifically, SCALE integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task, generates general and task-specific captions (covering scenes, objects, style, etc.), and evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry based on the generated captions. We reveal that: (1) current unimodal quality assessment methods evaluate one modality while overlooking the rest, which can underestimate samples essential for specific tasks and discard the lower-quality instances that help build model robustness; and (2) appropriately generated image captions provide an efficient way to transfer the image-text multimodal task into a unified text modality.

[25] Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance

June Suk Choi,Kyungmin Lee,Sihyun Yu,Yisol Choi,Jinwoo Shin,Kimin Lee

Main category: cs.CV

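TL;DR: The paper proposes adaptive low-pass guidance (ALG), which addresses the static videos produced in image-to-video (I2V) generation by premature exposure to high-frequency details, and significantly improves motion dynamics.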

Details

Motivation: When existing I2V methods fine-tune text-to-video (T2V) models, premature exposure to high-frequency details often leaves the generated videos lacking in motion, appearing nearly static. Method: Proposes ALG, which adaptively low-pass filters the input image during the early stage of denoising to reduce the interference of high-frequency details with the sampling process. Result: ALG significantly improves video dynamics (an average 36% improvement in dynamic degree on VBench-I2V) while preserving image quality and text alignment. Conclusion: ALG is a simple and effective fix for insufficient motion in I2V generation that does not sacrifice other aspects of performance. Abstract: Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos, while preserving image fidelity and text alignment. In particular, on the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
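
A rough sketch of the scheduling idea follows: blur the conditioning signal strongly at the first denoising steps, relax the blur as sampling proceeds, and pass the unfiltered signal once the early stage is over. Everything concrete here is an assumption for illustration: the abstract does not state the filter type, the schedule, or the cutoff, so the box filter, the linear decay, and the names `max_radius` and `cutoff_fraction` are hypothetical. A single image row (a 1D list) stands in for a full image to keep the example dependency-free.

```python
def box_blur(signal, radius):
    """Simple moving-average low-pass filter over a 1D signal."""
    if radius <= 0:
        return list(signal)
    n = len(signal)
    blurred = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        blurred.append(sum(window) / len(window))
    return blurred


def blur_radius(step, num_steps, max_radius, cutoff_fraction):
    """Hypothetical schedule: radius decays linearly from max_radius at
    step 0 to zero at the cutoff step; no filtering afterwards."""
    cutoff = int(num_steps * cutoff_fraction)
    if step >= cutoff:
        return 0
    return max(1, round(max_radius * (1 - step / cutoff)))


def conditioning_for_step(image_row, step, num_steps,
                          max_radius=3, cutoff_fraction=0.2):
    """Return the (possibly low-pass filtered) conditioning signal
    that a sampler would see at the given denoising step."""
    radius = blur_radius(step, num_steps, max_radius, cutoff_fraction)
    return box_blur(image_row, radius)


row = [0.0, 0.0, 10.0, 0.0, 0.0]                  # a sharp, high-frequency spike
early = conditioning_for_step(row, step=0, num_steps=50)
late = conditioning_for_step(row, step=40, num_steps=50)
```

Here `early` is a smoothed version of the spike, while `late` equals the original row, so fine detail from the reference image only reaches the sampler after the coarse motion has had a chance to form.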

[26] MARMOT: Masked Autoencoder for Modeling Transient Imaging

Siyuan Shen,Ziheng Wang,Xingyue Peng,Suan Xia,Ruiqian Li,Shiying Li,Jingyi Yu

Main category: cs.CV

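TL;DR: Proposes MARMOT, a self-supervised pretrained model based on a masked autoencoder for non-line-of-sight (NLOS) transient imaging tasks, pretrained on a large-scale dataset and shown to perform efficiently.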

Details

Motivation: Existing methods for NLOS scenarios mainly optimize volume density or surface reconstruction and cannot learn priors from datasets. MARMOT aims to fill this gap through self-supervised pretraining. Method: MARMOT uses a Transformer-based encoder-decoder that learns features from partially masked transient data via a scanning pattern mask (SPM) and predicts the full measurements. Pretraining uses TransVerse, a synthesized dataset of 500K 3D models. Result: Experiments show that MARMOT outperforms existing methods in both quantitative and qualitative evaluations, demonstrating its efficiency. Conclusion: MARMOT provides an efficient pretrained model for NLOS transient imaging that supports direct feature transfer or decoder finetuning for downstream tasks. Abstract: Pretrained models have demonstrated impressive success in many modalities such as language and vision. Recent works facilitate the pretraining paradigm in imaging research. Transients are a novel modality, which are captured for an object as photon counts versus arrival times using a precisely time-resolved sensor. In particular for non-line-of-sight (NLOS) scenarios, transients of hidden objects are measured beyond the sensor's direct line of sight. Using NLOS transients, the majority of previous works optimize volume density or surfaces to reconstruct the hidden objects and do not transfer priors learned from datasets. In this work, we present a masked autoencoder for modeling transient imaging, or MARMOT, to facilitate NLOS applications. Our MARMOT is a self-supervised model pretrained on massive and diverse NLOS transient datasets. Using a Transformer-based encoder-decoder, MARMOT learns features from partially masked transients via a scanning pattern mask (SPM), where the unmasked subset is functionally equivalent to arbitrary sampling, and predicts full measurements. Pretrained on TransVerse, a synthesized transient dataset of 500K 3D models, MARMOT adapts to downstream imaging tasks using direct feature transfer or decoder finetuning. Comprehensive experiments are carried out in comparison with state-of-the-art methods. Quantitative and qualitative results demonstrate the efficiency of our MARMOT.

[27] Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization

Qilin Yin,Wei Lu,Xiangyang Luo,Xiaochun Cao

Main category: cs.CV

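TL;DR: The paper proposes a universal context-aware contrastive learning framework (UniCaCLF) to address the challenge of detecting tampered local video segments in multimedia forensics.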

Details

Motivation: Existing research mainly treats deepfake detection as a classification task and ignores cases where only some segments of a video are tampered with, although localizing such partial tampering better matches real-world needs. Method: Uses supervised contrastive learning to identify tampered segments through anomaly detection, and introduces a context-aware perception layer and an adaptive context updater to build a context-aware contrastive objective. Result: Experiments on five public datasets show that UniCaCLF significantly outperforms existing algorithms. Conclusion: Through context-aware contrastive learning, UniCaCLF effectively improves the detection of locally tampered segments. Abstract: Most research efforts in the multimedia forensics domain have focused on detecting forged audio-visual content and have achieved sound results. However, these works only consider deepfake detection as a classification task and ignore the case where partial segments of the video are tampered with. Temporal forgery localization (TFL) of small fake audio-visual clips embedded in real videos is still challenging and more in line with realistic application scenarios. To resolve this issue, we propose a universal context-aware contrastive learning framework (UniCaCLF) for TFL. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection, allowing for the precise localization of temporal forged segments. To this end, we propose a novel context-aware perception layer that utilizes a heterogeneous activation operation and an adaptive context updater to construct a context-aware contrastive objective, which enhances the discriminability of forged instant features by contrasting them with genuine instant features in terms of their distances to the global context. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants in a supervised sample-by-sample manner, suppressing the cross-sample influence to improve temporal forgery localization performance. Extensive experimental results over five public datasets demonstrate that our proposed UniCaCLF significantly outperforms the state-of-the-art competing algorithms.

Zhiyi Zhu,Xiaoyu Wu,Zihao Liu,Linlin Yang

Main category: cs.CV

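TL;DR: MLVTG is a new video temporal grounding framework whose MambaAligner and LLMRefiner modules address the redundant attention and multi-modal alignment problems of existing Transformer-based methods, achieving more precise localization.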

Details

Motivation: Existing Transformer-based video temporal grounding methods suffer from redundant attention and insufficient multi-modal alignment, which hurts localization accuracy. Method: MLVTG combines MambaAligner (which models temporal dependencies with Vision Mamba blocks) and LLMRefiner (which uses the semantic priors of a pre-trained LLM to enhance alignment), achieving both temporal and semantic alignment. Result: On QVHighlights, Charades-STA, and TVSum, MLVTG outperforms existing baselines and achieves state-of-the-art performance. Conclusion: With its dual alignment strategy, MLVTG significantly improves the precision of video temporal grounding and offers a new direction for multi-modal video understanding. Abstract: Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual priors, enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.

[29] Robust Visual Localization via Semantic-Guided Multi-Scale Transformer

Zhongtao Tian,Wenhao Huang,Zhidong Chen,Xiao Wei Sun

Main category: cs.CV

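TL;DR: Proposes a framework that combines multi-scale feature learning with semantic scene understanding for visual localization in dynamic environments, outperforming existing methods.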

Details

Motivation: In dynamic environments, lighting changes, adverse weather, and moving objects disrupt appearance cues, and existing absolute pose regression methods struggle to remain consistent. Method: Uses a hierarchical Transformer with cross-scale attention to fuse geometric details and contextual cues, and learns view-invariant features through semantic supervision. Result: On the TartanAir dataset, the method outperforms existing pose regression methods in challenging scenarios with dynamic objects, illumination changes, and occlusions. Conclusion: Combining multi-scale processing with semantic guidance is an effective strategy for robust visual localization in dynamic environments. Abstract: Visual localization remains challenging in dynamic environments where fluctuating lighting, adverse weather, and moving objects disrupt appearance cues. Despite advances in feature representation, current absolute pose regression methods struggle to maintain consistency under varying conditions. To address this challenge, we propose a framework that synergistically combines multi-scale feature learning with semantic scene understanding. Our approach employs a hierarchical Transformer with cross-scale attention to fuse geometric details and contextual cues, preserving spatial precision while adapting to environmental changes. We improve the performance of this architecture with semantic supervision via neural scene representation during training, guiding the network to learn view-invariant features that encode persistent structural information while suppressing complex environmental interference. Experiments on TartanAir demonstrate that our approach outperforms existing pose regression methods in challenging scenarios with dynamic objects, illumination changes, and occlusions. Our findings show that integrating multi-scale processing with semantic guidance offers a promising strategy for robust visual localization in real-world dynamic environments.

[30] LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4$\times$RTX 4090s

Xijun Wang,Xin Li,Bingchen Li,Zhibo Chen

Main category: cs.CV

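TL;DR: LiftVSR is an efficient video super-resolution framework that combines Dynamic Temporal Attention with an Attention Memory Cache, significantly reducing computational cost while maintaining long-term consistency.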

Details

Motivation: Existing methods fall short in temporal consistency and computational cost, especially on long videos. Method: Proposes the LiftVSR framework, which combines Dynamic Temporal Attention (DTA) and an Attention Memory Cache (AMC) and introduces an asymmetric sampling strategy. Result: Strong results on multiple VSR benchmarks at significantly lower computational cost. Conclusion: LiftVSR balances efficiency and performance, offering a practical solution for video super-resolution. Abstract: Diffusion models have significantly advanced video super-resolution (VSR) by enhancing perceptual quality, largely through elaborately designed temporal modeling to ensure inter-frame consistency. However, existing methods usually suffer from limited temporal coherence and prohibitively high computational costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for long videos. In this work, we propose LiftVSR, an efficient VSR framework that leverages and elevates the image-wise diffusion prior from PixArt-$\alpha$, achieving state-of-the-art results using only 4$\times$RTX 4090 GPUs. To balance long-term consistency and efficiency, we introduce a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within short frame segments (i.e., low complexity), and (ii) Attention Memory Cache (AMC) for long-term temporal modeling across segments (i.e., consistency). Specifically, DTA identifies multiple token flows across frames within multi-head query and key tokens to warp inter-frame contexts in the value tokens. AMC adaptively aggregates historical segment information via a cache unit, ensuring long-term coherence with minimal overhead. To further stabilize the cache interaction during inference, we introduce an asymmetric sampling strategy that mitigates feature mismatches arising from different diffusion sampling steps. Extensive experiments on several typical VSR benchmarks have demonstrated that LiftVSR achieves impressive performance with significantly lower computational costs.

Qi Yan,Brian Zhang,Yutong Zhang,Daniel Yang,Joshua White,Di Chen,Jiachao Liu,Langechuan Liu,Binnan Zhuang,Shaoshuai Shi,Renjie Liao

Main category: cs.CV

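TL;DR: TrajFlow is a flow matching-based motion prediction framework that predicts multi-modal trajectories in a single inference pass, significantly reducing computational overhead and performing strongly on the Waymo dataset.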

Details

Motivation: Autonomous driving needs efficient and accurate multi-modal motion prediction to ensure safety, but existing generative methods are computationally expensive and inefficient. Method: Proposes the TrajFlow framework, which uses flow matching to predict multiple trajectories in a single pass, introduces a ranking loss to improve uncertainty estimation, and uses a self-conditioning training technique to improve generalization. Result: Achieves state-of-the-art performance on the Waymo dataset, validating its efficiency and accuracy. Conclusion: TrajFlow offers an efficient, scalable solution for safety-critical autonomous driving applications. Abstract: Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model's own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website https://traj-flow.github.io/.
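
To make the Plackett-Luce ranking loss concrete, here is a minimal sketch of the negative log-likelihood of an observed ranking under that distribution. The Plackett-Luce model itself is standard: the probability of a ranking is a product of softmax terms over the items not yet ranked. How TrajFlow builds the target ranking (for example, by ordering predicted trajectories by their distance to the ground truth) is not stated in the abstract, so the `ranking` input and the function name are assumptions for illustration.

```python
import math


def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` under a Plackett-Luce model.

    scores:  one real-valued confidence score per candidate trajectory.
    ranking: candidate indices ordered from best to worst (assumed to be
             supplied by some external criterion).
    """
    nll = 0.0
    for position in range(len(ranking)):
        # Scores of the candidates that have not been ranked yet.
        remaining = [scores[j] for j in ranking[position:]]
        # Numerically stable log-sum-exp over the remaining candidates.
        peak = max(remaining)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in remaining))
        nll -= scores[ranking[position]] - log_norm
    return nll


scores = [2.0, 1.0, 0.0]
agreeing = plackett_luce_nll(scores, [0, 1, 2])     # ranking matches the scores
disagreeing = plackett_luce_nll(scores, [2, 1, 0])  # ranking reverses the scores
```

The loss is lower when the scores agree with the target ranking (`agreeing < disagreeing`), so minimizing it pushes the model to assign higher confidence to better trajectories.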

[32] Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs

Bowei Tian,Xuntao Lyu,Meng Liu,Hongyi Wang,Ang Li

Main category: cs.CV

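TL;DR: The paper proposes the Input-Space Linearity Hypothesis (ISLH), which holds that concept-aligned directions originate in the input space and are selectively amplified with network depth, and introduces the Spectral Principal Path (SPP) framework, demonstrating its robustness in multimodal models.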

Details

Motivation: To improve AI transparency and control by shifting attention from neurons to structured semantic directions, extending the Linear Representation Hypothesis (LRH). Method: Proposes the ISLH and the SPP framework, analyzes how deep networks progressively distill linear representations, and verifies multimodal robustness in Vision-Language Models (VLMs). Result: The SPP framework captures how linear representations form in deep networks, and their multimodal robustness is verified in VLMs. Conclusion: The work provides a structured theory of representation formation in deep networks, helping to improve AI robustness, fairness, and transparency. Abstract: High-level representations have become a central focus in enhancing AI transparency and control, shifting attention from individual neurons or circuits to structured semantic directions that align with human-interpretable concepts. Motivated by the Linear Representation Hypothesis (LRH), we propose the Input-Space Linearity Hypothesis (ISLH), which posits that concept-aligned directions originate in the input space and are selectively amplified with increasing depth. We then introduce the Spectral Principal Path (SPP) framework, which formalizes how deep networks progressively distill linear representations along a small set of dominant spectral directions. Building on this framework, we further demonstrate the multimodal robustness of these representations in Vision-Language Models (VLMs). By bridging theoretical insights with empirical validation, this work advances a structured theory of representation formation in deep networks, paving the way for improving AI robustness, fairness, and transparency.

[33] From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

Agnese Taluzzi,Davide Gesualdi,Riccardo Santambrogio,Chiara Plizzari,Francesca Palermo,Simone Mentasti,Matteo Matteucci

Main category: cs.CV

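TL;DR: SceneNet and KnowledgeNet are methods for the HD-EPIC VQA Challenge 2025 that use scene graphs and external commonsense knowledge, respectively, to improve visual question answering; combined, they reach 44.21% accuracy.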

Details

Motivation: To meet the need for fine-grained object interaction modeling and high-level semantic reasoning in complex egocentric visual question answering. Method: SceneNet generates scene graphs with a multi-modal large language model to capture object interactions and spatio-temporal relationships; KnowledgeNet incorporates commonsense knowledge from ConceptNet to strengthen semantic reasoning. Result: Strong results across the seven task categories of the HD-EPIC benchmark, with the combined method reaching 44.21% accuracy. Conclusion: Combining SceneNet and KnowledgeNet effectively improves performance on complex visual question answering tasks. Abstract: This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet's external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework results in an overall accuracy of 44.21% on the challenge, highlighting its effectiveness for complex egocentric VQA tasks.

[34] Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement

Xinyue Niu,Akira Furui

Main category: cs.CV

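TL;DR: Proposes a method for cross-subject electromyography (EMG) pattern recognition that eliminates the need for calibration through feature disentanglement.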

Details

Motivation: Cross-subject EMG pattern recognition is challenging because of inter-subject variability; traditional methods rely on subject-specific calibration data and are unsuitable for large-scale deployment. Method: Uses an end-to-end dual-branch adversarial neural network that disentangles EMG features into pattern-specific and subject-specific components. Result: The model performs well on data from unseen users and outperforms baseline methods. Conclusion: The method offers a new perspective on calibration-free cross-subject EMG recognition and shows potential in areas such as biometric identification. Abstract: Cross-subject electromyography (EMG) pattern recognition faces significant challenges due to inter-subject variability in muscle anatomy, electrode placement, and signal characteristics. Traditional methods rely on subject-specific calibration data to adapt models to new users, an approach that is both time-consuming and impractical for large-scale, real-world deployment. This paper presents an approach to eliminate calibration requirements through feature disentanglement, enabling effective cross-subject generalization. We propose an end-to-end dual-branch adversarial neural network that simultaneously performs pattern recognition and individual identification by disentangling EMG features into pattern-specific and subject-specific components. The pattern-specific components facilitate robust pattern recognition for new users without model calibration, while the subject-specific components enable downstream applications such as task-invariant biometric identification. Experimental results demonstrate that the proposed model achieves robust performance on data from unseen users, outperforming various baseline methods in cross-subject scenarios. Overall, this study offers a new perspective for cross-subject EMG pattern recognition without model calibration and highlights the proposed model's potential for broader applications, such as task-independent biometric systems.

[35] Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection

Duc Thanh Pham,Hong Dang Nguyen,Nhat Minh Nguyen Quoc,Linh Ngo Van,Sang Dinh Viet,Duc Anh Nguyen

Main category: cs.CV

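TL;DR: This paper proposes Hier-DETR, a new framework for incremental object detection (IOD) that exploits Neural Collapse and the hierarchical relations among class labels to achieve efficient and competitive performance.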

Details

Motivation: New objects keep appearing in the real world, so detection models must learn continually without catastrophic forgetting, but existing IOD models are impractical because of limited performance and long inference time. Method: Proposes the Hier-DETR framework, which combines Neural Collapse for handling imbalanced datasets with the hierarchical relations among class labels. Result: Hier-DETR performs well in both efficiency and accuracy. Conclusion: Hier-DETR offers an efficient, high-performing solution to the IOD problem. Abstract: Recently, object detection models have witnessed notable performance improvements, particularly with transformer-based models. However, new objects frequently appear in the real world, requiring detection models to continually learn without suffering from catastrophic forgetting. Although Incremental Object Detection (IOD) has emerged to address this challenge, these existing models are still not practical due to their limited performance and prolonged inference time. In this paper, we introduce a novel framework for IOD, called Hier-DETR: Hierarchical Neural Collapse Detection Transformer, ensuring both efficiency and competitive performance by leveraging Neural Collapse for imbalanced datasets and the hierarchical relations among class labels.

Yibo Cui,Liang Xie,Yu Zhao,Jiawei Sun,Erwei Yin

Main category: cs.CV

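TL;DR: FCA-NIG is a generative framework that automatically constructs navigation instructions with fine-grained cross-modal annotations, addressing the lack of sub-instruction-level and entity-level alignment in existing datasets. The resulting FCA-R2R dataset significantly improves the performance of multiple VLN agents.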

Details

Motivation: Existing VLN datasets lack fine-grained cross-modal alignment annotations (at the sub-instruction and entity levels), which limits the accuracy of navigation decisions. Method: FCA-NIG segments trajectories, detects landmarks, generates instructions, and selects entities to automatically construct complete instruction-trajectory pairs with sub-instruction-trajectory and entity-landmark alignments. Result: The generated FCA-R2R dataset significantly improves the performance of multiple VLN agents, enhancing state awareness and navigation accuracy. Conclusion: FCA-NIG generates high-quality, scalable training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks. Abstract: Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection, generating sub-instruction-trajectory pairs with entity-landmark annotations. Finally, these sub-pairs are aggregated to form a complete instruction-trajectory pair. The framework generates the FCA-R2R dataset, the first large-scale augmentation dataset featuring precise sub-instruction-sub-trajectory and entity-landmark alignments. Extensive experiments demonstrate that training with FCA-R2R significantly improves the performance of multiple state-of-the-art VLN agents, including SF, EnvDrop, RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances agents' state awareness and decision accuracy, while entity-landmark alignment further boosts navigation performance and generalization. These results highlight the effectiveness of FCA-NIG in generating high-quality, scalable training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks.

[37] Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

Chengchao Shen,Hourun Zhu,Gongfan Fang,Jianxin Wang,Xinchao Wang

Main category: cs.CV

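TL;DR: The paper proposes Diversity-Guided MLP Reduction (DGMR), which uses a Gram-Schmidt weight pruning strategy to substantially reduce the parameters of large vision transformers while preserving performance.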

Details

Motivation: Large Transformer models have so many parameters that compute and memory costs become prohibitive; the authors find that MLP modules account for most of the parameters, which calls for an efficient compression method. Method: A Gram-Schmidt weight pruning strategy eliminates redundant neurons in MLP hidden layers while preserving weight diversity so that performance can be better recovered during distillation. Result: Across several large vision transformers, DGMR reduces parameters and FLOPs by more than 57% with minimal performance loss; on EVA-CLIP-E (4.4B) it achieves a 71.5% reduction with no performance degradation. Conclusion: DGMR is an efficient, near-lossless model compression method that significantly lowers the compute and memory requirements of large vision transformers. Abstract: Transformer models exhibit excellent scaling properties: performance improves as model capacity increases. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons of the MLP hidden layer, while preserving weight diversity for better performance recovery during distillation. Compared to the model trained from scratch, our pruned model only requires 0.06% data of LAION-2B (for the training of large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.
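
The abstract names a Gram-Schmidt weight pruning strategy that removes redundant hidden neurons while keeping the remaining weights diverse. The sketch below is one plausible reading of that idea, not the authors' implementation: a greedy, pivoted Gram-Schmidt pass that repeatedly keeps the neuron whose weight vector has the largest component orthogonal to the neurons already kept. The function name `select_diverse_neurons` and the toy matrix are assumptions made for illustration.

```python
import math


def select_diverse_neurons(weights, keep):
    """Return indices of up to `keep` rows chosen by greedy pivoted Gram-Schmidt.

    `weights` is a list of rows, one weight vector per hidden neuron
    (hypothetical layout). Rows that are linear combinations of rows
    already kept have zero residual and are never selected.
    """
    residuals = [[float(x) for x in row] for row in weights]
    selected = []
    while len(selected) < keep:
        # Pick the unselected row with the largest remaining (orthogonal) norm.
        best, best_norm = -1, 0.0
        for i, row in enumerate(residuals):
            if i in selected:
                continue
            norm = math.sqrt(sum(x * x for x in row))
            if norm > best_norm:
                best, best_norm = i, norm
        if best < 0 or best_norm < 1e-12:
            break  # everything left is redundant
        selected.append(best)
        # Remove the chosen direction from every remaining row.
        q = [x / best_norm for x in residuals[best]]
        for i, row in enumerate(residuals):
            if i in selected:
                continue
            dot = sum(a * b for a, b in zip(row, q))
            residuals[i] = [a - dot * b for a, b in zip(row, q)]
    return selected


# Toy example: rows 0 and 1 are parallel, row 3 lies in the span of rows 1 and 2.
hidden_weights = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
kept = select_diverse_neurons(hidden_weights, keep=2)
```

In this toy case the parallel and in-span rows are skipped, so the kept set spans the same subspace as the full layer. The distillation step that the abstract describes for recovering accuracy is outside the scope of the sketch.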

[38] Transformers Meet Hyperspectral Imaging: A Comprehensive Study of Models, Challenges and Open Problems

Guyang Zhang,Waleed Abdulla

Main category: cs.CV

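TL;DR: This is the first end-to-end survey dedicated to Transformer-based hyperspectral image classification; it analyzes more than 300 papers and summarizes how Transformers are applied to HSI and the challenges involved.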

Details

Motivation: Transformers excel at learning long-range dependencies, but their use in hyperspectral imaging (HSI) is not yet mature, so a systematic review of the current state and future directions is needed. Method: The study categorizes each stage of the HSI processing pipeline (e.g., pre-processing, feature extraction, self-attention variants) and contrasts design choices with the unique properties of HSI. Result: It summarizes current progress and challenges (e.g., scarce labeled data, high computational overhead) and proposes future research directions (e.g., lightweight models, interpretability). Conclusion: The survey aims to help researchers select or extend Transformer components suited to next-generation HSI applications. Abstract: Transformers have become the architecture of choice for learning long-range dependencies, yet their adoption in hyperspectral imaging (HSI) is still emerging. We reviewed more than 300 papers published up to 2025 and present the first end-to-end survey dedicated to Transformer-based HSI classification. The study categorizes every stage of a typical pipeline (pre-processing, patch or pixel tokenization, positional encoding, spatial-spectral feature extraction, multi-head self-attention variants, skip connections, and loss design) and contrasts alternative design choices with the unique spatial-spectral properties of HSI. We map the field's progress against persistent obstacles: scarce labeled data, extreme spectral dimensionality, computational overhead, and limited model explainability. Finally, we outline a research agenda prioritizing valuable public data sets, lightweight on-edge models, robustness to illumination and sensor shifts, and intrinsically interpretable attention mechanisms. Our goal is to guide researchers in selecting, combining, or extending Transformer components that are truly fit for purpose for next-generation HSI applications.

[39] Towards Class-wise Fair Adversarial Training via Anti-Bias Soft Label Distillation

Shiji Zhao,Chi Chen,Ranjie Duan,Xizhe Wang,Xingxing Wei

Main category: cs.CV

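TL;DR: The paper proposes ABSLD, which improves adversarial robust fairness by adjusting the smoothness of the teacher model's soft labels, addressing the imbalance in robustness across classes in AT and ARD.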

Details

Motivation: Adversarial Training (AT) and Adversarial Robustness Distillation (ARD) suffer from a robust fairness problem: models are strongly robust against adversarial attacks for some classes and weakly robust for others. The paper investigates the causes and proposes a solution. Method: Anti-Bias Soft Label Distillation (ABSLD) assigns different temperatures to different classes to adjust the smoothness of the teacher's soft labels, reducing the gap in the student's error risk across classes. Result: Experiments show that ABSLD outperforms existing methods on the combined performance of robustness and fairness. Conclusion: ABSLD is an efficient, highly adaptable label-based method that can be combined with sample-based methods and markedly improves adversarial robust fairness. Abstract: Adversarial Training (AT) is widely recognized as an effective approach to enhance the adversarial robustness of Deep Neural Networks. As a variant of AT, Adversarial Robustness Distillation (ARD) has shown outstanding performance in enhancing the robustness of small models. However, both AT and ARD face a robust fairness issue: these models tend to display strong adversarial robustness against some classes (easy classes) while demonstrating weak adversarial robustness against others (hard classes). This paper explores the underlying factors of this problem and points out that the smoothness degree of soft labels for different classes significantly impacts the robust fairness from both empirical observation and theoretical analysis. Based on the above exploration, we propose Anti-Bias Soft Label Distillation (ABSLD) within the Knowledge Distillation framework to enhance the adversarial robust fairness. Specifically, ABSLD adaptively reduces the student's error risk gap between different classes, which is accomplished by adjusting the class-wise smoothness degree of the teacher's soft labels during the training process, and the adjustment is managed by assigning varying temperatures to different classes. Additionally, as a label-based approach, ABSLD is highly adaptable and can be integrated with the sample-based methods. Extensive experiments demonstrate ABSLD outperforms state-of-the-art methods on the comprehensive performance of robustness and fairness.
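
The mechanism the abstract describes, controlling soft-label smoothness through a per-class temperature, can be illustrated with a temperature-scaled softmax. This is a minimal sketch of that general technique only. The `class_temperature` mapping, its values, and which kind of class receives the higher temperature are assumptions for illustration and are not taken from the abstract, and the adaptive update of temperatures during training is not shown.

```python
import math


def soften(logits, temperature):
    """Temperature-scaled softmax: higher temperature gives a smoother label."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 1.0, 0.5]

# Hypothetical per-class temperatures; the assignment is illustrative only.
class_temperature = {"class_a": 2.0, "class_b": 0.5}

smooth_label = soften(teacher_logits, class_temperature["class_a"])
sharp_label = soften(teacher_logits, class_temperature["class_b"])
```

With the same teacher logits, the lower temperature concentrates more probability mass on the top class, which is the knob a class-wise scheme can turn independently for each class.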

[40] Data-Efficient Challenges in Visual Inductive Priors: A Retrospective

Robert-Jan Bruintjes,Attila Lengyel,Osman Semih Kayhan,Davide Zambrano,Nergis Tömen,Hadi Jamali-Rad,Jan van Gemert

Main category: cs.CV

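TL;DR: The paper examines which deep learning methods train models effectively when data is scarce, organizing data-limited challenges to stimulate new methods that improve data efficiency.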

Details

Motivation: To address the performance drop of deep learning models when data is scarce and to encourage new methods that incorporate prior knowledge. Method: Four editions of data-limited challenges were organized, restricting participants to training models from scratch on a small number of samples and prohibiting transfer learning. Result: Successful entries used large model ensembles mixing Transformers and CNNs, heavy data augmentation, and prior-knowledge-based methods. Conclusion: The challenges showed that methods incorporating prior knowledge are effective in data-deficient settings. Abstract: Deep Learning requires large amounts of data to train models that work well. In data-deficient settings, performance can be degraded. We investigate which Deep Learning methods benefit training models in a data-deficient setting, by organizing the "VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning" workshop series, featuring four editions of data-impaired challenges. These challenges address the problem of training deep learning models for computer vision tasks with limited data. Participants are limited to training models from scratch using a low number of training samples and are not allowed to use any form of transfer learning. We aim to stimulate the development of novel approaches that incorporate prior knowledge to improve the data efficiency of deep learning models. Successful challenge entries make use of large model ensembles that mix Transformers and CNNs, as well as heavy data augmentation. Novel prior knowledge-based methods contribute to success in some entries.

[41] SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything

Joost van Dalen,Yuki M. Asano,Marc Russwurm

Main category: cs.CV

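TL;DR: SAMSelect is an algorithm that selects the best three-channel visualization from multispectral images to help marine scientists identify floating marine debris more intuitively.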

Details

Motivation: Floating marine debris is hard to visualize in medium-resolution imagery because of its compositional heterogeneity; experts usually pick band combinations based on experience and heuristics, with no unified standard. Method: SAMSelect uses the Segment Anything Model to select the band or index combination with the highest classification accuracy on a small annotated dataset, assuming that the best segmentation result also gives the best visual information. Result: Tested on Sentinel-2 scenes from Accra, Ghana, and Durban, South Africa, SAMSelect finds new band combinations (e.g., a normalized difference index of B8 and B2) that outperform indices from the literature. Conclusion: SAMSelect is an effective tool for visual interpretation in the marine domain, and its open-source code repository further supports domain scientists. Abstract: This work proposes SAMSelect, an algorithm to obtain a salient three-channel visualization for multispectral images. We develop SAMSelect and show its use for marine scientists visually interpreting floating marine debris in Sentinel-2 imagery. Such debris is notoriously difficult to visualize due to its compositional heterogeneity in medium-resolution imagery. Despite these difficulties, a visual interpretation of imagery showing marine debris remains a common practice by domain experts, who select bands and spectral indices on a case-by-case basis informed by common practices and heuristics. SAMSelect selects the band or index combination that achieves the best classification accuracy on a small annotated dataset through the Segment Anything Model. Its central assumption is that the three-channel visualization that achieves the most accurate segmentation results also provides good visual information for photo-interpretation. We evaluate SAMSelect in three Sentinel-2 scenes containing generic marine debris in Accra, Ghana, and Durban, South Africa, and deployed plastic targets from the Plastic Litter Project. This reveals the potential of new previously unused band combinations (e.g., a normalized difference index of B8, B2), which demonstrate improved performance compared to literature-based indices. We describe the algorithm in this paper and provide an open-source code repository that will be helpful for domain scientists doing visual photo interpretation, especially in the marine field.
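
The abstract mentions a normalized difference index of bands B8 and B2 as one of the combinations found. A normalized difference index is conventionally computed per pixel as (a - b) / (a + b). The sketch below shows that formula on toy reflectance values. The helper name, the nested-list image layout, and the choice to return 0 where both bands are 0 are assumptions, and the search over candidate combinations that SAMSelect performs is not reproduced here.

```python
def normalized_difference(band_a, band_b):
    """Per-pixel (a - b) / (a + b) for two equally sized 2D bands.

    Pixels where a + b == 0 are set to 0.0 (an assumption made here to
    avoid division by zero; other conventions exist).
    """
    index = []
    for row_a, row_b in zip(band_a, band_b):
        index_row = []
        for a, b in zip(row_a, row_b):
            denom = a + b
            index_row.append((a - b) / denom if denom != 0 else 0.0)
        index.append(index_row)
    return index


# Toy 2x2 reflectance patches standing in for Sentinel-2 bands B8 and B2.
b8 = [[0.30, 0.10], [0.00, 0.25]]
b2 = [[0.10, 0.10], [0.00, 0.75]]
ndi = normalized_difference(b8, b2)
```

The result lies in [-1, 1], so it can be rescaled and used as one channel of a three-channel visualization alongside raw bands or other indices.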

[42] A Probability-guided Sampler for Neural Implicit Surface Rendering

Gonçalo Dias Pais,Valter Piedade,Moitreya Chatterjee,Marcus Greiff,Pedro Miraldo

Main category: cs.CV

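TL;DR: The paper proposes a sampling strategy based on an implicit surface representation and a probability density function in a 3D image projection space, combined with a new surface reconstruction loss, improving the rendering and reconstruction accuracy of NeRFs.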

Details

Motivation: Owing to scalability issues, existing NeRF methods cannot train on every possible input, which makes sampling inefficient. The paper aims to improve performance through more targeted sampling and a new loss function. Method: An implicit surface representation is used to model a probability density function in a 3D image projection space, enabling targeted sampling; a new surface reconstruction loss incorporates near-to-surface and empty-space information. Result: Integrating the new methods into existing neural implicit surface renderers yields more accurate and detailed 3D reconstruction and image rendering, especially in regions of interest. Conclusion: The proposed sampling strategy and loss significantly improve NeRF rendering and reconstruction and offer a new direction for future research. Abstract: Several variants of Neural Radiance Fields (NeRFs) have significantly improved the accuracy of synthesized images and surface reconstruction of 3D scenes/objects. In all of these methods, a key characteristic is that none can train the neural network with every possible input data, specifically, every pixel and potential 3D point along the projection rays due to scalability issues. While vanilla NeRFs uniformly sample both the image pixels and 3D points along the projection rays, some variants focus only on guiding the sampling of the 3D points along the projection rays. In this paper, we leverage the implicit surface representation of the foreground scene and model a probability density function in a 3D image projection space to achieve a more targeted sampling of the rays toward regions of interest, resulting in improved rendering. Additionally, a new surface reconstruction loss is proposed for improved performance. This new loss fully explores the proposed 3D image projection space model and incorporates near-to-surface and empty space components. By integrating our novel sampling strategy and novel loss into current state-of-the-art neural implicit surface renderers, we achieve more accurate and detailed 3D reconstructions and improved image rendering, especially for the regions of interest in any given scene.

[43] ECMNet: Lightweight Semantic Segmentation with Efficient CNN-Mamba Network

Feixiang Du,Shengkun Wu

Main category: cs.CV

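TL;DR: The paper proposes a lightweight hybrid CNN-Mamba network (ECMNet) for semantic segmentation that combines the strengths of CNNs and Mamba to address insufficient global context modeling.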

Details

Motivation: Although CNNs and Transformers perform well in semantic segmentation, global context modeling remains inadequate. Mamba has shown advantages in modeling long-range dependencies in vision tasks, so the authors combine CNNs and Mamba to offset each other's weaknesses. Method: The design comprises an Enhanced Dual-Attention Block (EDAB) as a lightweight bottleneck, a Multi-Scale Attention Unit (MSAU) to strengthen feature representation, and a Mamba-enhanced Feature Fusion Module (FFM) for multi-level feature fusion. Result: On two representative datasets, the model reaches 70.6% mIoU on Cityscapes and 73.6% mIoU on CamVid, with 0.87M parameters and 8.27G FLOPs. Conclusion: ECMNet strikes a good balance between accuracy and efficiency, validating the combination of CNNs and Mamba. Abstract: In the past decade, Convolutional Neural Networks (CNNs) and Transformers have achieved wide application in semantic segmentation tasks. Although CNNs with Transformer models greatly improve performance, the global context modeling remains inadequate. Recently, Mamba achieved great potential in vision tasks, showing its advantages in modeling long-range dependency. In this paper, we propose a lightweight Efficient CNN-Mamba Network for semantic segmentation, dubbed ECMNet. ECMNet combines CNN with Mamba skillfully in a capsule-based framework to address their complementary weaknesses. Specifically, we design an Enhanced Dual-Attention Block (EDAB) for a lightweight bottleneck. In order to improve the representation ability of features, we devise a Multi-Scale Attention Unit (MSAU) to integrate multi-scale feature aggregation, spatial aggregation and channel aggregation. Moreover, a Mamba-enhanced Feature Fusion Module (FFM) merges features from different levels, significantly enhancing segmentation accuracy. Extensive experiments on two representative datasets demonstrate that the proposed model excels in accuracy and efficiency balance, achieving 70.6% mIoU on Cityscapes and 73.6% mIoU on CamVid test datasets, with 0.87M parameters and 8.27G FLOPs on a single RTX 3090 GPU platform.

[44] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

Yang Bai,Liudi Yang,George Eskandar,Fengyi Shen,Dong Chen,Mohammad Altillawi,Ziyuan Liu,Gitta Kutyniok

Main category: cs.CV

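TL;DR: RoboSwap proposes a video editing framework that combines GANs and diffusion models to swap robot arms using unpaired data, addressing data scarcity in cross-platform robot learning.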

Details

Motivation: Despite progress in generative models for video synthesis and editing, the scarcity of high-quality datasets limits cross-platform generalization in video-conditioned robot learning. Method: RoboSwap segments the robot arm, trains an unpaired GAN to translate it, and then applies a diffusion model to improve the video's coherence and motion realism. Result: Experiments show that RoboSwap outperforms existing video and image editing models on three benchmarks, with better structural coherence and motion consistency. Conclusion: RoboSwap offers a reliable cross-platform data generation solution for robot learning. Abstract: Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.

[45] SurfR: Surface Reconstruction with Multi-scale Attention

Siddhant Ranade,Gonçalo Dias Pais,Ross Tyler Whitaker,Jacinto C. Nascimento,Pedro Miraldo,Srikumar Ramalingam

Main category: cs.CV

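TL;DR: The paper proposes a fast and accurate surface reconstruction algorithm for unorganized point clouds based on an implicit representation, achieving the best speed-accuracy trade-off through three key contributions.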

Details

Motivation: Existing learning methods either use small models trained per object (rich detail but poor generalization) or large models that generalize well but lack detail and are slow at inference. Method: (1) a lazy query to speed up reconstruction; (2) a parallel multi-scale grid representation to extract robust features; (3) cross-scale attention to improve reconstruction. Result: The method is faster than all baselines while performing only slightly below the state of the art. Conclusion: The algorithm achieves the best balance between speed and accuracy and suits reconstruction of general 3D shapes. Abstract: We propose a fast and accurate surface reconstruction algorithm for unorganized point clouds using an implicit representation. Recent learning methods are either single-object representations with small neural models that allow for high surface details but require per-object training or generalized representations that require larger models and generalize to newer shapes but lack details, and inference is slow. We propose a new implicit representation for general 3D shapes that is faster than all the baselines at their optimum resolution, with only a marginal loss in performance compared to the state-of-the-art. We achieve the best accuracy-speed trade-off using three key contributions. Many implicit methods extract features from the point cloud to classify whether a query point is inside or outside the object. First, to speed up the reconstruction, we show that this feature extraction does not need to use the query point at an early stage (lazy query). Second, we use a parallel multi-scale grid representation to develop robust features for different noise levels and input resolutions. Finally, we show that attention across scales can provide improved reconstruction results.

[46] Orientation Matters: Making 3D Generative Models Orientation-Aligned

Yichong Lu,Yuzhuo Tian,Zijin Jiang,Yikun Zhao,Yuanbo Yang,Hao Ouyang,Haoji Hu,Huimin Yu,Yujun Shen,Yiyi Liao

Main category: cs.CV

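TL;DR: The paper introduces the task of orientation-aligned 3D object generation and builds the Objaverse-OA dataset, fine-tuning existing models to generate consistently oriented objects across categories.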

Details

Motivation: Existing 3D generative models produce misaligned results because of inconsistent training data, limiting their use in downstream tasks. Method: The authors construct the Objaverse-OA dataset and fine-tune models based on multi-view diffusion and 3D variational autoencoder frameworks. Result: Experiments show the method outperforms post-hoc alignment approaches and enables applications such as zero-shot orientation estimation and efficient rotation manipulation. Conclusion: Orientation-aligned 3D generation significantly improves the models' practicality in downstream tasks. Abstract: Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.

Zhiyi Zhu,Xiaoyu Wu,Youwei Lu

Main category: cs.CV

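TL;DR: The paper proposes TMCCL, a new multimodal video memorability prediction model that strengthens motion feature representation through a text-motion cross-modal contrastive loss, and applies Memorability Weighted Correction (MWCVS) to video summarization to reduce subjectivity.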

Details

Motivation: Existing models do not fully exploit motion features when predicting video memorability, and motion feature representation is compromised during fine-tuning because of a lack of labeled data. Method: TMCCL uses text description similarity to build positive and negative motion sample sets, strengthening motion feature representation; MWCVS applies memorability prediction to video summarization. Result: TMCCL achieves state-of-the-art performance on two video memorability prediction datasets; MWCVS performs well on video summarization. Conclusion: TMCCL and MWCVS demonstrate the potential of video memorability prediction, particularly for improving motion feature representation and reducing subjectivity in video summarization. Abstract: Video memorability refers to the ability of videos to be recalled after viewing, playing a crucial role in creating content that remains memorable. Existing models typically focus on extracting multimodal features to predict video memorability scores but often fail to fully utilize motion cues. The representation of motion features is compromised during the fine-tuning phase of the motion feature extractor due to a lack of labeled data. In this paper, we introduce the Text-Motion Cross-modal Contrastive Loss (TMCCL), a multimodal video memorability prediction model designed to enhance the representation of motion features. We tackle the challenge of improving motion feature representation by leveraging text description similarities across videos to establish positive and negative motion sample sets for a given target. This enhancement allows the model to learn similar feature representations for semantically related motion content, resulting in more accurate memorability predictions. Our model achieves state-of-the-art performance on two video memorability prediction datasets. Moreover, the potential applications of video memorability prediction have been underexplored. To address this gap, we present Memorability Weighted Correction for Video Summarization (MWCVS), using video memorability prediction to reduce subjectivity in video summarization labels. Experimental results on two video summarization datasets demonstrate the effectiveness of MWCVS, showcasing the promising applications of video memorability prediction.

[48] Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping

Peter Grönquist,Stepan Tulyakov,Dengxin Dai

Main category: cs.CV

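TL;DR: The paper proposes a lightweight Neural Physical Model (NPM) for color consistency across multiple cameras that is adaptable and computationally cheap.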

Details

Motivation: Color consistency across cameras is essential for image fusion and ISP compatibility, but existing methods are limited by changing illumination, high computational cost, or impractical requirements. Method: NPM simulates raw images under specified illumination to estimate the transformation between devices, and supports initialization from physical measurements and training with or without paired data. Result: On the NUS and BeyondRGB datasets, NPM outperforms existing methods and delivers robust color consistency across sensors and optical systems. Conclusion: NPM is an efficient, adaptable solution for color consistency in multi-camera systems. Abstract: Achieving consistent color reproduction across multiple cameras is essential for seamless image fusion and Image Processing Pipeline (ISP) compatibility in modern devices, but it is a challenging task due to variations in sensors and optics. Existing raw-to-raw conversion methods face limitations such as poor adaptability to changing illumination, high computational costs, or impractical requirements such as simultaneous camera operation and overlapping fields-of-view. We introduce the Neural Physical Model (NPM), a lightweight, physically-informed approach that simulates raw images under specified illumination to estimate transformations between devices. The NPM effectively adapts to varying illumination conditions, can be initialized with physical measurements, and supports training with or without paired data. Experiments on public datasets like NUS and BeyondRGB demonstrate that NPM outperforms recent state-of-the-art methods, providing robust chromatic consistency across different sensors and optical systems.

[49] LLaVA-c: Continual Improved Visual Instruction Tuning

Wenzhuo Liu,Fei Zhu,Haiyang Guo,Longhui Wei,Cheng-Lin Liu

Main category: cs.CV

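TL;DR: The paper proposes a method that improves on LLaVA-1.5, using spectral-aware consolidation and unsupervised inquiry regularization to address task balancing and base model degradation; experiments show strong continual learning performance.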

Details

Motivation: Multitask learning suffers from task balancing and expansion costs, while continual learning acquires knowledge incrementally but tends to degrade the base model. The paper aims to solve these problems. Method: Spectral-aware consolidation and unsupervised inquiry regularization are added to LLaVA-1.5 to improve task balance and prevent base model degradation. Result: LLaVA-c performs well in both continual pretraining and fine-tuning, and task-by-task continual learning matches multitask joint learning. Conclusion: The method is simple and effective and shows for the first time that task-by-task continual learning can match or surpass multitask joint learning. Abstract: Multimodal models like LLaVA-1.5 achieve state-of-the-art visual understanding through visual instruction tuning on multitask datasets, enabling strong instruction-following and multimodal performance. However, multitask learning faces challenges such as task balancing, requiring careful adjustment of data proportions, and expansion costs, where new tasks risk catastrophic forgetting and need costly retraining. Continual learning provides a promising alternative to acquiring new knowledge incrementally while preserving existing capabilities. However, current methods prioritize task-specific performance, neglecting base model degradation from overfitting to specific instructions, which undermines general capabilities. In this work, we propose a simple but effective method with two modifications on LLaVA-1.5: spectral-aware consolidation for improved task balance and unsupervised inquiry regularization to prevent base model degradation. We evaluate both general and task-specific performance across continual pretraining and fine-tuning. Experiments demonstrate that LLaVA-c consistently enhances standard benchmark performance and preserves general capabilities. For the first time, we show that task-by-task continual learning can achieve results that match or surpass multitask joint learning. The code will be publicly released.

[50] ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

Juan Yeo,Soonwoo Cha,Jiwoo Song,Hyunbin Jin,Taesup Kim

Main category: cs.CV

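TL;DR: The paper proposes ATAS, a self-distillation method that improves CLIP's fine-grained alignment and semantic coherence without extra modules or supervised fine-tuning.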

Details

Motivation: CLIP falls short on fine-grained, region-level understanding, which limits its effectiveness on dense prediction tasks. Method: Any-to-Any Self-Distillation (ATAS) uses the model's own knowledge to enhance semantic coherence and fine-grained alignment simultaneously across all representation levels. Result: On open-vocabulary object detection and semantic segmentation, ATAS delivers substantial gains over baseline CLIP models. Conclusion: ATAS confirms the importance of jointly maintaining semantic coherence and fine-grained alignment for open-vocabulary dense prediction. Abstract: Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.

[51] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities

Hugo Porta,Emanuele Dalsasso,Jessica L. McCarty,Devis Tuia

Main category: cs.CV

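TL;DR: In 2023 Canada experienced one of its most severe wildfire seasons in recent history, highlighting the effect of climate change on the length and severity of fire seasons. The study proposes a high-resolution (100 m) wildfire forecasting approach based on multi-modal data and shows that it outperforms single-modal approaches.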

Details

Motivation: Climate change is lengthening and intensifying wildfire seasons, so communities in boreal ecosystems urgently need better wildfire management tools. High-resolution wildfire probability maps are an important tool for understanding the likelihood of wildfire occurrence and future severity. Method: Two deep learning architectures are evaluated for multi-modal, high-resolution wildfire forecasting using high-resolution multi-spectral satellite images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis data). Result: Multi-modal temporal inputs outperform single-modal inputs on all metrics, reaching an F1 score of 60.3% on the 2023 wildfire season (data unseen during training). Conclusion: Multi-modal deep learning models show potential for wildfire forecasting at high resolution and continental scale. Abstract: Canada experienced in 2023 one of the most severe wildfire seasons in recent history, causing damage across ecosystems, destroying communities, and emitting large quantities of CO2. This extreme wildfire season is symptomatic of a climate-change-induced increase in the length and severity of the fire season that affects the boreal ecosystem. Therefore, it is critical to empower wildfire management in boreal communities with better mitigation solutions. Wildfire probability maps represent an important tool for understanding the likelihood of wildfire occurrence and the potential severity of future wildfires. The massive increase in the availability of Earth observation data has enabled the development of deep learning-based wildfire forecasting models, aiming at providing precise wildfire probability maps at different spatial and temporal scales. A main limitation of such methods is their reliance on coarse-resolution environmental drivers and satellite products, leading to wildfire occurrence prediction of reduced resolution, typically around 0.1°. This paper presents a benchmark dataset, CanadaFireSat, and baseline methods for high-resolution (100 m) wildfire forecasting across Canada, leveraging multi-modal data from high-resolution multi-spectral satellite images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis data). Our experiments consider two major deep learning architectures. We observe that using multi-modal temporal inputs outperforms single-modal temporal inputs across all metrics, achieving a peak performance of 60.3% in F1 score for the 2023 wildfire season, a season never seen during model training. This demonstrates the potential of multi-modal deep learning models for wildfire forecasting at high-resolution and continental scale.

[52] VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism

Congzhi Zhang,Jiawei Peng,Zhenglin Wang,Yilong Lai,Haowen Sun,Heng Chang,Fei Ma,Weijiang Yu

Main category: cs.CV

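TL;DR: VReST is a training-free method that improves large vision-language models on complex visual reasoning through Monte Carlo Tree Search and a self-reward mechanism.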

Details

Motivation: Large vision-language models perform well on multimodal tasks but remain limited in complex visual reasoning, especially with Chain-of-Thought prompting. Method: VReST builds a reasoning tree with Monte Carlo Tree Search and evaluates the quality of reasoning steps with a self-reward mechanism, without additional models. Result: VReST surpasses existing methods on three multimodal mathematical reasoning benchmarks, achieving state-of-the-art performance. Conclusion: VReST confirms that test-time scaling laws hold in multimodal tasks, offering a new direction for future research. Abstract: Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.

[53] MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning

Mohammadreza Salehi,Shashanka Venkataramanan,Ioana Simion,Efstratios Gavves,Cees G. M. Snoek,Yuki M Asano

Main category: cs.CV

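TL;DR: The paper proposes a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations, improving robustness in dynamic scenes and under occlusion.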

Details

Motivation: Existing methods rely on static augmentations and struggle with object deformation, occlusion, and camera motion, leading to temporally inconsistent feature learning. Method: An off-the-shelf point tracker extracts long-range motion trajectories; feature clustering is optimized with a momentum-encoder-based optimal transport mechanism, and cluster assignments are propagated along tracked points to ensure temporal consistency. Result: Performance improves by 1% to 6% on six image and video datasets and four evaluation benchmarks. Conclusion: Using motion as an implicit supervisory signal, the method learns representations that generalize across frames and markedly improves performance in dynamic scenes. Abstract: Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve state-of-the-art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: https://github.com/SMSD75/MoSiC/tree/main
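
The abstract refers to an optimal transport mechanism for feature clustering without giving details. A common way to realize this in self-supervised clustering is Sinkhorn-Knopp normalization, which turns a point-to-cluster similarity matrix into soft assignments that use every cluster about equally. The sketch below shows that general technique under stated assumptions; whether MoSiC uses this exact formulation is not confirmed by the abstract, and the function name, `eps`, `n_iters`, and the toy scores are all illustrative.

```python
import math


def sinkhorn_assign(scores, eps=0.5, n_iters=100):
    """Balanced soft cluster assignments via Sinkhorn-Knopp iterations.

    scores[i][j] is the similarity of point i to cluster j. Each returned
    row sums to 1, and each column sums to roughly n / k, so no cluster
    is left empty or absorbs everything.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - top) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize columns so each cluster holds total mass 1 / k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Normalize rows so each point holds total mass 1 / n.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    return [[v * n for v in row] for row in q]


# Four tracked points that all prefer cluster 0 before balancing.
toy_scores = [[2.0, 1.0], [2.0, 1.0], [1.8, 1.2], [1.5, 1.4]]
assignments = sinkhorn_assign(toy_scores)
```

Although every point scores higher for cluster 0, the balanced solution pushes the least confident points toward cluster 1. In a tracking-based setup, the assignment computed for a point in one frame could then serve as the target for the same tracked point in other frames.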

[54] ArrowPose: Segmentation, Detection, and 5 DoF Pose Estimation Network for Colorless Point Clouds

Frederik Hagelskjaer

Main category: cs.CV

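TL;DR: The paper presents a fast detection and 5 DoF pose estimation network for colorless point clouds. The pose is computed from the object's center and top points, which are predicted by a neural network. Trained on synthetic data and tested on a benchmark dataset, the network outperforms all colorless methods, with an inference time of only 250 ms.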

Details

Motivation: To make object pose estimation in colorless point clouds both fast and accurate. Method: A neural network, trained on synthetic data, predicts the object's center and top points. Result: On a benchmark dataset the method outperforms other colorless methods, with an inference time of only 250 ms. Conclusion: The method is efficient and practical for pose estimation in colorless point clouds. Abstract: This paper presents a fast detection and 5 DoF (Degrees of Freedom) pose estimation network for colorless point clouds. The pose estimation is calculated from center and top points of the object, predicted by the neural network. The network is trained on synthetic data, and tested on a benchmark dataset, where it demonstrates state-of-the-art performance and outperforms all colorless methods. The network is able to run inference in only 250 milliseconds making it usable in many scenarios. Project page with code at arrowpose.github.io
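
The abstract says the pose is calculated from the predicted center and top points. A minimal geometric sketch of that step, assuming the two points are already available as 3D coordinates: the center gives the three translational degrees of freedom, and the unit vector from center to top gives the two rotational ones, with rotation about that axis left undetermined. The function name and the example coordinates are assumptions, not taken from the paper.

```python
import math


def pose_from_center_and_top(center, top):
    """Return (position, unit_axis) for a 5 DoF pose.

    position: the object's center (3 DoF).
    unit_axis: direction from center to top (2 DoF); rotation about
    this axis is not determined by two points.
    """
    direction = [t - c for t, c in zip(top, center)]
    length = math.sqrt(sum(d * d for d in direction))
    if length == 0.0:
        raise ValueError("center and top coincide; the axis is undefined")
    unit_axis = tuple(d / length for d in direction)
    position = tuple(float(c) for c in center)
    return position, unit_axis


# Hypothetical network outputs, in the sensor frame.
position, axis = pose_from_center_and_top((1.0, 2.0, 3.0), (1.0, 2.0, 5.0))
```

This representation suits rotationally symmetric objects, where the missing sixth degree of freedom carries no information.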

[55] TraGraph-GS: Trajectory Graph-based Gaussian Splatting for Arbitrary Large-Scale Scene Rendering

Xiaohan Zhang,Sitong Wang,Yushen Yan,Yi Yang,Mingda Xu,Qi Liu

Main category: cs.CV

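TL;DR: The paper proposes TraGraph-GS, which uses a trajectory graph to address the challenges of novel view synthesis in large-scale scenes, improving rendering precision and efficiency.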

Details

Motivation: When partitioning and merging large-scale scenes, existing methods adapt poorly to arbitrary camera trajectories and suffer from Gaussian overlap, which degrades rendering quality. Method: A graph-based scene partitioning method combined with a regularization constraint and a progressive rendering strategy, improving the rendering of textures and distant objects. Result: Strong performance on four aerial and four ground datasets, with average PSNR gains of 1.86 dB (aerial) and 1.62 dB (ground). Conclusion: TraGraph-GS markedly improves rendering quality and efficiency for large-scale scenes and addresses the limitations of existing methods. Abstract: High-quality novel view synthesis for large-scale scenes presents a challenging dilemma in 3D computer vision. Existing methods typically partition large scenes into multiple regions, reconstruct a 3D representation using Gaussian splatting for each region, and eventually merge them for novel view rendering. They can accurately render specific scenes, yet they do not generalize effectively for two reasons: (1) rigid spatial partition techniques struggle with arbitrary camera trajectories, and (2) the merging of regions results in Gaussian overlap that distorts texture details. To address these challenges, we propose TraGraph-GS, leveraging a trajectory graph to enable high-precision rendering for arbitrarily large-scale scenes. We present a spatial partitioning method for large-scale scenes based on graphs, which incorporates a regularization constraint to enhance the rendering of textures and distant objects, as well as a progressive rendering strategy to mitigate artifacts caused by Gaussian overlap. Experimental results demonstrate its superior performance on both four aerial and four ground datasets and highlight its remarkable efficiency: our method achieves an average improvement of 1.86 dB in PSNR on aerial datasets and 1.62 dB on ground datasets compared to state-of-the-art approaches.

[56] SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

Mengjiao Ma,Qi Ma,Yue Li,Jiahuan Cheng,Runyi Yang,Bin Ren,Nikola Popovic,Mingqiang Wei,Nicu Sebe,Luc Van Gool,Theo Gevers,Martin R. Oswald,Danda Pani Paudel

Main category: cs.CV

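TL;DR: The paper proposes the first large-scale benchmark that systematically evaluates language-grounded 3D Gaussian Splatting (3DGS) methods directly in 3D space, and introduces a new dataset, GaussianWorld-49K.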

Details

Motivation: Most existing methods are evaluated only on 2D views, which limits insight into holistic 3D scene understanding. Method: A benchmark that evaluates three groups of methods (per-scene optimization-based, per-scene optimization-free, and generalizable), together with a new dataset. Result: The generalizable approach performs best, enabling fast inference on novel scenes and superior segmentation performance. Conclusion: Generalizable methods combined with strong data priors markedly improve 3DGS scene understanding. Abstract: 3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting work falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most of them are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, limiting ability and insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate the generalizable approach could harness strong data priors. Our codes, benchmark, and datasets will be made public to accelerate research in generalizable 3DGS scene understanding.

[57] Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces

Dieuwertje Alblas,Patryk Rygiel,Julian Suk,Kaj O. Kappe,Marieke Hofman,Christoph Brune,Kak Khee Yeung,Jelmer M. Wolterink

Main category: cs.CV

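TL;DR: Proposes an SE(3)-symmetric transformer model for predicting abdominal aortic aneurysm (AAA) growth; by preserving the anatomical structure and geometric fidelity of the vascular surface, it improves prediction accuracy.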

Details

Motivation: Current AAA surveillance strategies are based only on maximum diameter and ignore the complex relation between 3D shape and growth, so standardized surveillance intervals may be unsuitable. Personalized growth prediction could improve surveillance strategies. Method: An SE(3)-symmetric transformer model predicts AAA growth directly on the vascular model surface, enriched with local multi-physical features. The training data consist of 113 CTA scans from 24 patients. Result: The model predicts AAA growth to the next scan with a median diameter error of 1.18 mm and identifies patients who will become eligible for elective repair within two years with an accuracy of 0.93. Generalization is also evaluated on an external validation set. Conclusion: Local directional AAA growth prediction is feasible and may support personalized surveillance strategies. Abstract: Abdominal aortic aneurysms (AAAs) are progressive focal dilatations of the abdominal aorta. AAAs may rupture, with a survival rate of only 20%. Current clinical guidelines recommend elective surgical repair when the maximum AAA diameter exceeds 55 mm in men or 50 mm in women. Patients that do not meet these criteria are periodically monitored, with surveillance intervals based on the maximum AAA diameter. However, this diameter does not take into account the complex relation between the 3D AAA shape and its growth, making standardized intervals potentially unfit. Personalized AAA growth predictions could improve monitoring strategies. We propose to use an SE(3)-symmetric transformer model to predict AAA growth directly on the vascular model surface enriched with local, multi-physical features. In contrast to other works which have parameterized the AAA shape, this representation preserves the vascular surface's anatomical structure and geometric fidelity. We train our model using a longitudinal dataset of 113 computed tomography angiography (CTA) scans of 24 AAA patients at irregularly sampled intervals. After training, our model predicts AAA growth to the next scan moment with a median diameter error of 1.18 mm. We further demonstrate our model's utility to identify whether a patient will become eligible for elective repair within two years (acc = 0.93). Finally, we evaluate our model's generalization on an external validation set consisting of 25 CTAs from 7 AAA patients from a different hospital. Our results show that local directional AAA growth prediction from the vascular surface is feasible and may contribute to personalized surveillance strategies.

[58] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba

Yuhang Wang,Jun Li,Zhijian Wu,Jianhua Xu

Main category: cs.CV

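TL;DR: InceptionMamba is a new backbone architecture that addresses the limitations of InceptionNeXt with orthogonal band convolutions and a bottleneck Mamba module, achieving better spatial modeling and global context modeling and strong results on classification and downstream tasks.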

Details

Motivation: InceptionNeXt performs well on image classification and downstream tasks, but its one-dimensional strip convolutions limit its ability to capture spatial dependencies, and the locality of convolution operations hinders global context modeling. Method: The InceptionMamba architecture replaces the traditional one-dimensional strip convolutions with orthogonal band convolutions and achieves global context modeling through a bottleneck Mamba module. Result: InceptionMamba achieves state-of-the-art performance on classification and downstream tasks with superior parameter and computational efficiency. Conclusion: Improved spatial modeling and global context modeling markedly raise performance; the authors state that the source code will be released. Abstract: Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it suffers from limited ability to capture spatial dependencies along different dimensions and fails to fully explore spatial modeling in local neighborhoods. Besides, inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at https://github.com/Wake1021/InceptionMamba.

[59] RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation

Jiayi Song,Kaiyu Li,Xiangyong Cao,Deyu Meng

Main category: cs.CV

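TL;DR: The paper proposes the RS-MTDF framework, which uses pre-trained Vision Foundation Models (VFMs) as multiple teachers and applies feature-level distillation to improve semi-supervised remote sensing image segmentation, reaching state-of-the-art results on several datasets.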

Details

Motivation: Semantic segmentation of remote sensing images depends on large amounts of annotated data, which are expensive to obtain. Semi-supervised methods reduce this dependency, but existing methods generalize poorly because of the distribution mismatch between labeled and unlabeled data. Method: The RS-MTDF framework uses frozen VFMs (e.g., DINOv2 and CLIP) as multiple teachers, aligns student features through feature-level distillation, and fuses the distilled knowledge into the student decoder. Result: Strong performance on the ISPRS Potsdam, LoveDA, and DeepGlobe datasets; on LoveDA in particular, it outperforms existing methods across label ratios and obtains the highest IoU in most semantic categories. Conclusion: Multi-teacher VFM guidance markedly improves generalization and semantic understanding in remote sensing segmentation, and ablation studies validate each module. Abstract: Semantic segmentation in remote sensing images is crucial for various applications, yet its performance is heavily reliant on large-scale, high-quality pixel-wise annotations, which are notoriously expensive and time-consuming to acquire. Semi-supervised semantic segmentation (SSS) offers a promising alternative to mitigate this data dependency. However, existing SSS methods often struggle with the inherent distribution mismatch between limited labeled data and abundant unlabeled data, leading to suboptimal generalization. We propose that Vision Foundation Models (VFMs), pre-trained on vast and diverse datasets, possess robust generalization capabilities that can effectively bridge this distribution gap and provide strong semantic priors for SSS. Inspired by this, we introduce RS-MTDF (Multi-Teacher Distillation and Fusion), a novel framework that leverages the powerful semantic knowledge embedded in VFMs to guide semi-supervised learning in remote sensing. Specifically, RS-MTDF employs multiple frozen VFMs (e.g., DINOv2 and CLIP) as expert teachers, utilizing feature-level distillation to align student features with their robust representations. To further enhance discriminative power, the distilled knowledge is seamlessly fused into the student decoder. Extensive experiments on three challenging remote sensing datasets (ISPRS Potsdam, LoveDA, and DeepGlobe) demonstrate that RS-MTDF consistently achieves state-of-the-art performance. Notably, our method outperforms existing approaches across various label ratios on LoveDA and secures the highest IoU in the majority of semantic categories. These results underscore the efficacy of multi-teacher VFM guidance in significantly enhancing both generalization and semantic understanding for remote sensing segmentation. Ablation studies further validate the contribution of each proposed module.

[60] Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting

Keyi Liu,Weidong Yang,Ben Fei,Ying He

Main category: cs.CV

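TL;DR: Gaussian2Scene is a new self-supervised learning framework that uses 3D Gaussian Splatting (3DGS) for point cloud pre-training, addressing existing methods' reliance on implicit scene representations and their high memory demands.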

Details

Motivation: Existing scene-level self-supervised learning methods rely on RGB-D images as reconstruction signals, but they are limited by implicit representations and high memory demands, and they struggle to capture 3D geometric structure. Method: The Gaussian2Scene framework uses 3DGS for efficient, explicit 3D scene reconstruction and trains in two stages: the first stage learns 2D and 3D scene representations, and the second stage supervises learning with reconstructed point clouds and the geometric locations of Gaussian primitives. Result: Outperforms existing pre-training methods on several downstream 3D object detection tasks. Conclusion: Through explicit 3D reconstruction and cross-modal learning, Gaussian2Scene improves the model's geometric understanding and offers a more efficient pre-training scheme for 3D vision tasks. Abstract: Self-supervised learning (SSL) for point cloud pre-training has become a cornerstone for many 3D vision tasks, enabling effective learning from large-scale unannotated data. At the scene level, existing SSL methods often incorporate volume rendering into the pre-training framework, using RGB-D images as reconstruction signals to facilitate cross-modal learning. This strategy promotes alignment between 2D and 3D modalities and enables the model to benefit from rich visual cues in the RGB-D inputs. However, these approaches are limited by their reliance on implicit scene representations and high memory demands. Furthermore, since their reconstruction objectives are applied only in 2D space, they often fail to capture underlying 3D geometric structures. To address these challenges, we propose Gaussian2Scene, a novel scene-level SSL framework that leverages the efficiency and explicit nature of 3D Gaussian Splatting (3DGS) for pre-training. The use of 3DGS not only alleviates the computational burden associated with volume rendering but also supports direct 3D scene reconstruction, thereby enhancing the geometric understanding of the backbone network. Our approach follows a progressive two-stage training strategy. In the first stage, a dual-branch masked autoencoder learns both 2D and 3D scene representations. In the second stage, we initialize training with reconstructed point clouds and further supervise learning using the geometric locations of Gaussian primitives and rendered RGB images. This process reinforces both geometric and cross-modal learning. We demonstrate the effectiveness of Gaussian2Scene across several downstream 3D object detection tasks, showing consistent improvements over existing pre-training methods.

[61] Landsat-Bench: Datasets and Benchmarks for Landsat Foundation Models

Isaac Corley,Lakshay Sharma,Ruth Crasto

Main category: cs.CV

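TL;DR: Landsat-Bench is a suite of benchmarks built on Landsat imagery for evaluating Geospatial Foundation Models (GFMs); results show that GFMs pretrained on SSL4EO-L outperform ImageNet pretraining.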

Details

Motivation: The lack of benchmarks for Landsat data constrains progress on Landsat-based Geospatial Foundation Models. Method: Introduce Landsat-Bench, which comprises three benchmarks (EuroSAT-L, BigEarthNet-L, and LC100-L), and adopt standardized evaluation methods. Result: SSL4EO-L pretrained GFMs outperform ImageNet pretraining on downstream tasks, with gains of +4% OA on EuroSAT-L and +5.1% mAP on BigEarthNet-L. Conclusion: Landsat-Bench provides effective benchmarks for Landsat data, and SSL4EO-L pretrained GFMs extract better representations. Abstract: The Landsat program offers over 50 years of globally consistent Earth imagery. However, the lack of benchmarks for this data constrains progress towards Landsat-based Geospatial Foundation Models (GFM). In this paper, we introduce Landsat-Bench, a suite of three benchmarks with Landsat imagery that adapt from existing remote sensing datasets -- EuroSAT-L, BigEarthNet-L, and LC100-L. We establish baseline and standardized evaluation methods across both common architectures and Landsat foundation models pretrained on the SSL4EO-L dataset. Notably, we provide evidence that SSL4EO-L pretrained GFMs extract better representations for downstream tasks in comparison to ImageNet, including performance gains of +4% OA and +5.1% mAP on EuroSAT-L and BigEarthNet-L.

[62] HomographyAD: Deep Anomaly Detection Using Self Homography Learning

Jongyub Seok,Chanjin Kang

Main category: cs.CV

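TL;DR: Proposes HomographyAD, a new method based on an ImageNet-pretrained network that addresses the poor performance of existing anomaly detection methods in real-world industrial environments.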

Details

Motivation: Existing anomaly detection methods perform well only on fully aligned datasets, whereas data from real-world industrial environments are usually not aligned, so a method better suited to practical settings is needed. Method: First align the input foreground with a deep homography estimation method, then fine-tune the model through self homography learning to learn shape information from normal samples, and finally detect anomalies by measuring how far the features of a test sample lie from the distribution of normal features. Result: Experiments show that the method improves the performance of existing anomaly detection approaches. Conclusion: HomographyAD is an effective anomaly detection method for real-world industrial environments. Abstract: Anomaly detection (AD) is a task that distinguishes normal and abnormal data, which is important for applying automation technologies in manufacturing facilities. For the MVTec dataset, a representative AD dataset for industrial environments, many recent works have shown remarkable performance. However, existing anomaly detection works are limited in that they show good performance only on fully-aligned datasets, unlike real-world industrial environments. To solve this limitation, we propose HomographyAD, a novel deep anomaly detection methodology based on the ImageNet-pretrained network, which is specially designed for actual industrial datasets. Specifically, we first suggest input foreground alignment using the deep homography estimation method. In addition, we fine-tune the model by self homography learning to learn additional shape information from normal samples. Finally, we conduct anomaly detection based on the measure of how far the feature of a test sample is from the distribution of the extracted normal features. By applying our proposed method to various existing AD approaches, we show performance enhancement through extensive experiments.

[63] A PDE-Based Image Dehazing Method via Atmospheric Scattering Theory

Zhuoran Zheng

Main category: cs.CV

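TL;DR: This paper proposes a new partial differential equation (PDE) framework for single-image dehazing that combines the atmospheric scattering model, nonlocal regularization, and the dark channel prior into an improved PDE model, and supports it with a theoretical proof and experiments.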

Details

Motivation: Existing dehazing methods have limited effectiveness on complex scenes, so a more robust and efficient solution is needed. Method: An improved PDE model that combines an edge-preserving diffusion coefficient, a Gaussian convolution operator, and an adaptive regularization parameter; the existence and uniqueness of weak solutions are proved with the Lax-Milgram theorem, and a fixed-point iteration is implemented with PyTorch GPU acceleration. Result: Experiments indicate that the method is a promising dehazing solution that can be generalized to deep learning models. Conclusion: The PDE framework provides theoretical support and an efficient implementation for single-image dehazing, with broad application potential. Abstract: This paper presents a novel partial differential equation (PDE) framework for single-image dehazing. By integrating the atmospheric scattering model with nonlocal regularization and dark channel prior, we propose the improved PDE: \[ -\text{div}\left(D(\nabla u)\nabla u\right) + \lambda(t) G(u) = \Phi(I,t,A) \] where $D(\nabla u) = (|\nabla u| + \epsilon)^{-1}$ is the edge-preserving diffusion coefficient, $G(u)$ is the Gaussian convolution operator, and $\lambda(t)$ is the adaptive regularization parameter based on transmission map $t$. We prove the existence and uniqueness of weak solutions in $H_0^1(\Omega)$ using Lax-Milgram theorem, and implement an efficient fixed-point iteration scheme accelerated by PyTorch GPU computation. The experimental results demonstrate that this method is a promising dehazing solution that can be generalized to the deep model paradigm.
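
As a small illustration of the edge-preserving diffusion coefficient $D(\nabla u) = (|\nabla u| + \epsilon)^{-1}$ defined in the abstract, the sketch below evaluates it on a 2D grayscale image with finite-difference gradients. The function name, the use of `numpy.gradient`, and the default value of epsilon are assumptions made for illustration; the paper's PyTorch fixed-point scheme is not reproduced here.

```python
import numpy as np

def diffusion_coefficient(u, eps=1e-3):
    """Evaluate D = 1 / (|grad u| + eps) per pixel of a 2D image.

    Assumption: gradients are approximated with numpy.gradient
    (central differences in the interior, one-sided at the borders).
    """
    u = np.asarray(u, dtype=float)
    grad_rows, grad_cols = np.gradient(u)
    grad_magnitude = np.sqrt(grad_rows ** 2 + grad_cols ** 2)
    return 1.0 / (grad_magnitude + eps)

# A flat image has zero gradient, so D reaches its maximum 1 / eps.
flat = np.zeros((5, 5))
# A horizontal ramp has unit gradient, so D is about 1 / (1 + eps).
ramp = np.tile(np.arange(5.0), (5, 1))

d_flat = diffusion_coefficient(flat)
d_ramp = diffusion_coefficient(ramp)
```

The coefficient is large in smooth regions and small where the gradient is strong, so diffusion is suppressed across edges; that is the sense in which it preserves edges.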

[64] Flow Diverse and Efficient: Learning Momentum Flow Matching via Stochastic Velocity Field Sampling

Zhiyuan Ma,Ruixun Liu,Sixian Liu,Jianjun Li,Bowen Zhou

Main category: cs.CV

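TL;DR: Discretized-RF discretizes the straight path into variable-velocity sub-paths and injects noise into the velocity rather than the data, improving diversity and multi-scale noise modeling.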

Details

Motivation: Address the limited diversity and insufficient multi-scale noise modeling of conventional straight-path RF models. Method: Discretized-RF discretizes the straight path into momentum-field sub-paths and adds noise to the velocity to change its direction. Result: Experiments show that the method produces diverse and efficient trajectories and consistently generates high-quality, diverse results. Conclusion: By optimizing noise in the velocity field, Discretized-RF markedly improves the performance of RF models. Abstract: Recently, the rectified flow (RF) has emerged as the new state-of-the-art among flow-based diffusion models due to its high efficiency advantage in straight path sampling, especially with the amazing images generated by a series of RF models such as Flux 1.0 and SD 3.0. Although a straight-line connection between the noisy and natural data distributions is intuitive, fast, and easy to optimize, it still inevitably leads to: 1) Diversity concerns, which arise since straight-line paths only cover a fairly restricted sampling space. 2) Multi-scale noise modeling concerns, since the straight line flow only needs to optimize the constant velocity field $\bm v$ between the two distributions $\bm\pi_0$ and $\bm\pi_1$. In this work, we present Discretized-RF, a new family of rectified flow (also called momentum flow models since they refer to the previous velocity component and the random velocity component in each diffusion step), which discretizes the straight path into a series of variable velocity field sub-paths (namely "momentum fields") to expand the search space, especially when close to the distribution $p_\text{noise}$. Different from the previous case where noise is directly superimposed on $\bm x$, we introduce noise on the velocity $\bm v$ of the sub-path to change its direction in order to improve the diversity and multi-scale noise modeling abilities. Experimental results on several representative datasets demonstrate that learning momentum flow matching by sampling random velocity fields will produce trajectories that are both diverse and efficient, and can consistently generate high-quality and diverse results. Code is available at https://github.com/liuruixun/momentum-fm.
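
To make the contrast in the abstract concrete, the sketch below shows the standard straight-line rectified-flow path, whose velocity is constant, next to a purely illustrative perturbation applied to the velocity instead of the sample. The function names, the Gaussian perturbation, and the `sigma` parameter are assumptions; this is not the Discretized-RF sub-path construction from the paper.

```python
import numpy as np

def straight_path(x0, x1, t):
    """Point and velocity on the straight path from x0 to x1 at time t in [0, 1]."""
    x_t = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0  # constant along the whole path
    return x_t, velocity

def perturb_velocity(velocity, sigma, rng):
    """Hypothetical illustration: add Gaussian noise to the velocity, not to x."""
    return velocity + sigma * rng.standard_normal(velocity.shape)

rng = np.random.default_rng(0)
x0 = np.zeros(2)           # stand-in for a sample from one endpoint distribution
x1 = np.array([2.0, 4.0])  # stand-in for a sample from the other endpoint distribution

x_mid, v = straight_path(x0, x1, 0.5)
v_noisy = perturb_velocity(v, sigma=0.1, rng=rng)
v_same = perturb_velocity(v, sigma=0.0, rng=rng)
```

With `sigma=0` the perturbed velocity equals the straight-path velocity, so the straight-line flow is recovered as a special case of this toy perturbation.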

[65] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation

Ziyao Huang,Zixiang Zhou,Juan Cao,Yifeng Ma,Yi Chen,Zejing Rao,Zhiyong Xu,Hongmei Wang,Qin Lin,Yuan Zhou,Qinglin Lu,Fan Tang

Main category: cs.CV

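TL;DR: HunyuanVideo-HOMA is a weakly conditioned multimodal-driven framework that improves the generalization and controllability of human-object interaction video generation while reducing dependence on precise inputs.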

Details

Motivation: Address the reliance on curated motion data, the limited generalization to novel objects and scenarios, and the restricted accessibility of human-object interaction video generation. Method: Using sparse, decoupled motion guidance, encode appearance and motion signals into the dual input space of a multimodal diffusion transformer and fuse them in a shared context space; training is optimized with an HOI adapter and a facial cross-attention adapter. Result: State-of-the-art interaction naturalness and generalization under weak supervision, with support for text-conditioned generation and interactive object manipulation. Conclusion: HunyuanVideo-HOMA demonstrates versatility across scenarios and supports practical use through a user-friendly demo interface. Abstract: To address key limitations in human-object interaction (HOI) video generation -- specifically the reliance on curated motion data, limited generalization to novel objects/scenarios, and restricted accessibility -- we introduce HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework. HunyuanVideo-HOMA enhances controllability and reduces dependency on precise inputs through sparse, decoupled motion guidance. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to synthesize temporally consistent and physically plausible interactions. To optimize training, we integrate a parameter-space HOI adapter initialized from pretrained MMDiT weights, preserving prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate audio-driven lip synchronization. Extensive experiments confirm state-of-the-art performance in interaction naturalness and generalization under weak supervision. Finally, HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and interactive object manipulation, supported by a user-friendly demo interface. The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.

[66] HiSin: Efficient High-Resolution Sinogram Inpainting via Resolution-Guided Progressive Inference

Jiaze E,Srutarshi Banerjee,Tekin Bicer,Guannan Wang,Yanfu Zhang,Bin Ren

Main category: cs.CV

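TL;DR: HiSin is an efficient diffusion-based sinogram inpainting framework that uses resolution-guided progressive inference to achieve memory-efficient inpainting and reduce redundant computation.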

Details

Motivation: High-resolution sinogram inpainting is essential for CT reconstruction, but existing diffusion models are limited by excessive memory and computational demands. Method: HiSin uses resolution-guided progressive inference: it first extracts global structure at low resolution, then processes small patches at high resolution, and adds frequency-aware patch skipping and structure-adaptive step allocation. Result: Experiments show that HiSin reduces peak memory usage by up to 31.25% and inference time by up to 18.15% while maintaining inpainting accuracy. Conclusion: HiSin offers an efficient and accurate solution for high-resolution sinogram inpainting. Abstract: High-resolution sinogram inpainting is essential for computed tomography reconstruction, as missing high-frequency projections can lead to visible artifacts and diagnostic errors. Diffusion models are well-suited for this task due to their robustness and detail-preserving capabilities, but their application to high-resolution inputs is limited by excessive memory and computational demands. To address this limitation, we propose HiSin, a novel diffusion-based framework for efficient sinogram inpainting via resolution-guided progressive inference. It progressively extracts global structure at low resolution and defers high-resolution inference to small patches, enabling memory-efficient inpainting. It further incorporates frequency-aware patch skipping and structure-adaptive step allocation to reduce redundant computation. Experimental results show that HiSin reduces peak memory usage by up to 31.25% and inference time by up to 18.15%, and maintains inpainting accuracy across datasets, resolutions, and mask conditions.

[67] Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought

Shuyi Zhang,Xiaoshuai Hao,Yingbo Tang,Lingfeng Zhang,Pengwei Wang,Zhongyuan Wang,Hongxuan Ma,Shanghang Zhang

Main category: cs.CV

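TL;DR: Video-CoT is a new dataset and benchmark that aims to improve the analysis of fine spatiotemporal detail in video understanding through Chain-of-Thought methods.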

Details

Motivation: Existing large-scale vision-language models capture the spatiotemporal details of videos poorly, so finer-grained datasets and methods are needed to fill this gap. Method: Introduce the Video-CoT dataset, which contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, together with an evaluation benchmark. Result: Experiments show that current models perform poorly on spatiotemporal understanding tasks, which highlights the difficulty of the task. Conclusion: Video-CoT opens new directions for research in multimedia understanding and supports the development of advanced video analysis in future intelligent systems. Abstract: Video content comprehension is essential for various applications, ranging from video analysis to interactive systems. Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, providing a solid foundation for evaluating spatiotemporal understanding in video comprehension. Additionally, we provide a comprehensive benchmark for assessing these tasks, with each task featuring 750 images and tailored evaluation metrics. Our extensive experiments reveal that current VLMs face significant challenges in achieving satisfactory performance, highlighting the difficulties of effective spatiotemporal understanding. Overall, the Video-CoT dataset and benchmark open new avenues for research in multimedia understanding and support future innovations in intelligent systems requiring advanced video analysis capabilities. By making these resources publicly available, we aim to encourage further exploration in this critical area. Project website: https://video-cot.github.io/

[68] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Shravan Nayak,Mehar Bhatia,Xiaofeng Zhang,Verena Rieser,Lisa Anne Hendricks,Sjoerd van Steenkiste,Yash Goyal,Karolina Stańczak,Aishwarya Agrawal

Main category: cs.CV

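TL;DR: Using the CulturalFrames benchmark, this paper quantifies the shortcomings of text-to-image (T2I) models in cultural representation. It finds that, among failures, explicit and implicit cultural expectations are missed at average rates of 68% and 49%, respectively, and that existing evaluation metrics correlate poorly with human judgments.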

Details

Motivation: Assess how well T2I models represent diverse cultural contexts and expose their shortcomings in cultural accuracy. Method: Build the CulturalFrames benchmark, which spans 10 countries and 5 socio-cultural domains, generate 3637 images, collect over 10k human annotations, and systematically evaluate 4 T2I models. Result: Among failures, explicit and implicit cultural expectations are missed at average rates of 68% and 49%, respectively, and existing evaluation metrics correlate poorly with human judgments. Conclusion: T2I models show significant shortcomings in cultural representation, and more culturally informed models and evaluation methods are needed. Abstract: The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.

[69] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

Jingguo Qu,Xinyang Han,Tonghuan Xiao,Jia Ai,Juan Wu,Tong Zhao,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying

Main category: cs.CV

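TL;DR: This paper proposes a domain adaptation approach for ultrasound image analysis that fine-tunes vision-language foundation models and uses a large language model as a text refiner, improving model performance.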

Details

Motivation: Manual annotation in ultrasound image analysis is time-consuming and prone to inconsistency, and existing vision-language foundation models perform poorly on medical imaging, so the domain gap must be addressed. Method: Fine-tune vision-language foundation models using a large language model as a text refiner, together with specially designed adaptation strategies and task-driven heads. Result: Experiments on six ultrasound datasets show that the method outperforms existing vision-language and pure foundation models on segmentation and classification. Conclusion: The method effectively improves the performance of vision-language foundation models in ultrasound image analysis and offers a new direction for medical imaging. Abstract: Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet, their performance is hindered by the significant differences between natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore the fine-tuning pipeline for vision-language foundation models by utilizing a large language model as a text refiner with specially designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method can effectively improve the performance of vision-language foundation models for ultrasound image analysis, and outperform the existing state-of-the-art vision-language and pure foundation models. The source code of this study is available at https://github.com/jinggqu/NextGen-UIA.

Junzhuo Liu,Markus Eckstein,Zhixiang Wang,Friedrich Feuerhake,Dorit Merhof

Main category: cs.CV

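TL;DR: This paper proposes a contrastive learning-based deep learning method for predicting spatially resolved gene expression from whole-slide images, improving prediction accuracy.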

Details

Motivation: Spatial transcriptomics data are expensive to acquire and large-scale data are hard to obtain, so an efficient way to predict gene expression from existing data is needed. Method: A contrastive learning-based deep learning framework predicts spatial gene expression from whole-slide images. Result: Evaluated on six disease datasets, the method improves the Pearson correlation coefficient for highly expressed genes, highly variable genes, and marker genes by 6.27%, 6.11%, and 11.26%, respectively. Conclusion: The method preserves gene-gene correlations, applies to datasets with limited samples, and shows potential for cancer tissue localization. Abstract: Spatial transcriptomics is a technology that captures gene expression levels at different spatial locations, widely used in tumor microenvironment analysis and molecular profiling of histopathology, providing valuable insights into resolving gene expression and clinical diagnosis of cancer. Due to the high cost of data acquisition, large-scale spatial transcriptomics data remain challenging to obtain. In this study, we develop a contrastive learning-based deep learning method to predict spatially resolved gene expression from whole-slide images. Evaluation across six different disease datasets demonstrates that, compared to existing studies, our method improves Pearson Correlation Coefficient (PCC) in the prediction of highly expressed genes, highly variable genes, and marker genes by 6.27%, 6.11%, and 11.26% respectively. Further analysis indicates that our method preserves gene-gene correlations and applies to datasets with limited samples. Additionally, our method exhibits potential in cancer tissue localization based on biomarker expression.

[71] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

Zike Wu,Qi Yan,Xuanyu Yi,Lele Wang,Renjie Liao

Main category: cs.CV

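TL;DR: StreamSplat is a framework for real-time reconstruction of dynamic 3D scenes that addresses real-time processing of uncalibrated video streams, dynamic scene modeling, and long-term stability.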

Details

Motivation: Existing methods struggle to handle uncalibrated inputs, dynamic scene modeling, and long-term stability at the same time; StreamSplat aims to address all three. Method: A probabilistic sampling mechanism in the static encoder and a bidirectional deformation field in the dynamic decoder enable efficient dynamic modeling. Result: Outperforms prior methods on static and dynamic benchmarks and supports online reconstruction of video streams of arbitrary length. Conclusion: StreamSplat delivers strong reconstruction quality and dynamic modeling and is suitable for real-time applications. Abstract: Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at https://github.com/nickwzk/StreamSplat.

[72] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

Leqi Shen,Guoqiang Gong,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding

Main category: cs.CV

TL;DR: The paper proposes DiscoVLA, which improves video-text retrieval by simultaneously reducing discrepancies in vision, language, and alignment.

Details

Motivation: CLIP focuses on image-level vision-language matching, whereas video-text retrieval requires video-level understanding. Existing methods mainly address the vision discrepancy and neglect language and alignment. Method: Image-Video Features Fusion integrates image-level and video-level features, pseudo image captions are generated to learn fine-grained alignment, and Image-to-Video Alignment Distillation strengthens video-level alignment. Result: On MSRVTT with CLIP (ViT-B/16), DiscoVLA reaches 50.5% R@1, outperforming previous methods by 1.5%. Conclusion: By addressing the vision, language, and alignment discrepancies together, DiscoVLA significantly improves video-text retrieval performance. Abstract: The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
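
The headline number above is R@1 (recall at rank 1) for text-to-video retrieval. A minimal sketch of how such a metric is commonly computed from a query-by-candidate similarity matrix follows; the function name `recall_at_k`, the assumption that query i matches video i, and the toy scores are all illustrative and not taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item is ranked in the top k.

    similarity[i][j] is the score of text query i against video j.
    Assumption for this sketch: the correct video for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up scores).
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct video ranked first
    [0.2, 0.3, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_2 = recall_at_k(toy_similarity, k=2)
```

In this toy case two of three queries rank their video first, so R@1 is 2/3, while R@2 is 1.0.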

[73] Product of Experts for Visual Generation

Yunzhi Zhang,Carson Murtuza-Lanier,Zizhang Li,Yilun Du,Jiajun Wu

Main category: cs.CV

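TL;DR: Proposes a Product of Experts (PoE) framework that composes knowledge from heterogeneous models with a training-free method (AIS sampling), improving the controllability and flexibility of image and video synthesis.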

Details

Motivation: Modern neural models hold rich priors and complementary knowledge over shared data domains such as images and videos, but how to integrate diverse knowledge sources (generative models, language models, graphics engines, and so on) remains under-explored. Method: A Product of Experts (PoE) framework samples from the product distribution of heterogeneous models via Annealed Importance Sampling (AIS) to compose their knowledge. Result: The framework shows practical benefits in image and video synthesis, offers better controllability than monolithic methods, and provides flexible user interfaces for specifying generation goals. Conclusion: The framework offers an effective training-free way to integrate knowledge from heterogeneous models, improving the flexibility and controllability of generation tasks. Abstract: Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
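
To make the notion of a product distribution concrete, the sketch below multiplies two tiny discrete experts and renormalizes. This is deliberately not AIS: on a space this small the product can be computed exactly, whereas the paper samples from it approximately. The function name `product_of_experts` and the toy experts are hypothetical.

```python
import math


def product_of_experts(*experts):
    """Exact product of discrete distributions, renormalized to sum to 1.

    Each expert is a dict mapping an outcome to its probability.
    An outcome missing from any expert gets probability 0 in the product.
    """
    outcomes = set().union(*(expert.keys() for expert in experts))
    unnormalized = {
        x: math.prod(expert.get(x, 0.0) for expert in experts) for x in outcomes
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("experts share no outcome with nonzero probability")
    return {x: value / total for x, value in unnormalized.items()}


# Toy experts: one is indifferent, the other strongly prefers "a".
indifferent = {"a": 0.5, "b": 0.5}
opinionated = {"a": 0.9, "b": 0.1}
combined = product_of_experts(indifferent, opinionated)
```

The product concentrates mass where every expert agrees; here the indifferent expert leaves the opinionated one unchanged, giving 0.9 for "a" and 0.1 for "b".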

[74] WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos

Negin Ghamsarian,Raphael Sznitman,Klaus Schoeffmann,Jens Kowal

Main category: cs.CV

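TL;DR: WetCat is the first wetlab cataract surgery video dataset designed for automated skill assessment, filling the gap left by existing resources for comprehensive skill evaluation in wetlab settings.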

Details

Motivation: Traditional wetlab training relies on manual evaluation, which is inefficient and subjective. Computer vision makes automated assessment possible, but existing datasets mostly cover real surgeries or isolated tasks and do not meet wetlab needs. Method: WetCat contains high-resolution videos of surgeries performed by trainees on artificial eyes, with detailed phase annotations and semantic segmentations of key anatomical structures. Result: The dataset supports skill assessment during the critical capsulorhexis and phacoemulsification phases and lays a foundation for interpretable AI-driven evaluation tools. Conclusion: WetCat sets a new benchmark for objective, scalable ophthalmic surgical education; the dataset is publicly available. Abstract: To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings. To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse https://www.synapse.org/Synapse:syn66401174/files.

[75] MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis

José Morano,Botond Fazekas,Emese Sükei,Ronald Fecso,Taha Emre,Markus Gumpinger,Georg Faustmann,Marzieh Oghbaie,Ursula Schmidt-Erfurth,Hrvoje Bogunović

Main category: cs.CV

TL;DR: MIRAGE is a new multimodal foundation model for analyzing OCT and SLO images that outperforms existing models on classification and segmentation tasks.

Details

Motivation: Existing AI models for ophthalmic image analysis depend on extensive annotation and generalize poorly, while available foundation models lack validation and support only a single modality. Method: Proposes the MIRAGE multimodal foundation model together with a new evaluation benchmark covering OCT/SLO classification and segmentation tasks. Result: MIRAGE outperforms general and specialized foundation models and segmentation methods on both classification and segmentation. Conclusion: MIRAGE is a suitable basis for developing robust ophthalmic AI systems; the model and the evaluation benchmark are publicly available. Abstract: Artificial intelligence (AI) has become a fundamental tool for assisting clinicians in analyzing ophthalmic images, such as optical coherence tomography (OCT). However, developing AI models often requires extensive annotation, and existing models tend to underperform on independent, unseen data. Foundation models (FMs), large AI models trained on vast unlabeled datasets, have shown promise in overcoming these challenges. Nonetheless, available FMs for ophthalmology lack extensive validation, especially for segmentation tasks, and focus on a single imaging modality. In this context, we propose MIRAGE, a novel multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO) images. Additionally, we propose a new evaluation benchmark with OCT/SLO classification and segmentation tasks. The comparison with general and specialized FMs and segmentation methods shows the superiority of MIRAGE in both types of tasks, highlighting its suitability as a basis for the development of robust AI systems for retinal OCT image analysis. Both MIRAGE and the evaluation benchmark are publicly available: https://github.com/j-morano/MIRAGE.

[76] Hyperbolic Dual Feature Augmentation for Open-Environment

Peilin Yu,Yuwei Wu,Zhi Gao,Xiaomeng Fan,Shuo Yang,Yunde Jia

Main category: cs.CV

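TL;DR: Proposes a hyperbolic dual feature augmentation method for open environments that improves the generalization of hyperbolic learning algorithms.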

Details

Motivation: Existing hyperbolic feature augmentation methods are limited to closed environments and cannot handle unseen classes; this paper aims to address that limitation. Method: A neural ODE module combined with meta-learning estimates feature distributions, a regularizer preserves the hierarchical structure of the data, and a derived upper bound on the loss supports infinite augmentations. Result: The method significantly improves the performance of hyperbolic algorithms on five open-environment tasks. Conclusion: The method effectively strengthens hyperbolic algorithms in open environments. Abstract: Feature augmentation generates novel samples in the feature space, providing an effective way to enhance the generalization ability of learning algorithms with hyperbolic geometry. Most hyperbolic feature augmentation is confined to closed environments, assuming the number of classes is fixed (i.e., seen classes) and generating features only for these classes. In this paper, we propose a hyperbolic dual feature augmentation method for open environments, which augments features for both seen and unseen classes in the hyperbolic space. To obtain a more precise approximation of the real data distribution for efficient training, (1) we adopt a neural ordinary differential equation module, enhanced by meta-learning, estimating the feature distributions of both seen and unseen classes; (2) we then introduce a regularizer to preserve the latent hierarchical structures of data in the hyperbolic space; (3) we also derive an upper bound for the hyperbolic dual augmentation loss, allowing us to train a hyperbolic model using infinite augmentations for seen and unseen classes. Extensive experiments on five open-environment tasks: class-incremental learning, few-shot open-set recognition, few-shot learning, zero-shot learning, and general image classification, demonstrate that our method effectively enhances the performance of hyperbolic algorithms in open environments.

[77] SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping

Jiajun Li,Yue Ma,Xinyu Zhang,Qingyan Wei,Songhua Liu,Linfeng Zhang

Main category: cs.CV

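TL;DR: The paper analyzes the inference efficiency of VAR models, identifies two sources of redundancy (step redundancy and unconditional branch redundancy), and designs automatic step skipping and unconditional branch replacement to address them. It then proposes SkipVAR, a framework that dynamically selects an acceleration strategy; experiments show significant speedups.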

Details

Motivation: High-frequency components, or later steps, of VAR models dominate inference latency, yet the computational redundancy in these steps has not been studied in depth. Method: Proposes an automatic step-skipping strategy and unconditional branch replacement, and designs the SkipVAR framework to select an acceleration strategy dynamically per sample. Result: SkipVAR maintains model quality while achieving up to 1.81x overall acceleration and a 2.62x speedup on the GenEval benchmark. Conclusion: Frequency-aware, training-free adaptive acceleration is effective for scalable autoregressive image generation. Abstract: Recent studies on Visual Autoregressive (VAR) models have highlighted that high-frequency components, or later steps, in the generation process contribute disproportionately to inference latency. However, the underlying computational redundancy involved in these steps has yet to be thoroughly investigated. In this paper, we conduct an in-depth analysis of the VAR inference process and identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy. To address step redundancy, we propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency. For unconditional branch redundancy, we observe that the information gap between the conditional and unconditional branches is minimal. Leveraging this insight, we introduce unconditional branch replacement, a technique that bypasses the unconditional branch to reduce computational cost. Notably, we observe that the effectiveness of acceleration strategies varies significantly across different samples. Motivated by this, we propose SkipVAR, a sample-adaptive framework that leverages frequency information to dynamically select the most suitable acceleration strategy for each instance. To evaluate the role of high-frequency information, we introduce high-variation benchmark datasets that test model sensitivity to fine details. Extensive experiments show SkipVAR achieves over 0.88 average SSIM with up to 1.81x overall acceleration and 2.62x speedup on the GenEval benchmark, maintaining model quality. These results confirm the effectiveness of frequency-aware, training-free adaptive acceleration for scalable autoregressive image generation. Our code is publicly available at https://github.com/fakerone-li/SkipVAR.

[78] Inherently Faithful Attention Maps for Vision Transformers

Ananthu Aniraj,Cassio F. Dantas,Dino Ienco,Diego Marcos

Main category: cs.CV

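TL;DR: Proposes an attention-based two-stage framework that uses binary attention masks so that only relevant image regions influence the prediction, improving robustness to spurious correlations and out-of-distribution backgrounds.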

Details

Motivation: Context strongly affects object perception and can lead to biased representations, especially when objects appear against out-of-distribution backgrounds. At the same time, many tasks require identifying relevant regions, and context can introduce distractions. Method: A two-stage framework: stage 1 processes the full image to discover object parts and task-relevant regions; stage 2 uses attention masking to restrict its receptive field to those regions and filter out distracting information. The two stages are trained jointly, so stage 2 can refine stage 1. Result: Across multiple benchmarks, the method significantly improves robustness to spurious correlations and out-of-distribution backgrounds. Conclusion: The proposed two-stage attention framework effectively handles contextual distraction while retaining the ability to identify task-relevant regions. Abstract: We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.

[79] Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions

David Acuna,Ximing Lu,Jaehun Jung,Hyunwoo Kim,Amlan Kar,Sanja Fidler,Yejin Choi

Main category: cs.CV

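TL;DR: Proposes a Monte Carlo Tree Search (MCTS)-based algorithm that injects subquestion-subanswer pairs to elicit the latent knowledge of non-reasoning models, without additional training or supervision.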

Details

Motivation: Explores whether a search mechanism can elicit the latent knowledge of already-deployed non-reasoning models without retraining them. Method: An MCTS-inspired algorithm uses subquestion-subanswer pairs to guide the model toward long reasoning traces. Result: Consistent improvements on three benchmarks, including a 2% overall gain on MMMU-PRO and a significant 9% gain in Liberal Arts. Conclusion: Framing reasoning as a search process helps connect fragmented knowledge and gives non-reasoning models long-form reasoning ability. Abstract: Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning -- akin to the success observed in language models -- via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces -- without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model's output stream. We show that framing reasoning as a search process -- where subquestions act as latent decisions within a broader inference trajectory -- helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.

[80] What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Wendong Bu,Yang Wu,Qifan Yu,Minghe Gao,Bingchen Miao,Zhenkui Zhang,Kaihang Pan,Yunfei Li,Mengze Li,Wei Ji,Juncheng Li,Siliang Tang,Yueting Zhuang

Main category: cs.CV

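TL;DR: OmniBench is a self-generating, cross-platform, graph-based benchmark that controls task complexity through subtask composition, addressing limitations of existing benchmarks. OmniEval is a multidimensional evaluation framework that assesses 10 capabilities of virtual agents. The dataset contains 36k graph-structured tasks with a 91% human acceptance rate.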

Details

Motivation: Existing benchmarks suffer from uncontrollable task complexity, manual annotation with limited scenarios, and a lack of multidimensional evaluation, so more efficient evaluation tools are needed. Method: Proposes OmniBench and OmniEval: the former generates tasks of controllable complexity through subtask composition, and the latter provides a multidimensional evaluation framework. The dataset contains 36k graph-structured tasks. Result: The dataset achieves a 91% human acceptance rate, and graph-structured data guides agents more efficiently than manually annotated data. Multidimensional evaluation reveals how open-source and closed-source models perform across capabilities. Conclusion: OmniBench and OmniEval provide efficient tools for multidimensional evaluation of virtual agents and pave the way for future research. Abstract: As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenarios, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data shows that it can more efficiently guide agents compared to manually annotated data. We conduct multidimensional evaluations for various open-source and closed-source models, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io/.

[81] SSS: Semi-Supervised SAM-2 with Efficient Prompting for Medical Imaging Segmentation

Hongjie Zhu,Xiwei Liu,Rundong Xue,Zeyu Zhang,Yong Xu,Daji Ergu,Ying Cai,Yang Zhao

Main category: cs.CV

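TL;DR: The paper proposes SSS, a semi-supervised learning method that uses SAM-2's strong feature extraction to improve medical image segmentation, optimizing the model through multi-view feature discrepancies and a physically constrained prompt generator.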

Details

Motivation: In the era of information explosion, efficiently using large-scale unlabeled data while reducing reliance on high-quality annotations is a key challenge in medical image analysis. Semi-supervised learning improves performance through knowledge transfer, and combining it with SAM-2's prior knowledge further strengthens feature support. Method: Building on the single-stream "weak-to-strong" consistency regularization framework, the paper introduces a Discriminative Feature Enhancement (DFE) mechanism that models the feature discrepancies produced by multi-scale data augmentation strategies. It also develops a prompt generator that combines Physical Constraints with a Sliding Window (PCSW) to supply input prompts to SAM-2. Result: Experiments on the ACDC and BHSD datasets show that SSS clearly outperforms existing techniques, reaching an average Dice score of 53.15 on BHSD, 3.65 higher than the previous best method. Conclusion: By combining SAM-2's feature extraction with a semi-supervised learning framework, SSS effectively improves medical image segmentation and offers a new direction for future research. Abstract: In the era of information explosion, efficiently leveraging large-scale unlabeled data while minimizing the reliance on high-quality pixel-level annotations remains a critical challenge in the field of medical imaging. Semi-supervised learning (SSL) enhances the utilization of unlabeled data by facilitating knowledge transfer, significantly improving the performance of fully supervised models and emerging as a highly promising research direction in medical image analysis. Inspired by the ability of Vision Foundation Models (e.g., SAM-2) to provide rich prior knowledge, we propose SSS (Semi-Supervised SAM-2), a novel approach that leverages SAM-2's robust feature extraction capabilities to uncover latent knowledge in unlabeled medical images, thus effectively enhancing feature support for fully supervised medical image segmentation. Specifically, building upon the single-stream "weak-to-strong" consistency regularization framework, this paper introduces a Discriminative Feature Enhancement (DFE) mechanism to further explore the feature discrepancies introduced by various data augmentation strategies across multiple views. By leveraging feature similarity and dissimilarity across multi-scale augmentation techniques, the method reconstructs and models the features, thereby effectively optimizing the salient regions. Furthermore, a prompt generator is developed that integrates Physical Constraints with a Sliding Window (PCSW) mechanism to generate input prompts for unlabeled data, fulfilling SAM-2's requirement for additional prompts. Extensive experiments demonstrate the superiority of the proposed method for semi-supervised medical image segmentation on two multi-label datasets, i.e., ACDC and BHSD. Notably, SSS achieves an average Dice score of 53.15 on BHSD, surpassing the previous state-of-the-art method by +3.65 Dice. Code will be available at https://github.com/AIGeeksGroup/SSS.
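
The results above are reported as Dice scores. For readers unfamiliar with the metric, here is a minimal sketch for binary masks in plain Python. The function name `dice_score`, the flat 0/1 mask representation, and the convention of returning 1.0 when both masks are empty are assumptions of this sketch; the abstract does not describe how the paper averages over classes or cases, and it reports the value scaled to 0-100.

```python
def dice_score(predicted_mask, target_mask):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat binary (0/1) masks."""
    if len(predicted_mask) != len(target_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(
        1 for p, t in zip(predicted_mask, target_mask) if p == 1 and t == 1
    )
    total = sum(predicted_mask) + sum(target_mask)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy 4-pixel masks: one overlapping foreground pixel out of two in each mask.
toy_dice = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

With one shared foreground pixel and two foreground pixels per mask, the toy score is 2·1 / (2 + 2) = 0.5.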

[82] Cross-Spectral Body Recognition with Side Information Embedding: Benchmarks on LLCM and Analyzing Range-Induced Occlusions on IJB-MDF

Anirudh Nanduri,Siyuan Huang,Rama Chellappa

Main category: cs.CV

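TL;DR: The paper studies how to adapt a pretrained Vision Transformer (ViT) to cross-spectral body recognition, improves performance with Side Information Embedding (SIE), and analyzes the effect of occlusion on visible-infrared (VI) re-identification.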

Details

Motivation: Cross-spectral body recognition (for example, matching visible and infrared images) is a challenging task, and existing work has paid little attention to occlusion. The paper aims to improve performance with ViT and SIE and to fill the gap in occlusion research. Method: A pretrained ViT model is combined with SIE that encodes camera and domain information. Performance is evaluated on the LLCM and IJB-MDF datasets, with particular attention to occlusion. Result: Encoding only camera information (without explicit domain information) achieves state-of-the-art performance on LLCM. The occlusion analysis exposes the challenges of cross-range, cross-spectral matching. Conclusion: ViT combined with SIE performs well in cross-spectral recognition, but occlusion needs further study, especially in VI-ReID. Abstract: Vision Transformers (ViTs) have demonstrated impressive performance across a wide range of biometric tasks, including face and body recognition. In this work, we adapt a ViT model pretrained on visible (VIS) imagery to the challenging problem of cross-spectral body recognition, which involves matching images captured in the visible and infrared (IR) domains. Recent ViT architectures have explored incorporating additional embeddings beyond traditional positional embeddings. Building on this idea, we integrate Side Information Embedding (SIE) and examine the impact of encoding domain and camera information to enhance cross-spectral matching. Surprisingly, our results show that encoding only camera information - without explicitly incorporating domain information - achieves state-of-the-art performance on the LLCM dataset. While occlusion handling has been extensively studied in visible-spectrum person re-identification (Re-ID), occlusions in visible-infrared (VI) Re-ID remain largely underexplored - primarily because existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly feature full-body, unoccluded images. To address this gap, we analyze the impact of range-induced occlusions using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which provides a diverse set of visible and infrared images captured at various distances, enabling cross-range, cross-spectral evaluations.

[83] Segment Concealed Objects with Incomplete Supervision

Chunming He,Kai Li,Yachao Zhang,Ziyun Yang,Youwei Pang,Longxiang Tang,Chengyu Fang,Yulun Zhang,Linghe Kong,Xiu Li,Sina Farsiu

Main category: cs.CV

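TL;DR: The paper proposes SEE, a unified method for incompletely supervised concealed object segmentation (ISCOS) that uses the Segment Anything Model (SAM) to generate pseudo-labels and adds strategies for pseudo-label quality and feature grouping, significantly improving segmentation performance.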

Details

Motivation: Concealed object segmentation is highly challenging because the training data are incompletely annotated and the objects closely resemble their backgrounds; a unified method is needed to address both limited supervision and segmentation difficulty. Method: Proposes the SEE framework, which uses SAM to generate pseudo-labels and improves their quality through strategies for pseudo-label generation, storage, and supervision; a hybrid-granularity feature grouping module promotes segmentation coherence. Result: Experiments show that SEE achieves state-of-the-art performance on multiple ISCOS tasks and can serve as a plug-and-play solution that improves existing models. Conclusion: By combining SAM with these strategies, SEE effectively addresses limited supervision and segmentation difficulty in ISCOS and has broad application potential. Abstract: Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments, utilizing incompletely annotated data, such as weak and semi-annotations, for model training. This task remains highly challenging due to (1) the limited supervision provided by the incompletely annotated training data, and (2) the difficulty of distinguishing concealed objects from the background, which arises from the intrinsic similarities in concealed scenarios. In this paper, we introduce the first unified method for ISCOS to address these challenges. To tackle the issue of incomplete supervision, we propose a unified mean-teacher framework, SEE, that leverages the vision foundation model, Segment Anything Model (SAM), to generate pseudo-labels using coarse masks produced by the teacher model as prompts. To mitigate the effect of low-quality segmentation masks, we introduce a series of strategies for pseudo-label generation, storage, and supervision. These strategies aim to produce informative pseudo-labels, store the best pseudo-labels generated, and select the most reliable components to guide the student model, thereby ensuring robust network training. Additionally, to tackle the issue of intrinsic similarity, we design a hybrid-granularity feature grouping module that groups features at different granularities and aggregates these results. By clustering similar features, this module promotes segmentation coherence, facilitating more complete segmentation for both single-object and multiple-object images. We validate the effectiveness of our approach across multiple ISCOS tasks, and experimental results demonstrate that our method achieves state-of-the-art performance. Furthermore, SEE can serve as a plug-and-play solution, enhancing the performance of existing models.

[84] Data Augmentation For Small Object using Fast AutoAugment

DaeEun Yoon,Semin Kim,SangWook Yoo,Jongha Lee

Main category: cs.CV

TL;DR: Proposes a data augmentation method based on Fast AutoAugment that significantly improves small object detection.

Details

Motivation: Detection performance on small objects is far below that on large objects, making it an important challenge in computer vision. Method: Fast AutoAugment is used to quickly find optimal data augmentation policies that overcome the performance degradation in small object detection. Result: A 20% performance improvement on the DOTA dataset. Conclusion: The method effectively addresses the weak performance of small object detection. Abstract: In recent years, there has been tremendous progress in object detection performance. However, despite these advances, the detection performance for small objects is significantly inferior to that of large objects. Detecting small objects is one of the most challenging and important problems in computer vision. To improve the detection performance for small objects, we propose an optimal data augmentation method using Fast AutoAugment. Through our proposed method, we can quickly find optimal augmentation policies that can overcome degradation when detecting small objects, and we achieve a 20% performance improvement on the DOTA dataset.

[85] ORIDa: Object-centric Real-world Image Composition Dataset

Jinwoo Kim,Sangmin Han,Jinho Jeong,Jiwoo Choi,Dongyoung Kim,Seon Joo Kim

Main category: cs.CV

TL;DR: ORIDa is a large-scale, real-captured dataset for object compositing, containing over 30,000 images of 200 unique objects across diverse scenes and positions.

Details

Motivation: Existing datasets lack the diversity and scale needed to explore real-world scenarios comprehensively; ORIDa fills this gap. Method: ORIDa contains two types of data, factual-counterfactual sets (five images per set) and factual-only scenes (a single image), covering diverse environments and object positions. Result: ORIDa is the first publicly available real-world image composition dataset of this scale and complexity. Conclusion: ORIDa is a valuable resource for advancing research in object compositing. Abstract: Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models. However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. ORIDa has two types of data: factual-counterfactual sets and factual-only scenes. The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene. The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition. Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing.

[86] ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations

Amirreza Rouhi,Solmaz Arezoomandan,Knut Peterson,Joseph T. Woods,David K. Han

Main category: cs.CV

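TL;DR: ADAM is a training-free, self-refining framework for open-world object labeling that uses an LLM to generate candidate labels, pairs them with CLIP visual embeddings, and assigns labels to unknown objects through retrieval and voting.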

Details

Motivation: Conventional object detection models rely on predefined categories and cannot identify novel objects in the open world. Method: LLM-generated candidate labels are paired with CLIP visual embeddings to build an Embedding-Label Repository, and unknown objects are labeled through retrieval, voting, and self-refinement. Result: Experiments on COCO and PASCAL show that ADAM effectively labels novel categories without fine-tuning or retraining. Conclusion: ADAM offers an efficient training-free solution for open-world object labeling. Abstract: Object detection models typically rely on predefined categories, limiting their ability to identify novel objects in open-world scenarios. To overcome this constraint, we introduce ADAM: Autonomous Discovery and Annotation Model, a training-free, self-refining framework for open-world object labeling. ADAM leverages large language models (LLMs) to generate candidate labels for unknown objects based on contextual information from known entities within a scene. These labels are paired with visual embeddings from CLIP to construct an Embedding-Label Repository (ELR) that enables inference without category supervision. For a newly encountered unknown object, ADAM retrieves visually similar instances from the ELR and applies frequency-based voting and cross-modal re-ranking to assign a robust label. To further enhance consistency, we introduce a self-refinement loop that re-evaluates repository labels using visual cohesion analysis and k-nearest-neighbor-based majority re-labeling. Experimental results on the COCO and PASCAL datasets demonstrate that ADAM effectively annotates novel categories using only visual and contextual signals, without requiring any fine-tuning or retraining.
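
The retrieve-then-vote step described in the abstract can be pictured as a k-nearest-neighbor lookup followed by a majority vote. The sketch below is a simplified illustration under assumptions: the repository is a plain list of (embedding, label) pairs, similarity is cosine similarity, and the cross-modal re-ranking and self-refinement loop are omitted. The names `cosine_similarity` and `vote_label` and the 2-D toy embeddings are hypothetical, not ADAM's actual interface.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vote_label(query_embedding, repository, k=3):
    """Label a query by majority vote over its k most similar repository entries.

    repository: list of (embedding, label) pairs (a stand-in for the ELR).
    """
    ranked = sorted(
        repository,
        key=lambda entry: cosine_similarity(query_embedding, entry[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy repository with 2-D embeddings (real CLIP embeddings are much larger).
toy_repository = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.1, 0.9], "dog"),
    ([0.8, 0.2], "dog"),
]
label = vote_label([1.0, 0.0], toy_repository, k=3)
```

For the query `[1.0, 0.0]`, the three nearest entries carry the labels cat, cat, and dog, so the vote returns "cat" even though one neighbor disagrees.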

[87] Rethinking Range-View LiDAR Segmentation in Adverse Weather

Longyu Yang,Ping Hu,Lu Zhang,Jun Liu,Yap-Peng Tan,Heng Tao Shen,Xiaofeng Zhu

Main category: cs.CV

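TL;DR: The paper proposes a lightweight, modular framework that processes geometric attributes and reflectance intensity separately, improving the generalization of LiDAR segmentation in adverse weather.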

Details

Motivation: Range-view LiDAR segmentation generalizes poorly in adverse weather, which limits its reliability in real-world environments. Method: The initial stem block of standard range-view networks is reformulated into two branches that process geometric attributes and reflectance intensity separately, with GAS and RDC modules to suppress noise and calibrate distortions. Result: Experiments show significantly better generalization in adverse weather with minimal inference overhead. Conclusion: The framework offers a practical and efficient solution for real-world LiDAR segmentation. Abstract: LiDAR segmentation has emerged as an important task to enrich multimedia experiences and analysis. Range-view-based methods have gained popularity due to their high computational efficiency and compatibility with real-time deployment. However, their generalization performance under adverse weather conditions remains underexplored, limiting their reliability in real-world environments. In this work, we identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. To address these challenges, we propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our method reformulates the initial stem block of standard range-view networks into two branches to process geometric attributes and reflectance intensity separately. Specifically, a Geometric Abnormality Suppression (GAS) module reduces the influence of weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module corrects reflectance distortions through memory-guided adaptive instance normalization. The processed features are then fused and passed to the original segmentation pipeline. Extensive experiments on different benchmarks and baseline models demonstrate that our approach significantly improves generalization to adverse weather with minimal inference overhead, offering a practical and effective solution for real-world LiDAR segmentation.
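
The RDC module is described as using memory-guided adaptive instance normalization. The abstract does not say how the memory supplies its statistics, so the sketch below shows only the generic adaptive instance normalization step: features are standardized with their own mean and standard deviation, then rescaled to target statistics that are simply passed in. The function name `adaptive_instance_norm`, its parameters, and the toy values are assumptions of this sketch.

```python
import math


def adaptive_instance_norm(values, target_mean, target_std, eps=1e-5):
    """Standardize values with their own statistics, then apply target statistics.

    values: feature values of one instance/channel (e.g. reflectance features).
    target_mean, target_std: statistics to impose; in the paper these would
    presumably come from a memory, which is not modeled here.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance + eps)  # eps avoids division by zero
    return [target_std * (v - mean) / std + target_mean for v in values]


# Toy example: shift a distorted channel toward reference statistics.
calibrated = adaptive_instance_norm([1.0, 2.0, 3.0], target_mean=10.0, target_std=2.0)
```

After the call, the output has mean 10.0 and a standard deviation very close to 2.0 while keeping the relative ordering of the inputs.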

[88] Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models

Chenyu Lian,Hong-Yu Zhou,Dongyun Liang,Jing Qin,Liansheng Wang

Main category: cs.CV

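TL;DR: ALTA is an efficient medical vision-language alignment method that adapts a pretrained vision model and significantly improves retrieval and zero-shot classification performance.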

Details

Motivation: Conventional cross-modal contrastive learning methods have weak visual representation capabilities, while multimodal masked modeling yields strong visual representations but poor cross-modal matching. ALTA aims to resolve this contradiction. Method: ALTA adapts a vision model pretrained with masked record modeling and adds temporal-multiview radiograph inputs to improve vision-language alignment. Result: ALTA outperforms the best-performing counterpart by over 4% and approximately 6% absolute points in text-to-image and image-to-text retrieval accuracy, respectively. Conclusion: ALTA is efficient and also improves vision and language understanding; the code is open source. Abstract: Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.

[89] Do Concept Replacement Techniques Really Erase Unacceptable Concepts?

Anudeep Das,Gurjot Singh,Prach Chantasantitam,N. Asokan

Main category: cs.CV

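TL;DR: The paper examines the limitations of concept replacement techniques (CRTs) in generative models, in particular their failure in image-to-image (I2I) scenarios, and proposes a new technique, AntiMirror, that achieves both effectiveness and fidelity.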

Details

Motivation: Existing CRTs work well for text-to-image (T2I) models but fail to remove unacceptable concepts in I2I settings, and they neglect fidelity. Method: Use an I2I model to empirically expose the shortcomings of existing CRTs, then propose AntiMirror, which performs concept replacement through targeted image editing. Result: Existing CRTs fail in I2I settings; AntiMirror demonstrates that unacceptable concepts can be removed while preserving the fidelity of other concepts. Conclusion: The study highlights the limitations of CRTs in I2I settings and proposes AntiMirror as a solution, pointing to directions for future work. Abstract: Generative models, particularly diffusion-based text-to-image (T2I) models, have demonstrated astounding success. However, aligning them to avoid generating content with unacceptable concepts (e.g., offensive or copyrighted content, or celebrity likenesses) remains a significant challenge. Concept replacement techniques (CRTs) aim to address this challenge, often by trying to "erase" unacceptable concepts from models. Recently, model providers have started offering image editing services which accept an image and a text prompt as input, to produce an image altered as specified by the prompt. These are known as image-to-image (I2I) models. In this paper, we first use an I2I model to empirically demonstrate that today's state-of-the-art CRTs do not in fact erase unacceptable concepts. Existing CRTs are thus likely to be ineffective in emerging I2I scenarios, despite their proven ability to remove unwanted concepts in T2I pipelines, highlighting the need to understand this discrepancy between T2I and I2I settings. Next, we argue that a good CRT, while replacing unacceptable concepts, should preserve other concepts specified in the inputs to generative models. We call this fidelity. Prior work on CRTs has neglected fidelity in the case of unacceptable concepts. Finally, we propose the use of targeted image-editing techniques to achieve both effectiveness and fidelity. We present such a technique, AntiMirror, and demonstrate its viability.

Fabian Immel,Jan-Hendrik Pauls,Richard Fehler,Frank Bieder,Jonas Merkert,Christoph Stiller

Main category: cs.CV

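TL;DR: SDTagNet uses standard definition (SD) maps as priors, combining NLP-derived features with a point-level encoder, to markedly improve far-range detection accuracy in online high definition (HD) map construction.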

Details

Motivation: HD maps are costly to maintain, and existing online construction methods are limited by the perception range of onboard sensors, so easier-to-maintain SD maps are needed to improve performance. Method: Combine the semantic information of SD maps with NLP-derived features, and introduce a point-level encoder with orthogonal element identifiers to integrate all types of map elements uniformly. Result: On Argoverse 2 and nuScenes, performance improves by up to +5.9 mAP (+45%). Conclusion: By fully exploiting SD map information, SDTagNet markedly improves the accuracy and efficiency of HD map construction. Abstract: Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as prior, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet

[91] Do MIL Models Transfer?

Daniel Shao,Richard J. Chen,Andrew H. Song,Joel Runevic,Ming Y. Lu,Tong Ding,Faisal Mahmood

Main category: cs.CV

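TL;DR: The study shows that pretrained multiple instance learning (MIL) models transfer well in computational pathology: even when pretrained on different organs, they consistently outperform models trained from scratch.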

Details

Motivation: In computational pathology, MIL models struggle with small, weakly supervised datasets, and the potential of transfer learning remains underexplored. Method: Systematically evaluate 11 MIL models across 21 pretraining tasks covering morphological and molecular subtype prediction. Result: Pretrained MIL models perform well across organs and target tasks, generalize strongly, and need comparatively little pretraining data. Conclusion: MIL models are highly adaptable, and transfer learning can substantially boost performance on computational pathology tasks; the work also provides a standardized resource and pretrained weights. Abstract: Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource which standardizes the implementation of MIL models and collection of pretrained model weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab

[92] DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging

Felix Wagner,Pramit Saha,Harry Anthony,J. Alison Noble,Konstantinos Kamnitsas

Main category: cs.CV

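TL;DR: The paper proposes DIsoN, a decentralized OOD detection framework that exchanges model parameters rather than data, addressing training-data privacy and sharing constraints, and validates it on medical imaging data.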

Details

Motivation: Deploying ML models in safety-critical domains such as medical imaging requires detecting inputs not covered by the training data (OOD), but data privacy and sharing constraints prevent existing methods from directly comparing training and test data. Method: Propose the Isolation Network, which quantifies how hard it is to separate a test sample from the training data via a binary classification task, and extend it to DIsoN, which supports decentralized parameter exchange and class-conditional comparison. Result: Across 12 OOD detection tasks on four medical imaging datasets, DIsoN performs favorably against existing methods while preserving data privacy. Conclusion: DIsoN opens up a new service model for ML developers: remote, secure use of their training data for OOD detection. Abstract: Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code will be available upon acceptance at: *****

[93] Diffuse and Disperse: Image Generation with Representation Regularization

Runqian Wang,Kaiming He

Main category: cs.CV

TL;DR: Proposes Dispersive Loss, a simple plug-and-play regularizer that improves diffusion-based generative models without pre-training or additional parameters.

Details

Motivation: Diffusion-based generative models typically lack explicit regularization and have developed independently of progress in representation learning; Dispersive Loss is intended to bridge the gap between generative modeling and representation learning. Method: Propose Dispersive Loss, which regularizes the model by encouraging internal representations to disperse in the hidden space; it requires no positive sample pairs and so does not interfere with the sampling process used for regression. Result: Evaluated on ImageNet, Dispersive Loss consistently improves over baselines across a range of models. Conclusion: Dispersive Loss is a simple, effective regularizer that may help bring generative modeling and representation learning together. Abstract: The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose Dispersive Loss, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate Dispersive Loss on the ImageNet dataset across a range of models and report consistent improvements over widely used and strong baselines. We hope our work will help bridge the gap between generative modeling and representation learning.
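
The abstract says only that the loss pushes hidden representations apart and needs no positive pairs; it does not give a formula. As a rough illustration, the sketch below assumes one plausible contrastive-style form, the log of the mean of exp(-squared distance / tau) over all pairs in a batch. The function name `dispersive_loss`, the squared-Euclidean distance, and the temperature `tau` are assumptions for illustration, not the authors' definition.

```python
import math


def dispersive_loss(batch, tau=0.5):
    """Hypothetical dispersion penalty over a batch of hidden vectors.

    Computes log(mean_{i,j} exp(-||z_i - z_j||^2 / tau)). The value is 0 when
    all vectors coincide and becomes more negative as they spread apart, so
    minimizing it encourages dispersion. No positive pairs are involved.
    """
    logits = []
    for zi in batch:
        for zj in batch:
            d2 = sum((a - b) ** 2 for a, b in zip(zi, zj))
            logits.append(-d2 / tau)
    # Log-mean-exp with max subtraction for numerical stability.
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits) / len(logits))


collapsed = [[1.0, 2.0]] * 4                       # every representation identical
spread = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
print(dispersive_loss(collapsed), dispersive_loss(spread))
```

Under this assumed form, the term would be added to the usual regression objective with a small weight, leaving the sampling process untouched, which matches the plug-and-play description in the abstract.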

[94] Princeton365: A Diverse Dataset with Accurate Camera Pose

Karhan Kayan,Stamatis Alexandropoulos,Rishabh Jain,Yiming Zuo,Erich Liang,Jia Deng

Main category: cs.CV

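TL;DR: Princeton365 is a large-scale, diverse video dataset of 365 videos with accurate camera pose, bridging the gap between accuracy and data diversity in current SLAM benchmarks.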

Details

Motivation: Address the gap between accuracy and data diversity in existing SLAM benchmarks and provide a more informative evaluation metric. Method: Collect indoor, outdoor, and object-scanning videos using calibration boards and a 360-camera, with synchronized monocular and stereo RGB video and IMU data, and propose a new evaluation metric based on optical flow. Result: The dataset and metric allow SLAM methods to be compared across scenes, and a challenging new novel view synthesis benchmark is proposed. Conclusion: Princeton365 gives SLAM research more comprehensive data and evaluation tools and helps analyze methods' failure modes. Abstract: We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to existing metrics such as Average Trajectory Error (ATE), our new metric allows for comparison between the performance of SLAM methods across scenes, allowing researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit https://princeton365.cs.princeton.edu for the dataset, code, videos, and submission.

[95] Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

Dianyi Wang,Wei Song,Yikun Wang,Siyuan Wang,Kaicheng Yu,Zhongyu Wei,Jiaqi Wang

Main category: cs.CV

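TL;DR: The paper proposes ASVR, which jointly learns the visual and textual modalities through autoregressive semantic visual reconstruction and substantially improves multimodal understanding.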

Details

Motivation: Existing large vision-language models apply autoregressive supervision only to text sequences and do not fully exploit the visual modality, so they cannot use images without captions, miss visual details, and struggle to convey vision-centric content. Method: Introduce ASVR, which jointly learns visual and textual modalities in a unified autoregressive framework by autoregressively reconstructing the semantic representation of images rather than their raw appearance. Result: ASVR performs well across data scales and LLM backbones; for example, it raises LLaVA-1.5's average score across 14 multimodal benchmarks by 5%. Conclusion: Autoregressive semantic visual reconstruction consistently improves multimodal understanding and applies across data scales and model architectures. Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.

[96] Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models

Xuanchi Ren,Yifan Lu,Tianshi Cao,Ruiyuan Gao,Shengyu Huang,Amirmojtaba Sabour,Tianchang Shen,Tobias Pfaff,Jay Zhangjie Wu,Runjian Chen,Seung Wook Kim,Jun Gao,Laura Leal-Taixe,Mike Chen,Sanja Fidler,Huan Ling

Main category: cs.CV

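TL;DR: The paper proposes Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that produces challenging driving scenarios to offset the high cost of collecting and annotating real-world data.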

Details

Motivation: Collecting and annotating real-world data for autonomous vehicle (AV) systems is time-consuming and costly, and rare edge cases that are critical for training and testing are especially hard to capture. Method: Building on the NVIDIA Cosmos world foundation model, develop Cosmos-Drive, a suite of models for controllable, high-fidelity, multi-view, spatiotemporally consistent driving video generation, and use it to build the Cosmos-Drive-Dreams pipeline. Result: Experiments show that the generated data helps mitigate long-tail distribution problems and improves generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning. Conclusion: With an open-source toolkit, dataset, and model weights, Cosmos-Drive-Dreams offers an efficient data generation solution for autonomous driving. Abstract: Collecting and annotating real-world data for safety-critical physical AI systems, such as autonomous vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing of an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain and capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps in mitigating long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection and driving policy learning. We open source our pipeline toolkit, dataset and model weights through NVIDIA's Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams

[97] MagCache: Fast Video Generation with Magnitude-Aware Cache

Zehong Ma,Longhui Wei,Feng Wang,Shiliang Zhang,Qi Tian

Main category: cs.CV

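TL;DR: The paper proposes MagCache, an acceleration method for video diffusion models based on a unified magnitude law; it adaptively skips unimportant timesteps with a caching strategy, substantially increasing speed while preserving visual fidelity.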

Details

Motivation: Existing acceleration methods rely on uniform heuristics or time-embedding variants, risk inconsistent outputs due to prompt-specific overfitting, and need many calibration samples. Method: Observe that, across models and prompts, the magnitude ratio of successive residual outputs decreases monotonically, and propose MagCache, which skips unimportant timesteps using error modeling and an adaptive caching strategy. Result: MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, and outperforms existing methods on LPIPS, SSIM, and PSNR. Conclusion: MagCache is an efficient, robust acceleration method that needs only a single calibration sample and substantially improves the performance of video diffusion models. Abstract: Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically, steadily in most timesteps and rapidly in the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR, under comparable computational budgets.
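
The abstract describes the skipping rule only at a high level (an error model driven by the magnitude ratio, plus an adaptive cache). The sketch below shows one way such a rule could look, under stated assumptions: the function `plan_skips`, the accumulated-error formula, and the parameters `err_threshold` and `max_consecutive_skips` are all hypothetical and are not taken from the paper or its code.

```python
def plan_skips(mag_ratios, err_threshold=0.05, max_consecutive_skips=2):
    """Decide, per timestep, whether to reuse the cached residual.

    mag_ratios[t] is assumed to be ||residual_t|| / ||residual_{t-1}||,
    measured once on a calibration sample. Returns a list of booleans where
    True means "skip the model call and reuse the cache" at that step.
    """
    skip = []
    acc_ratio, acc_err, run = 1.0, 0.0, 0
    for t, ratio in enumerate(mag_ratios):
        if t == 0:
            skip.append(False)  # nothing is cached before the first step
            continue
        # Estimated drift if the cached residual is reused once more.
        cand_ratio = acc_ratio * ratio
        cand_err = acc_err + abs(1.0 - cand_ratio)
        if cand_err <= err_threshold and run < max_consecutive_skips:
            skip.append(True)
            acc_ratio, acc_err, run = cand_ratio, cand_err, run + 1
        else:
            skip.append(False)  # recompute and reset the error estimate
            acc_ratio, acc_err, run = 1.0, 0.0, 0
    return skip


# Ratios close to 1 allow skipping; ratios far from 1 force recomputation.
print(plan_skips([1.0, 0.99, 0.99, 0.99], max_consecutive_skips=5))
```

The design intent in this sketch is that steady, near-unity ratios in early timesteps permit reuse, while the rapid drop the abstract reports for the last several steps would exceed the threshold and trigger full computation.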

cs.GR [Back]

[98] Neural-Augmented Kelvinlet: Real-Time Soft Tissue Deformation with Multiple Graspers

Ashkan Shahbazi,Kyvia Pereira,Jon S. Heiselman,Elaheh Akbari,Annie C. Benson,Sepehr Seifi,Xinyuan Liu,Garrison L. Johnston,Erwin Terpstra,Anne Draaisma,Jan-Jaap Severes,Jie Ying Wu,Nabil Simaan,Michael L. Miga,Soheil Kolouri

Main category: cs.GR

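TL;DR: This paper proposes a physics-informed neural simulator that combines Kelvinlet-based priors with large-scale FEM simulations to simulate soft tissue deformation in real time with high accuracy and low latency.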

Details

Motivation: Fast, accurate simulation of soft tissue deformation is critical for surgical robotics and medical training. Method: Combine Kelvinlet-based priors with FEM simulations, using residual learning and regularization to improve the accuracy and physical consistency of neural network predictions. Result: The method shows high fidelity when simulating standard laparoscopic tissue grasping tools, demonstrating its effectiveness. Conclusion: Kelvinlet-augmented learning is a powerful, efficient strategy for real-time, physics-aware soft tissue simulation in surgery. Abstract: Fast and accurate simulation of soft tissue deformation is a critical factor for surgical robotics and medical training. In this paper, we introduce a novel physics-informed neural simulator that approximates soft tissue deformations in a realistic and real-time manner. Our framework integrates Kelvinlet-based priors into neural simulators, making it the first approach to leverage Kelvinlets for residual learning and regularization in data-driven soft tissue modeling. By incorporating large-scale Finite Element Method (FEM) simulations of both linear and nonlinear soft tissue responses, our method improves neural network predictions across diverse architectures, enhancing accuracy and physical consistency while maintaining low latency for real-time performance. We demonstrate the effectiveness of our approach by performing accurate surgical maneuvers that simulate the use of standard laparoscopic tissue grasping tools with high fidelity. These results establish Kelvinlet-augmented learning as a powerful and efficient strategy for real-time, physics-aware soft tissue simulation in surgical applications.

[99] A Real-time 3D Desktop Display

Livio Tenze,Enrique Canessa

Main category: cs.GR

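TL;DR: This paper presents an extended version of the altiro3D C++ library that generates 3D light-fields in real time from 2D images or video streams, uses AI techniques to improve performance, and adds a multi-platform GUI to simplify operation.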

Details

Motivation: Extend the altiro3D library to handle 3D video streams and use AI techniques to improve real-time 3D rendering performance and user experience. Method: Use the MiDaS CNN to extract a depth map from a single 2D image, apply AI computing techniques to optimize library performance, and develop a multi-platform GUI to simplify screen-region selection. Result: The extended altiro3D library processes 2D images, video streams, or desktop screen regions in real time and renders them as 3D light-fields for devices such as the Looking Glass. Conclusion: With AI techniques and GUI improvements, altiro3D achieves more efficient 3D content generation and friendlier user interaction. Abstract: A new extended version of the altiro3D C++ Library -- initially developed to get glasses-free holographic displays starting from 2D images -- is here introduced aiming to deal with 3D video streams from either 2D webcam images or flat video files. These streams are processed in real-time to synthesize light-fields (in Native format) and feed realistic 3D experiences. The core function needed to recreate multiviews consists in the use of the MiDaS Convolutional Neural Network (CNN), which extracts a depth map from a single 2D image. Artificial Intelligence (AI) computing techniques are applied to improve the overall performance of the extended altiro3D Library. Thus, altiro3D can now treat standard images, video streams or screen portions of a Desktop where other apps may also be running (like web browsers, video chats, etc) and render them into 3D. To achieve the latter, a screen region needs to be selected in order to feed the output directly into a light-field 3D device such as Looking Glass (LG) Portrait. In order to simplify the acquisition of a Desktop screen area by the user, a multi-platform Graphical User Interface has also been implemented. Sources available at: https://github.com/canessae/altiro3D/releases/tag/2.0.0

[100] GATE: Geometry-Aware Trained Encoding

Jakub Bokšanský,Daniel Meister,Carsten Benthin

Main category: cs.GR

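TL;DR: Proposes GATE, a novel geometry-aware encoding that stores feature vectors on the surface of triangular meshes and avoids the limitations of hash-based encodings.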

Details

Motivation: Encoding input parameters is a fundamental building block of neural network algorithms; its goal is to map input data to a higher-dimensional space, but existing hash-based encodings suffer from collisions, resolution selection, and related problems. Method: GATE stores feature vectors on the surface of triangular meshes and uses mesh colors to decouple feature vector density from geometry density, allowing finer control over training and adaptive level-of-detail. Result: The method suits neural rendering-related algorithms such as neural radiance caching and avoids hash collisions, resolution selection issues, and divergent memory access. Conclusion: GATE is an efficient, flexible input encoding scheme that improves on conventional hash-based encodings. Abstract: The encoding of input parameters is one of the fundamental building blocks of neural network algorithms. Its goal is to map the input data to a higher-dimensional space, typically supported by trained feature vectors. The mapping is crucial for the efficiency and approximation quality of neural networks. We propose a novel geometry-aware encoding called GATE that stores feature vectors on the surface of triangular meshes. Our encoding is suitable for neural rendering-related algorithms, for example, neural radiance caching. It also avoids limitations of previous hash-based encoding schemes, such as hash collisions, selection of resolution versus scene size, and divergent memory access. Our approach decouples feature vector density from geometry density using mesh colors, while allowing for finer control over neural network training and adaptive level-of-detail.

[101] Solving partial differential equations in participating media

Bailey Miller,Rohan Sawhney,Keenan Crane,Ioannis Gkioulekas

Main category: cs.GR

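TL;DR: Proposes two new algorithms for efficiently solving linear elliptic PDEs in complex microparticle geometry, more accurately and efficiently than previous approaches.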

Details

Motivation: Explicitly modeling PDEs in complex microparticle geometry is impractical; aggregate statistical properties are used to simplify the model. Method: Treat the domain as a participating medium and, building on the properties of exponential media, develop two Monte Carlo algorithms: volumetric walk on spheres and volumetric walk on stars. Result: On Laplace boundary value problems, the new algorithms are more accurate and more efficient than previous approaches such as ensemble averaging and homogenization. Conclusion: The proposed algorithms provide an efficient, discretization-free way to solve PDEs in complex microparticle geometry. Abstract: We consider the problem of solving partial differential equations (PDEs) in domains with complex microparticle geometry that is impractical, or intractable, to model explicitly. Drawing inspiration from volume rendering, we propose tackling this problem by treating the domain as a participating medium that models microparticle geometry stochastically, through aggregate statistical properties (e.g., particle density). We first introduce the problem setting of PDE simulation in participating media. We then specialize to exponential media and describe the properties that make them an attractive model of microparticle geometry for PDE simulation problems. We use these properties to develop two new algorithms, volumetric walk on spheres and volumetric walk on stars, that generalize previous Monte Carlo algorithms to enable efficient and discretization-free simulation of linear elliptic PDEs (e.g., Laplace) in participating media. We demonstrate experimentally that our algorithms can solve Laplace boundary value problems with complex microparticle geometry more accurately and more efficiently than previous approaches, such as ensemble averaging and homogenization.
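
The abstract presents the new algorithms as generalizations of earlier Monte Carlo methods. As background only, the following is a minimal sketch of the classic (non-volumetric) walk on spheres estimator for a Laplace Dirichlet problem on the unit disk. It is not the authors' volumetric variant; the function name, the unit-disk domain, and the parameters `n_walks` and `eps` are assumptions chosen for illustration.

```python
import math
import random


def walk_on_spheres(x, y, boundary_value, n_walks=5000, eps=1e-3, seed=0):
    """Estimate the harmonic function u(x, y) on the unit disk.

    Each walk repeatedly jumps to a uniformly random point on the largest
    circle centered at the current point that fits inside the domain, and
    stops once it is within eps of the boundary. The boundary value at the
    stopping point is one sample; the estimate is the sample mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        px, py = x, y
        while True:
            dist = 1.0 - math.hypot(px, py)  # distance to the unit circle
            if dist < eps:
                break
            theta = rng.uniform(0.0, 2.0 * math.pi)
            px += dist * math.cos(theta)
            py += dist * math.sin(theta)
        r = math.hypot(px, py)
        total += boundary_value(px / r, py / r)  # project onto the boundary
    return total / n_walks


# u(x, y) = x is harmonic, so the estimate at (0.3, 0.2) should be close to 0.3.
estimate = walk_on_spheres(0.3, 0.2, lambda bx, by: bx)
print(estimate)
```

No mesh or grid is built here, which is the sense in which such estimators are called discretization-free.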

[102] Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos

Weikun Peng,Jun Lv,Cewu Lu,Manolis Savva

Main category: cs.GR

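TL;DR: Proposes a method for reconstructing articulated objects from dynamic RGBD videos captured with a hand-held camera, removing existing methods' need for carefully captured data.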

Details

Motivation: Articulated objects are common in daily life, but existing methods require carefully captured data, limiting their practicality and scalability. Method: A coarse-to-fine framework infers joint parameters and segments movable parts from a dynamic RGBD video. Result: The method significantly outperforms existing methods on synthetic and real datasets and reconstructs articulated objects across categories. Conclusion: The method has a clear advantage for reconstructing articulated objects from dynamic RGBD videos and is suited to practical use. Abstract: Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20× larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.

[103] Complex-Valued Holographic Radiance Fields

Yicheng Zhan,Dong-Ha Shin,Seung-Hwan Baek,Kaan Akşit

Main category: cs.GR

TL;DR: Proposes a novel 3D holographic scene representation based on complex-valued Gaussian primitives that greatly speeds up rendering.

Details

Motivation: Fully modeling the amplitude and phase of light in 3D representations, in order to advance physically plausible rendering (especially for holographic displays), calls for an optimization approach that does not rely on intensity-based intermediaries. Method: Reformulate 3D Gaussian splatting with complex-valued Gaussian primitives and use RGBD multi-view images to directly optimize complex-valued Gaussians as a 3D holographic scene representation. Result: 30x to 10,000x faster than existing methods while maintaining on-par image quality. Conclusion: The method is a first step toward geometrically aligned, physically plausible holographic scene representations. Abstract: Modeling the full properties of light, including both amplitude and phase, in 3D representations is crucial for advancing physically plausible rendering, particularly in holographic displays. To support these features, we propose a novel representation that optimizes 3D scenes without relying on intensity-based intermediaries. We reformulate 3D Gaussian splatting with complex-valued Gaussian primitives, expanding support for rendering with light waves. By leveraging RGBD multi-view images, our method directly optimizes complex-valued Gaussians as a 3D holographic scene representation. This eliminates the need for computationally expensive hologram re-optimization. Compared with state-of-the-art methods, our method achieves 30x-10,000x speed improvements while maintaining on-par image quality, representing a first step towards geometrically aligned, physically plausible holographic scene representations.
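
To illustrate why a representation that carries both amplitude and phase differs from an intensity-only one, the short sketch below sums two waves as complex numbers before taking the intensity. This is textbook wave superposition, not the paper's rendering model; the helper `intensity` and the (amplitude, phase) tuples are assumptions made for illustration.

```python
import cmath
import math


def intensity(waves):
    """Intensity of the coherent sum of waves given as (amplitude, phase) pairs."""
    field = sum(a * cmath.exp(1j * p) for a, p in waves)
    return abs(field) ** 2


in_phase = intensity([(1.0, 0.0), (1.0, 0.0)])          # constructive interference
out_of_phase = intensity([(1.0, 0.0), (1.0, math.pi)])  # destructive interference
print(in_phase, out_of_phase)
```

Adding the two individual intensities would give 2 in both cases, so an intensity-only intermediary cannot distinguish these situations, whereas a complex-valued primitive can.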

[104] Fine-Grained Spatially Varying Material Selection in Images

Julia Guerrero-Viu,Michael Fischer,Iliyan Georgiev,Elena Garces,Diego Gutierrez,Belen Masia,Valentin Deschaintre

Main category: cs.GR

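TL;DR: Proposes a material selection method based on vision transformers (ViTs) that uses a multi-resolution processing strategy to improve selection accuracy and supports selection at two levels, texture and subtexture.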

Details

Motivation: Material selection is a key step in image editing, but existing methods are not robust enough to lighting and reflectance variations, which harms downstream edits. Method: Use ViT features with a multi-resolution processing strategy for selection at two levels, supported by a new two-level material selection (DuMaS) dataset of over 800,000 synthetic images. Result: The method yields finer and more stable selections at both the texture and subtexture levels than prior methods. Conclusion: The method gives image editing a more robust material selection tool that supports more complex downstream tasks. Abstract: Selection is the first step in many image editing processes, enabling faster and simpler modifications of all pixels sharing a common modality. In this work, we present a method for material selection in images, robust to lighting and reflectance variations, which can be used for downstream editing tasks. We rely on vision transformer (ViT) models and leverage their features for selection, proposing a multi-resolution processing strategy that yields finer and more stable selection results than prior methods. Furthermore, we enable selection at two levels: texture and subtexture, leveraging a new two-level material selection (DuMaS) dataset which includes dense annotations for over 800,000 synthetic images, both on the texture and subtexture levels.

cs.CL [Back]

[105] Conservative Bias in Large Language Models: Measuring Relation Predictions

Toyin Aguda,Erik Wilson,Allan Anzagira,Simerjot Kaur,Charese Smiley

Main category: cs.CL

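TL;DR: Large language models (LLMs) show a pronounced conservative bias in relation extraction, tending to choose the No_Relation label, which causes information loss. The study finds that conservative bias occurs twice as often as hallucination.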

Details

Motivation: Study the conservative bias of LLMs in relation extraction and its effect on information extraction. Method: Systematically evaluate conservative bias across multiple prompts, datasets, and relation types, introduce the concept of Hobson's choice, and use SBERT and LLM prompts to quantify semantic similarity. Result: Conservative bias occurs twice as often as hallucination, and the behavior shows semantic similarity between constrained and unconstrained prompts. Conclusion: Conservative bias reduces incorrect relation assignments but also causes significant information loss, a trade-off that model design must weigh. Abstract: Large language models (LLMs) exhibit pronounced conservative bias in relation extraction tasks, frequently defaulting to the No_Relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson's choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.

[106] QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Jacob Dineen,Aswin RRV,Qin Liu,Zhikun Xu,Xiao Ye,Ming Shen,Zhaonan Li,Shijie Lu,Chitta Baral,Muhao Chen,Ben Zhou

Main category: cs.CL

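TL;DR: QA-LIGN is an automatic symbolic reward decomposition method that derives a separate reward component for each principle, improving the transparency and adaptability of language model alignment while performing on par with or better than a DPO baseline.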

Details

Motivation: Existing reward-based alignment methods collapse diverse feedback into a single scalar reward, entangling objectives and hindering interpretability. Method: QA-LIGN formulates principle-specific evaluation questions and decomposes the reward into separate components, replacing the conventional black-box reward model. Result: Experiments show that QA-LIGN offers greater transparency and adaptability than conventional methods, with performance on par with or better than a DPO baseline. Conclusion: QA-LIGN offers a more interpretable and controllable approach to language model alignment without sacrificing end-task performance. Abstract: Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.

[107] EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu,Yinzhu Quan

Main category: cs.CL

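TL;DR: EconWebArena is a benchmark for evaluating autonomous agents on complex, multimodal economic tasks; it comprises 360 tasks spanning multiple economic domains and emphasizes authoritative data sources and web-based reasoning.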

Details

Motivation: Existing benchmarks pay little attention to authoritative data sources and multimodal economic tasks; EconWebArena aims to fill this gap. Method: Generate candidate tasks with large language models, then curate them manually for clarity, feasibility, and source reliability; evaluate multimodal LLMs as web agents. Result: The evaluation reveals substantial performance gaps, with persistent challenges in grounding, navigation, and multimodal understanding. Conclusion: EconWebArena provides a rigorous testbed for economic web intelligence and highlights the limits of current technology. Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.

Muhammad Usman,Muhammad Ahmad,M. Shahiki Tash,Irina Gelbukh,Rolando Quintero Tellez,Grigori Sidorov

Main category: cs.CL

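TL;DR: The paper proposes a multilingual hate speech detection approach based on attention layers and large language models, and reports notable performance gains on English, Spanish, and Urdu datasets.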

Details

Motivation: Hate speech on social media threatens online safety and inclusivity, yet hate speech detection in Urdu remains underexplored, especially with translation-based approaches. Method: Attention layers are used to enhance feature extraction, combined with models such as GPT-3.5 Turbo and Qwen 2.5 72B, and compared against traditional machine learning methods such as SVM. Result: Relative improvements of 8.75%, 8.97%, and 5.19% on English, Spanish, and Urdu respectively; the joint multilingual model reaches a macro F1 of 0.88. Conclusion: The approach offers an effective solution for multilingual hate speech detection and helps build safer digital communities. Abstract: Social media platforms are critical spaces for public discourse, shaping opinions and community dynamics, yet their widespread use has amplified harmful content, particularly hate speech, threatening online safety and inclusivity. While hate speech detection has been extensively studied in languages like English and Spanish, Urdu remains underexplored, especially using translation-based approaches. To address this gap, we introduce a trilingual dataset of 10,193 tweets in English (3,834 samples), Urdu (3,197 samples), and Spanish (3,162 samples), collected via keyword filtering, with a balanced distribution of 4,849 Hateful and 5,344 Not-Hateful labels. Our methodology leverages attention layers as a precursor to transformer-based models and large language models (LLMs), enhancing feature extraction for multilingual hate speech detection. For non-transformer models, we use TF-IDF for feature extraction. The dataset is benchmarked using state-of-the-art models, including GPT-3.5 Turbo and Qwen 2.5 72B, alongside traditional machine learning models like SVM and other transformers (e.g., BERT, RoBERTa). Three annotators, following rigorous guidelines, ensured high dataset quality, achieving a Fleiss' Kappa of 0.821. Our approach, integrating attention layers with GPT-3.5 Turbo and Qwen 2.5 72B, achieves strong performance, with macro F1 scores of 0.87 for English (GPT-3.5 Turbo), 0.85 for Spanish (GPT-3.5 Turbo), 0.81 for Urdu (Qwen 2.5 72B), and 0.88 for the joint multilingual model (Qwen 2.5 72B). These results reflect improvements of 8.75% in English (over SVM baseline 0.80), 8.97% in Spanish (over SVM baseline 0.78), 5.19% in Urdu (over SVM baseline 0.77), and 7.32% in the joint multilingual model (over SVM baseline 0.82). Our framework offers a robust solution for multilingual hate speech detection, fostering safer digital communities worldwide.
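The improvement percentages quoted in the abstract above appear to be relative gains over the SVM baseline rather than absolute F1 differences. A minimal sketch that recomputes them from the quoted macro F1 scores (the helper name `relative_gain_pct` is illustrative, not from the paper):

```python
# Macro F1 scores quoted in the abstract: (best model, SVM baseline).
scores = {
    "English": (0.87, 0.80),
    "Spanish": (0.85, 0.78),
    "Urdu": (0.81, 0.77),
    "Joint multilingual": (0.88, 0.82),
}


def relative_gain_pct(model_f1: float, baseline_f1: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (model_f1 - baseline_f1) / baseline_f1 * 100.0


# Round to two decimals, matching the precision used in the abstract.
gains = {lang: round(relative_gain_pct(m, b), 2) for lang, (m, b) in scores.items()}

for lang, gain in gains.items():
    print(f"{lang}: {gain:.2f}%")
```

Under this reading, the absolute gains are between 0.04 and 0.07 macro F1 points, which is worth keeping in mind when comparing against papers that report absolute differences.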

[109] ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding

Lijing Zhu,Qizhen Lan,Qing Tian,Wenbo Sun,Li Yang,Lu Xia,Yixin Xie,Xi Xiao,Tiehang Duan,Cui Tao,Shuteng Niu

Main category: cs.CL

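TL;DR: ETT-CKGE proposes efficient task-driven tokens to address the shortcomings of existing CKGE methods in knowledge preservation and computational efficiency.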

Details

Motivation: Existing CKGE methods underperform because they rely on manually designed node/relation importance scores and are computationally inefficient. Method: Learnable task-driven tokens are introduced, avoiding explicit node scoring or traversal; knowledge transfer is achieved through matrix operations. Result: Strong results on six benchmark datasets, with substantially improved training efficiency and scalability. Conclusion: ETT-CKGE matches or exceeds existing methods in predictive performance while being more efficient; the code is open source. Abstract: Continual Knowledge Graph Embedding (CKGE) seeks to integrate new knowledge while preserving past information. However, existing methods struggle with efficiency and scalability due to two key limitations: (1) suboptimal knowledge preservation between snapshots caused by manually designed node/relation importance scores that ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for node/relation importance calculation, leading to slow training and high memory overhead. To address these limitations, we introduce ETT-CKGE (Efficient, Task-driven, Tokens for Continual Knowledge Graph Embedding), a novel task-guided CKGE method that leverages efficient task-driven tokens for efficient and effective knowledge transfer between snapshots. Our method introduces a set of learnable tokens that directly capture task-relevant signals, eliminating the need for explicit node scoring or traversal. These tokens serve as consistent and reusable guidance across snapshots, enabling efficient token-masked embedding alignment between snapshots. Importantly, knowledge transfer is achieved through simple matrix operations, significantly reducing training time and memory usage. Extensive experiments across six benchmark datasets demonstrate that ETT-CKGE consistently achieves superior or competitive predictive performance, while substantially improving training efficiency and scalability compared to state-of-the-art CKGE methods. The code is available at: https://github.com/lijingzhu1/ETT-CKGE/tree/main

[110] Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction

Gerardo Aleman Manzanarez,Nora de la Cruz Arana,Jorge Garcia Flores,Yobany Garcia Medina,Raul Monroy,Nathalie Pernelle

Main category: cs.CL

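TL;DR: The paper proposes GrAImes, an evaluation protocol grounded in literary theory for objectively assessing the literary value of AI-generated microfiction, and validates it with literature experts and enthusiasts.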

Details

Motivation: Although AI can generate narratively consistent and linguistically coherent short fiction, the assessment of its literary merit, especially aesthetic quality, has not been studied systematically. Method: The GrAImes evaluation protocol is proposed; grounded in literary theory, it assesses texts along several aspects, including thematic coherence, textual clarity, interpretive depth, and aesthetic quality. Result: The protocol was validated with responses from literature experts and literary enthusiasts. Conclusion: GrAImes provides an objective framework for assessing the literary value of AI-generated microfiction, filling a research gap. Abstract: Automated story writing has been a subject of study for over 60 years. Large language models can generate narratively consistent and linguistically coherent short fiction texts. Despite these advancements, rigorous assessment of such outputs for literary merit - especially concerning aesthetic qualities - has received scant attention. In this paper, we address the challenge of evaluating AI-generated microfictions and argue that this task requires consideration of literary criteria across various aspects of the text, such as thematic coherence, textual clarity, interpretive depth, and aesthetic quality. To facilitate this, we present GrAImes: an evaluation protocol grounded in literary theory, specifically drawing from a literary perspective, to offer an objective framework for assessing AI-generated microfiction. Furthermore, we report the results of our validation of the evaluation protocol, as answered by both literature experts and literary enthusiasts. This protocol will serve as a foundation for evaluating automatically generated microfictions and assessing their literary value.

[111] LLM-BT: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding

Li Weigang,Pedro Carvalho Brom

Main category: cs.CL

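TL;DR: LLM-BT uses large language models (LLMs) to standardize terminology across languages: a back-translation framework verifies term consistency, supports multi-path verification, and treats back-translation as a dynamic semantic embedding.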

Details

Motivation: Traditional expert-driven terminology standardization struggles to maintain multilingual consistency in fast-moving fields such as AI and quantum computing, so an automated solution is needed. Method: The LLM-BT framework combines term-level consistency validation, a multi-path verification workflow (serial and parallel back-translation routes), and the notion of dynamic semantic embedding. Result: Experiments show term consistency above 90% and strong cross-lingual robustness (BLEU > 0.45; Portuguese accuracy 100%). Conclusion: LLM-BT provides an efficient tool for multilingual terminology standardization, enabling human-AI collaboration and supporting terminology governance across scientific and technological fields worldwide. Abstract: The rapid growth of English technical terms challenges traditional expert-driven standardization, especially in fast-evolving fields like AI and quantum computing. Manual methods struggle to ensure multilingual consistency. We propose LLM-BT, a back-translation framework powered by large language models (LLMs) to automate terminology verification and standardization via cross-lingual semantic alignment. Our contributions are: (1) Term-Level Consistency Validation: Using English → intermediate language → English back-translation, LLM-BT achieves high term consistency across models (e.g., GPT-4, DeepSeek, Grok), with case studies showing over 90% exact or semantic matches. (2) Multi-Path Verification Workflow: A novel "Retrieve-Generate-Verify-Optimize" pipeline integrates serial (e.g., EN → ZHcn → ZHtw → EN) and parallel (e.g., EN → Chinese/Portuguese → EN) BT routes. BLEU and term accuracy indicate strong cross-lingual robustness (BLEU > 0.45; Portuguese accuracy 100%). (3) Back-Translation as Semantic Embedding: BT is conceptualized as dynamic semantic embedding, revealing latent meaning trajectories. Unlike static embeddings, LLM-BT provides transparent path-based embeddings shaped by model evolution. LLM-BT transforms back-translation into an active engine for multilingual terminology standardization, enabling human-AI collaboration: machines ensure semantic fidelity, humans guide cultural interpretation. This infrastructure supports terminology governance across scientific and technological fields worldwide.
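The term-level consistency check described above (English → intermediate language → English) can be illustrated with a small sketch. The two dictionaries below are hypothetical stand-ins for real translation calls; in the paper the translation steps are performed by LLMs, and the function names here are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# Hypothetical toy "translators"; a real pipeline would call an LLM or MT system.
EN_TO_PT: Dict[str, str] = {
    "neural network": "rede neural",
    "quantum computing": "computação quântica",
    "fine-tuning": "ajuste fino",
}
PT_TO_EN: Dict[str, str] = {
    "rede neural": "neural network",
    "computação quântica": "quantum computing",
    "ajuste fino": "fine tuning",  # deliberately drifts from the source term
}


def back_translate(term: str,
                   forward: Callable[[str], str],
                   backward: Callable[[str], str]) -> str:
    """EN -> intermediate language -> EN round trip for one term."""
    return backward(forward(term))


def exact_match_rate(terms: List[str],
                     forward: Callable[[str], str],
                     backward: Callable[[str], str]) -> float:
    """Fraction of terms that survive the round trip unchanged (case-insensitive)."""
    matches = sum(
        1 for t in terms
        if back_translate(t, forward, backward).lower() == t.lower()
    )
    return matches / len(terms)


terms = list(EN_TO_PT)
rate = exact_match_rate(terms, EN_TO_PT.__getitem__, PT_TO_EN.__getitem__)
print(f"exact-match consistency: {rate:.2%}")
```

This sketch only counts exact matches; the abstract also counts semantic matches, which would need a similarity measure that is not specified here.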

[112] Unable to forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Chupei Wang,Jiaqiu Vince Sun

Main category: cs.CL

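TL;DR: The paper studies how information retrieval in large language models (LLMs) is tied to generation, finds that in-context interference degrades retrieval accuracy, and proposes an evaluation called PI-LLM.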

Details

Motivation: To investigate interference effects on information retrieval in LLMs, in particular how proactive interference (PI) affects retrieval accuracy. Method: The PI-LLM evaluation sequentially streams semantically related key-value updates and queries only the final values, analyzing how accumulated interference affects retrieval. Result: LLM retrieval accuracy declines log-linearly as interference accumulates, and prompt engineering mitigates this only to a limited extent. Conclusion: LLMs face a working memory bottleneck in disentangling interference and flexibly manipulating information; models need a stronger ability to suppress irrelevant content. Abstract: Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
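To make the evaluation setup concrete, here is a minimal sketch of how a stream of overwriting key-value updates and its ground truth might be generated. The prompt wording, the function names, and the exact-match scoring rule are assumptions for illustration; the abstract does not specify them.

```python
import random
from typing import Dict, List, Tuple


def build_pi_stream(keys: List[str], values: List[str],
                    updates_per_key: int, seed: int = 0) -> Tuple[str, Dict[str, str]]:
    """Build a prompt of interleaved key-value updates plus the ground-truth final values."""
    rng = random.Random(seed)
    updates = [(key, rng.choice(values)) for key in keys for _ in range(updates_per_key)]
    rng.shuffle(updates)  # interleave updates to different keys

    final: Dict[str, str] = {}
    lines: List[str] = []
    for key, value in updates:
        final[key] = value  # later updates overwrite earlier ones
        lines.append(f"{key}: {value}")

    question = "What is the current value of each key? " + ", ".join(keys)
    return "\n".join(lines) + "\n" + question, final


def score(predicted: Dict[str, str], final: Dict[str, str]) -> float:
    """Fraction of keys whose predicted value equals the last streamed value."""
    correct = sum(1 for k, v in final.items() if predicted.get(k) == v)
    return correct / len(final)


prompt, truth = build_pi_stream(["bp", "hr"], ["low", "normal", "high"], updates_per_key=4)
print(prompt)
print(truth)
```

Increasing `updates_per_key` raises the number of overwritten values per key, which is the interference load the abstract reports as driving accuracy down.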

[113] "I Wrote, I Paused, I Rewrote" Teaching LLMs to Read Between the Lines of Student Writing

Samra Zafar,Shaheer Minhas,Syed Ali Hassan Zaidi,Arfa Naeem,Zahra Ali

Main category: cs.CL

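TL;DR: The study explores whether writing process data (such as keystroke logs and periodic snapshots) can improve the feedback large language models (LLMs) give on student writing. Students preferred process-aware feedback, finding it closer to their own thinking.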

Details

Motivation: Current LLM feedback is based only on the final text and lacks knowledge of the writing process, so it may not accurately reflect how students think and revise. Method: A digital writing tool was built that records what students type and how their essays evolve. Twenty students used it to write timed essays; an LLM generated feedback from the final text and the full writing trace, and students then rated the usefulness and relatability of the feedback. Result: Students preferred the process-aware feedback, and certain edits (such as adding content or reorganizing paragraphs) were associated with higher scores. Conclusion: Making LLMs more aware of the writing process can produce more meaningful, personal, and supportive feedback. Abstract: Large language models (LLMs) like Gemini are becoming common tools for supporting student writing. But most of their feedback is based only on the final essay, missing important context about how that text was written. In this paper, we explore whether using writing process data, collected through keystroke logging and periodic snapshots, can help LLMs give feedback that better reflects how learners think and revise while writing. We built a digital writing tool that captures both what students type and how their essays evolve over time. Twenty students used this tool to write timed essays, which were then evaluated in two ways: (i) LLM-generated feedback using both the final essay and the full writing trace, and (ii) after the task, students completed surveys about how useful and relatable they found the feedback. Early results show that learners preferred the process-aware LLM feedback, finding it more in tune with their own thinking. We also found that certain types of edits, like adding new content or reorganizing paragraphs, aligned closely with higher scores in areas like coherence and elaboration. Our findings suggest that making LLMs more aware of the writing process can lead to feedback that feels more meaningful, personal, and supportive.

[114] Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions

Yu-Ang Lee,Guan-Ting Yi,Mei-Yi Liu,Jui-Chao Lu,Guan-Bo Yang,Yun-Nung Chen

Main category: cs.CL

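TL;DR: This survey systematically reviews recent progress in optimizing compound AI systems, covering both numerical and language-based techniques, and outlines future research directions.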

Details

Motivation: As compound AI systems grow more complex, optimizing their components and the interactions between them raises new challenges that call for new methods. Method: The survey formalizes the notion of compound AI system optimization, classifies existing methods, and analyzes the potential of newer techniques such as natural language feedback. Result: It summarizes how numerical and language-based techniques are applied to optimizing compound AI systems and identifies open research problems. Conclusion: Compound AI system optimization is a rapidly evolving field, and further methods are needed to cope with growing complexity. Abstract: Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at https://github.com/MiuLab/AISysOpt-Survey.

[115] Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim → Evidence Reasoning

Shashidhar Reddy Javaji,Yupeng Cao,Haohang Li,Yangyang Yu,Nikhil Muralidhar,Zining Zhu

Main category: cs.CL

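TL;DR: CLAIM-BENCH is a benchmark for evaluating how well large language models (LLMs) extract and validate claims and evidence in scientific papers. It exposes the limitations of LLMs on complex scientific content and identifies prompting approaches that improve performance.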

Details

Motivation: To examine whether LLMs can truly understand and process the logical relationships in complex research papers, especially the links between claims and supporting evidence. Method: Three approaches inspired by divide-and-conquer strategies are systematically compared across six LLMs, evaluating more than 300 claim-evidence pairs. Result: Closed-source models (such as GPT-4 and Claude) outperform open-source models in precision and recall; three-pass and one-by-one prompting significantly improve performance at increased computational cost. Conclusion: CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs and offers tools and direction for building more reliable reasoning systems. Abstract: Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches which are inspired by divide and conquer approaches, across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs' ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.

[116] Automatic Generation of Inference Making Questions for Reading Comprehension Assessments

Wanjing Anya Ma,Michael Flor,Zuowei Wang

Main category: cs.CL

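TL;DR: The paper discusses the importance of inference making in reading comprehension, proposes a taxonomy of inference types, and uses GPT-4o to generate diagnostic reading comprehension items. The generated items are of high quality, but only a minority accurately match the targeted inference type.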

Details

Motivation: Diagnostic reading comprehension questions help educators provide more targeted reading instruction and intervention, but inference skills are complex and varied, calling for systematic classification and efficient generation methods. Method: A taxonomy of inference types is proposed and used to analyze the distribution of items in an item bank; GPT-4o generates inference items via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Result: 93.8% of the GPT-4o-generated questions were of good quality and suitable for grades 3-12, but only 42.6% accurately matched the targeted inference type. Conclusion: Combining automatic item generation with human judgment is a promising path toward scalable, high-quality diagnostic reading comprehension assessments. Abstract: Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.

[117] Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Matteo Cargnelutti,Catherine Brobston,John Hess,Jack Cushman,Kristi Mukk,Aristana Scourtas,Kyle Courtney,Greg Leppert,Amanda Watson,Martha Whitehead,Jonathan Zittrain

Main category: cs.CL

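TL;DR: The report introduces Institutional Books 1.0, a dataset built from public domain books in Harvard Library's collections, intended to provide high-quality, diverse training data for large language models.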

Details

Motivation: Data quality directly affects the performance of large language models, but high-quality public data is scarce and dataset stewardship needs sustainable practices. Method: Public domain books from Harvard Library were extracted, analyzed, and processed into a thoroughly documented dataset containing OCR text and metadata. Result: Text and metadata for 983,004 public domain volumes, totaling 242B tokens, have been released. Conclusion: The dataset makes this historical collection more accessible and easier to use for both humans and machines. Abstract: Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.

[118] Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency

Chenlong Wang,Yuanning Feng,Dongping Chen,Zhaoyang Chu,Ranjay Krishna,Tianyi Zhou

Main category: cs.CL

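TL;DR: NoWait disables explicit self-reflection tokens (such as "Wait" and "Hmm") to reduce redundant output during reasoning, shortening reasoning trajectories by 27%-51% while preserving model performance.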

Details

Motivation: Large reasoning models often overthink on complex problems, producing redundant output that hurts efficiency. The study asks whether explicit self-reflection tokens are necessary. Method: NoWait suppresses explicit self-reflection tokens (such as "Wait" and "Hmm") during inference. Result: Across 10 benchmarks, NoWait reduces reasoning trajectory length by 27%-51% without compromising model utility. Conclusion: NoWait is a plug-and-play solution for efficient multimodal reasoning. Abstract: Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as "Wait" and "Hmm", is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%-51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.
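Suppressing tokens during inference is commonly implemented by masking their logits before the next token is chosen. The sketch below shows that general idea on a toy vocabulary; it assumes greedy decoding and a hand-written logit vector, and it is not the authors' implementation.

```python
import math
from typing import List, Sequence


def suppress_tokens(logits: Sequence[float], banned_ids: Sequence[int]) -> List[float]:
    """Return a copy of the logits with banned token ids set to -inf."""
    masked = list(logits)
    for i in banned_ids:
        masked[i] = -math.inf
    return masked


def greedy_pick(logits: Sequence[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary and one decoding step (values are made up for illustration).
vocab = ["Wait", "Hmm", "So", "the", "answer"]
banned = [i for i, tok in enumerate(vocab) if tok in {"Wait", "Hmm"}]
step_logits = [3.2, 2.9, 2.5, 0.1, 1.7]

unmasked_choice = vocab[greedy_pick(step_logits)]
chosen = vocab[greedy_pick(suppress_tokens(step_logits, banned))]
print(unmasked_choice, "->", chosen)
```

In a real tokenizer a word such as "Wait" can map to several token ids (with or without a leading space, different casing), so the banned list would need to cover all of those variants.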

[119] Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving

Yuxuan Zhou,Xien Liu,Chenwei Yan,Chen Ning,Xiao Zhang,Boxun Li,Xiangling Fu,Shijin Wang,Guoping Hu,Yu Wang,Ji Wu

Main category: cs.CL

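TL;DR: The paper proposes a multi-cognitive-level evaluation framework based on Bloom's Taxonomy for assessing large language models (LLMs) in the medical domain, and finds that performance declines significantly as cognitive complexity increases.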

Details

Motivation: To explore how LLMs' medical capabilities vary across cognitive levels, a question existing work leaves largely open. Method: A multi-cognitive-level evaluation framework integrates existing medical datasets and introduces tasks targeting three cognitive levels; LLMs from six prominent families are evaluated. Result: Performance declines significantly as cognitive complexity increases, and model size plays a more critical role at higher cognitive levels. Conclusion: The study highlights the need to strengthen LLMs' medical capabilities at higher cognitive levels and offers insights for developing LLMs suited to real-world medical applications. Abstract: Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our study highlights the need to enhance LLMs' medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.

[120] Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

Yiqun Sun,Qiang Huang,Anthony K. H. Tung,Jun Yu

Main category: cs.CL

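TL;DR: This position paper argues that text embedding research should move beyond surface semantics and treat implicit semantics as a core modeling goal, and it proposes directions for improvement.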

Details

Motivation: Current text embedding models focus mainly on surface semantics, whereas linguistic theory stresses implicit meaning shaped by pragmatics, speaker intent, and sociocultural context. Existing models perform poorly in these respects. Method: A pilot study exposes the shortcomings of existing models on implicit semantics tasks, and recommendations for improvement are proposed. Result: The pilot study shows that even state-of-the-art models perform only marginally better than simple baselines on implicit semantics tasks. Conclusion: The paper calls on the research community to adopt richer training data, design benchmarks that probe deeper understanding, and explicitly treat implicit semantics as a modeling objective. Abstract: This position paper argues that the text embedding research community should move beyond surface meaning and embrace implicit semantics as a central modeling goal. Text embedding models have become foundational in modern NLP, powering a wide range of applications and drawing increasing research attention. Yet, much of this progress remains narrowly focused on surface-level semantics. In contrast, linguistic theory emphasizes that meaning is often implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current embedding models are typically trained on data that lacks such depth and evaluated on benchmarks that reward the capture of surface meaning. As a result, they struggle with tasks requiring interpretive reasoning, speaker stance, or social meaning. Our pilot study highlights this gap, showing that even state-of-the-art models perform only marginally better than simplistic baselines on implicit semantics tasks. To address this, we call for a paradigm shift: embedding research should prioritize more diverse and linguistically grounded training data, design benchmarks that evaluate deeper semantic understanding, and explicitly frame implicit meaning as a core modeling objective, better aligning embeddings with real-world language complexity.

[121] DEAL: Disentangling Transformer Head Activations for LLM Steering

Li-Ming Zhan,Bo Liu,Zexin Lu,Chengqiang Xie,Jiannong Cao,Xiao-Ming Wu

Main category: cs.CL

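TL;DR: The paper proposes a causal-attribution framework for identifying behavior-relevant attention heads in Transformers; a VQ-AE quantizes behavior-relevant and behavior-irrelevant subspaces to make inference-time interventions more accurate.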

Details

Motivation: Current module selection methods rely on superficial cues or ad-hoc heuristics, which can lead to suboptimal or unintended outcomes; a more principled way to identify behavior-relevant modules is needed. Method: A VQ-AE is trained on each attention head's activations to partition the latent space into behavior-relevant and behavior-irrelevant subspaces; a behavioral relevance score is computed with a binary classification metric. Result: Validated on seven LLMs and five behavioral datasets, the method achieves superior performance on the truthfulness-steering task, and the selected heads generalize well across domains. Conclusion: The proposed causal-attribution framework identifies behavior-relevant attention heads more accurately and provides effective guidance for inference-time intervention. Abstract: Inference-time steering aims to alter the response characteristics of large language models (LLMs) without modifying their underlying parameters. A critical step in this process is the identification of internal modules within LLMs that are associated with the target behavior. However, current approaches to module selection often depend on superficial cues or ad-hoc heuristics, which can result in suboptimal or unintended outcomes. In this work, we propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces, each quantized with a shared learnable codebook. We assess the behavioral relevance of each head by quantifying the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses using a binary classification metric. This yields a behavioral relevance score that reflects each head's discriminative capacity with respect to the target behavior, guiding both selection and importance weighting. Experiments on seven LLMs from two model families and five behavioral steering datasets demonstrate that our method enables more accurate inference-time interventions, achieving superior performance on the truthfulness-steering task. Furthermore, the heads selected by our approach exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.

[122] CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs

Jash Rajesh Parekh,Pengcheng Jiang,Jiawei Han

Main category: cs.CL

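TL;DR: CC-RAG integrates zero-shot triple extraction and theme-aware graph chaining to improve causal reasoning in RAG, outperforming standard RAG and zero-shot LLMs.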

Details

Motivation: Large language models struggle with causal reasoning, especially in specialized domains that require deep inference. Standard RAG lacks structured modeling of causal dependencies. Method: CC-RAG combines zero-shot triple extraction with theme-aware graph chaining, building a Directed Acyclic Graph (DAG) and reasoning over it with forward/backward chaining. Result: In experiments on two domains, Bitcoin price fluctuations and Gaucher disease, CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity. Conclusion: Explicitly modeling causal structure improves the accuracy and interpretability of LLM-generated answers, especially in specialized domains. Abstract: Understanding cause and effect relationships remains a formidable challenge for Large Language Models (LLMs), particularly in specialized domains where reasoning requires more than surface-level correlations. Retrieval-Augmented Generation (RAG) improves factual accuracy, but standard RAG pipelines treat evidence as flat context, lacking the structure required to model true causal dependencies. We introduce Causal-Chain RAG (CC-RAG), a novel approach that integrates zero-shot triple extraction and theme-aware graph chaining into the RAG pipeline, enabling structured multi-hop inference. Given a domain-specific corpus, CC-RAG constructs a Directed Acyclic Graph (DAG) of triples and uses forward/backward chaining to guide structured answer generation. Experiments on two real-world domains: Bitcoin price fluctuations and Gaucher disease, show that CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity. Both LLM-as-a-Judge and human evaluations consistently favor CC-RAG. Our results demonstrate that explicitly modeling causal structure enables LLMs to generate more accurate and interpretable responses, especially in specialized domains where flat retrieval fails.
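Forward and backward chaining over a graph of (cause, relation, effect) triples can be sketched with a breadth-first traversal. The triples below are invented for illustration, and the function names are assumptions; the abstract does not describe the extraction or chaining procedure in this detail.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (cause, relation, effect)

# Hypothetical triples, as if extracted from a domain corpus.
triples: List[Triple] = [
    ("rate hike", "causes", "lower risk appetite"),
    ("lower risk appetite", "causes", "lower bitcoin demand"),
    ("lower bitcoin demand", "causes", "bitcoin price drop"),
    ("exchange hack", "causes", "bitcoin price drop"),
]


def build_adjacency(items: List[Triple]):
    """Build cause->effects and effect->causes adjacency lists."""
    forward: DefaultDict[str, List[str]] = defaultdict(list)
    backward: DefaultDict[str, List[str]] = defaultdict(list)
    for cause, _relation, effect in items:
        forward[cause].append(effect)
        backward[effect].append(cause)
    return forward, backward


def chain(start: str, adjacency: Dict[str, List[str]], max_hops: int = 3) -> List[str]:
    """Breadth-first chaining: nodes reachable from start within max_hops."""
    seen = {start}
    order: List[str] = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


forward, backward = build_adjacency(triples)
effects = chain("rate hike", forward)            # forward chaining: downstream effects
causes = chain("bitcoin price drop", backward)   # backward chaining: upstream causes
print(effects)
print(causes)
```

The retrieved chain could then be serialized into the prompt in causal order, which is one plausible way to give the generator structured rather than flat context.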

[123] Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding

Zikai Xiao,Ziyang Wang,Wen Ma,Yan Zhang,Wei Shen,Yan Wang,Luqi Gong,Zuozhu Liu

Main category: cs.CL

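TL;DR: The paper proposes Positional Contrastive Decoding (PCD), a training-free method that contrasts logits from long-aware attention with logits from local-aware attention to mitigate long-context performance degradation, and reports state-of-the-art results.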

Details

Motivation: Large language models (LLMs) degrade on long contexts; existing solutions are expensive to train, and statistical behaviors and low-cost approaches are underexplored. The study identifies the Posterior Salience Attenuation (PSA) phenomenon, in which the salience ratio correlates with long-text performance degradation. Method: PCD contrasts the logits derived from long-aware attention with those from local-aware attention, exploiting the gains of large-scale short-to-long training and alleviating attention score degradation. Result: Experiments show that PCD effectively alleviates attention score degradation and achieves state-of-the-art performance on long-context benchmarks. Conclusion: PCD is an efficient, training-free method that substantially improves LLM performance on long-context tasks. Abstract: While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by it, we propose the training-free Positional Contrastive Decoding (PCD) that contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through the analysis of long-term decay simulation, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks.
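The abstract does not give PCD's exact formula. As a rough illustration of the general idea of contrasting two sets of logits, the sketch below uses a generic contrastive-decoding combination, `(1 + alpha) * long - alpha * local`; this form, the parameter `alpha`, and the toy numbers are all assumptions and should not be read as the paper's method.

```python
from typing import List, Sequence


def contrast_logits(long_logits: Sequence[float],
                    local_logits: Sequence[float],
                    alpha: float = 1.0) -> List[float]:
    """Generic contrastive combination: amplify what the long-aware pass adds
    over the local-aware pass. This form is an assumption, not the paper's formula."""
    if len(long_logits) != len(local_logits):
        raise ValueError("logit vectors must have the same length")
    return [(1.0 + alpha) * a - alpha * b for a, b in zip(long_logits, local_logits)]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


# Toy example: token 1 is the "gold" token that only the long-aware pass supports.
long_logits = [2.0, 1.9, 0.5]
local_logits = [2.0, 0.9, 0.5]

before = argmax(long_logits)
after = argmax(contrast_logits(long_logits, local_logits, alpha=1.0))
print(before, "->", after)
```

The toy case mirrors the abstract's observation that gold tokens still rank highly despite attenuation: the contrast promotes a token that was a close second under the long-aware logits alone.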

[124] Draft-based Approximate Inference for LLMs

Kevin Galim,Ethan Ewer,Wonjun Kang,Minjae Lee,Hyung Il Koo,Kangwook Lee

Main category: cs.CL

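TL;DR: The paper proposes a framework that uses small draft models to optimize long-context large language model (LLM) inference, with two instantiations, SpecKV and SpecPC, that improve accuracy while retaining efficiency gains.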

Details

Motivation: Because of the quadratic compute and linear memory complexity of Transformers, optimizing long-context LLM inference is increasingly important. Existing methods rely on rough predictions and cannot assess importance accurately. Method: Two methods are proposed: SpecKV, which uses a draft output to assess the importance of each KV pair, and SpecPC, which uses the draft model's attention activations to identify unimportant prompt tokens. Result: Experiments show higher accuracy on long-context benchmarks while preserving the same improvements in memory, latency, and throughput. Conclusion: To the authors' knowledge, this is the first use of draft models for approximate LLM inference acceleration, extending their utility beyond lossless speculative decoding and outperforming existing baselines. Abstract: Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
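Importance-based KV cache dropping, as described above, ultimately reduces to keeping the highest-scoring cache positions under a budget. The sketch below shows only that selection step, assuming per-position importance scores are already available (for example, derived from a draft model); how SpecKV actually computes those scores is not specified in the abstract, and the function name is hypothetical.

```python
from typing import List, Sequence


def select_kv_to_keep(importance: Sequence[float], budget: int) -> List[int]:
    """Return the positions of the `budget` most important KV pairs, in original order."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])


# Toy importance scores for five cached positions (made-up values).
importance = [0.05, 0.40, 0.10, 0.30, 0.15]
kept = select_kv_to_keep(importance, budget=2)
print(kept)
```

Returning the kept positions in their original order preserves the sequence layout, which matters when the retained keys and values carry positional information.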

[125] EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models

Tao Zou,Xinghua Zhang,Haiyang Yu,Minzheng Wang,Fei Huang,Yongbin Li

Main category: cs.CL

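TL;DR: The paper presents EIFBENCH, a benchmark of extremely complex instructions for more realistic evaluation of large language models (LLMs) in multi-task, constrained settings, and proposes the SegPO algorithm to improve model capability.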

Details

Motivation: Existing benchmarks are limited to single-task settings and do not reflect the complexity of real-world scenarios, so a more comprehensive evaluation tool is needed. Method: The EIFBENCH benchmark includes multi-task scenarios and a variety of constraints; the SegPO algorithm is proposed to improve LLMs' multi-task execution. Result: Evaluations show considerable performance discrepancies among existing LLMs under complex instructions, indicating the need for further optimization. Conclusion: EIFBENCH offers a more realistic evaluation standard for complex LLM applications, and SegPO helps improve model performance. Abstract: With the development and widespread application of large language models (LLMs), the new paradigm of "Model as Product" is rapidly evolving, and demands higher capabilities to address complex user needs, often requiring precise workflow execution which involves the accurate understanding of multiple tasks. However, existing benchmarks focusing on single-task environments with limited constraints lack the complexity required to fully reflect real-world scenarios. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to facilitate a more realistic and robust evaluation of LLMs. EIFBENCH not only includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently, but also integrates a variety of constraints, replicating complex operational environments. Furthermore, we propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflow. Evaluations on EIFBENCH have unveiled considerable performance discrepancies in existing LLMs when challenged with these extremely complex instructions. This finding underscores the necessity for ongoing optimization to navigate the intricate challenges posed by LLM applications.

[126] mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks

Luel Hagos Beyene,Vivek Verma,Min Ma,Jesujoba O. Alabi,Fabian David Schmidt,Joyce Nakatumba-Nabende,David Ifeoluwa Adelani

Main category: cs.CL

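TL;DR: The paper introduces the mSTEB benchmark for evaluating LLMs on low-resource languages and finds a significant performance gap between high-resource and low-resource languages.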

Details

Motivation: Address the lack of a standardized evaluation benchmark for low-resource languages and fill the gap in evaluating LLMs on multilingual tasks. Method: Introduces the mSTEB benchmark, covering language identification, text classification, question answering, and translation, and evaluates leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio). Result: The performance gap between high-resource and low-resource languages is wide, especially for languages of Africa and the Americas/Oceania. Conclusion: More investment is needed to address the under-representation of low-resource languages in LLMs. Abstract: Large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.

[127] TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Weiya Li,Junjie Chen,Bei Li,Boyang Liu,Zichen Wen,Nuanqiao Shan,Xiaoqian Liu,Anping Liu,Huajie Liu,Youyan Wang,Wujiuge Yin,Hu Song,Bing Huang,Zhiyuan Xia,Jialiang Chen,Linfeng Zhang

Main category: cs.CL

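TL;DR: The paper proposes TACTIC, a multi-agent translation framework grounded in cognitive translation theory that improves translation quality by simulating the cognitive processes of human translators; experiments show it outperforms existing models.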

Details

Motivation: Although LLMs have made remarkable progress in machine translation, their potential has not been fully realized. Existing multi-agent translation frameworks neglect key insights from cognitive translation studies, which are essential for improving translation quality. Method: The TACTIC framework comprises six functionally distinct agents that correspond to cognitive processes in human translation (such as drafting, refinement, and evaluation) and improves quality by simulating a theory-driven translation workflow. Result: On the FLORES-200 and WMT24 benchmarks, TACTIC outperforms GPT-4.1 and DeepSeek-R1, with average gains of +0.6 XCOMET/+1.18 COMETKIWI-23 and +0.84 XCOMET/+2.99 COMETKIWI-23, respectively. Conclusion: By combining cognitive theory with multi-agent collaboration, TACTIC significantly improves machine translation quality and offers a new direction for future research. Abstract: Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for Translation Agents with Cognitive-Theoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at https://github.com/weiyali126/TACTIC.

[128] Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens

Ziyang Ma,Qingyue Yuan,Zhenglin Wang,Deyu Zhou

Main category: cs.CL

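TL;DR: This paper studies the evaluation of meta-cognition in large language models (LLMs) and proposes the AutoMeco framework and the MIRA strategy to improve existing evaluation methods.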

Details

Motivation: Existing research mostly focuses on LLMs' ability to detect cognitive errors and lacks in-depth analysis of their meta-cognitive abilities (such as self-awareness), which are crucial to LLM reliability. Method: Proposes the AutoMeco framework for benchmarking existing meta-cognition measures, and designs MIRA, a training-free Markovian Intrinsic Reward Adjustment strategy, to strengthen them. Result: Experiments on three mathematical reasoning datasets and three LLMs support the reasonableness of AutoMeco and show that MIRA evaluates LLM meta-cognition more accurately. Conclusion: AutoMeco and MIRA provide effective tools for evaluating LLM meta-cognition and help improve LLM reliability. Abstract: Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.

[129] Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models

Jiaxiang Liu,Boxuan Xing,Chenhao Yuan,Chenxiang Zhang,Di Wu,Xiusheng Huang,Haida Yu,Chuhan Lang,Pengfei Cao,Jun Zhao,Kang Liu

Main category: cs.CL

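TL;DR: Know-MRI is an open-source tool for systematically analyzing the internal knowledge mechanisms of large language models (LLMs); its extensible core module automatically matches input data with interpretation methods and supports diagnosis from multiple perspectives.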

Details

Motivation: Current interpretation methods differ in input data formats and in their outputs, and tool support is limited, which restricts practical use. Know-MRI aims to solve these problems. Method: Develops an extensible core module that automatically matches input data with interpretation methods and consolidates the outputs. Result: Know-MRI lets users freely choose interpretation methods based on their inputs, making comprehensive multi-perspective diagnosis of a model's knowledge mechanisms easier. Conclusion: Know-MRI provides a flexible and practical tool for analyzing the knowledge mechanisms of LLMs; the code and a demonstration video are publicly available. Abstract: As large language models (LLMs) continue to advance, there is a growing urgency to enhance the interpretability of their internal knowledge mechanisms. Consequently, many interpretation methods have emerged, aiming to unravel the knowledge mechanisms of LLMs from various perspectives. However, current interpretation methods differ in input data formats and interpreting outputs. The tools integrating these methods are only capable of supporting tasks with specific inputs, significantly constraining their practical applications. To address these challenges, we present an open-source Knowledge Mechanisms Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms within LLMs systematically. Specifically, we have developed an extensible core module that can automatically match different input data with interpretation methods and consolidate the interpreting outputs. It enables users to freely choose appropriate interpretation methods based on the inputs, making it easier to comprehensively diagnose the model's internal knowledge mechanisms from multiple perspectives. Our code is available at https://github.com/nlpkeg/Know-MRI. We also provide a demonstration video on https://youtu.be/NVWZABJ43Bs.

[130] CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Ziqi Liu,Ziyang Zhou,Mingxuan Hu

Main category: cs.CL

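TL;DR: CAF-I is an LLM-based multi-agent framework that improves the accuracy and interpretability of irony detection through multidimensional analysis and collaborative optimization, outperforming existing methods in experiments.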

Details

Motivation: Existing LLM methods for irony detection suffer from single-perspective limitations, insufficient understanding, and a lack of interpretability. Method: CAF-I uses a multi-agent system in which Context, Semantics, and Rhetoric agents perform multidimensional analysis, and a Decision Agent and a Refinement Evaluator Agent consolidate and optimize the results. Result: On benchmark datasets, CAF-I achieves state-of-the-art zero-shot performance, with an average Macro-F1 of 76.31, a 4.98 improvement over the strongest prior baseline. Conclusion: By simulating human-like multi-perspective analysis, CAF-I significantly improves the accuracy and interpretability of irony detection. Abstract: Large language models (LLMs) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single-perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I's state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98 absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.
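
Since the headline result above is a Macro-F1 score, a small worked example of the metric may help. This is a generic sketch of the standard definition (the unweighted mean of per-class F1), not code from the paper; the labels and predictions are made-up toy data, and the paper's figure is presumably on a 0 to 100 scale and averaged over several datasets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


gold = ["ironic", "ironic", "literal", "literal", "literal"]
pred = ["ironic", "literal", "literal", "literal", "ironic"]
score = macro_f1(gold, pred)  # per-class F1: ironic 1/2, literal 2/3
```

Because each class contributes equally, Macro-F1 penalizes a detector that does well only on the majority class, which matters when ironic examples are rare.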

[131] Low-resource domain adaptation while minimizing energy and hardware resource consumption

Hernán Maina,Nicolás Wolovick,Luciana Benotti

Main category: cs.CL

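TL;DR: The study examines how different numerical precisions and data parallelization strategies affect LLM training speed and model accuracy, with the aim of supporting domain adaptation in low-resource environments.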

Details

Motivation: Training large language models (LLMs) is costly and tends to reflect predominant cultures and values; domain adaptation can help but demands substantial compute. Method: Evaluates how different numerical precisions and data parallelization strategies affect training speed and model accuracy. Result: The findings offer practical guidance for settings constrained by energy efficiency, accessibility, or hardware. Conclusion: By optimizing the training strategy, effective domain adaptation is achievable in low-resource environments. Abstract: Training Large Language Models (LLMs) is costly in terms of energy, hardware, and annotated data, often resulting in a positionality rooted in predominant cultures and values (Santy et al., 2023). Domain adaptation has emerged as a promising strategy to better align models with diverse cultural and value contexts (Hershcovich et al., 2022), but its computational cost remains a significant barrier, particularly for research groups lacking access to large-scale infrastructure. In this paper, we evaluate how the use of different numerical precisions and data parallelization strategies impacts both training speed (as a proxy to energy and hardware consumption) and model accuracy, with the goal of facilitating domain adaptation in low-resource environments. Our findings are relevant to any setting where energy efficiency, accessibility, or limited hardware availability are key concerns.

[132] Olica: Efficient Structured Pruning of Large Language Models without Retraining

Jiujun He,Huazhen Lin

Main category: cs.CL

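TL;DR: Olica is a retraining-free pruning framework for large language models that compresses models through orthogonal decomposition and linear calibration while preserving accuracy.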

Details

Motivation: Existing pruning methods need substantial compute and data for retraining, which makes them expensive. Method: Applies PCA to the matrix products in the multi-head attention layers, and applies SVD to the solution of a least-squares problem to obtain low-rank matrices that reduce the error of pruned layers. Result: Olica is efficient in data usage, GPU memory, and running time, and outperforms baselines. Conclusion: Olica offers an efficient pruning solution that requires no retraining. Abstract: Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of pruned layers using low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
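
The linear calibration step described in the abstract above can be written out as a short derivation. This is only one reading of "SVD on the solution of the least-squares problem": the symbols $X$ (calibration inputs to the pruned layer), $Y$ (original layer outputs), $\hat{Y}$ (pruned layer outputs), and the rank $r$ are notation assumed here rather than taken from the paper, and $X^{\top}X$ is assumed invertible.

```latex
% Residual error of the pruned layer on the calibration inputs
E = Y - \hat{Y}

% Least-squares fit of the residual as a linear function of the inputs
W^{\star} = \arg\min_{W} \lVert E - XW \rVert_F^2 = (X^{\top}X)^{-1} X^{\top} E

% Rank-r truncated SVD of the solution gives two thin matrices
W^{\star} \approx U_r \Sigma_r V_r^{\top}

% Calibrated output: pruned output plus a low-rank correction
\hat{Y}_{\mathrm{cal}} = \hat{Y} + X \,(U_r \Sigma_r)\, V_r^{\top}
```

Under this reading the correction costs two thin matrix multiplications per layer and is obtained in closed form, which is consistent with the claim that no retraining is needed.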

[133] Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning

Fengjun Pan,Anh Tuan Luu,Xiaobao Wu

Main category: cs.CL

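TL;DR: U-CoT+ is a new framework that converts visual memes into textual descriptions and combines them with human-guided zero-shot chain-of-thought (CoT) prompting for efficient, flexible, and explainable harmful meme detection.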

Details

Motivation: Current harmful meme detection methods fall short on resource efficiency, flexibility, and explainability, which limits practical deployment. Method: Builds a high-fidelity meme-to-text pipeline and combines it with human-guided zero-shot CoT prompting, using general-purpose LLMs for classification. Result: Experiments on seven benchmark datasets validate the framework and show its potential with small-scale LLMs. Conclusion: U-CoT+ provides an explainable, low-resource solution for harmful meme detection with high flexibility and adaptability. Abstract: Detecting harmful memes is essential for maintaining the integrity of online environments. However, current approaches often struggle with resource efficiency, flexibility, or explainability, limiting their practical deployment in content moderation systems. To address these challenges, we introduce U-CoT+, a novel framework for harmful meme detection. Instead of relying solely on prompting or fine-tuning multimodal models, we first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content and enabling resource-efficient harmful meme detection with general large language models (LLMs). Building on these textual descriptions, we further incorporate targeted, interpretable human-crafted guidelines to guide models' reasoning under zero-shot CoT prompting. As such, this framework allows for easy adaptation to different harmfulness detection criteria across platforms, regions, and over time, offering high flexibility and explainability. Extensive experiments on seven benchmark datasets validate the effectiveness of our framework, highlighting its potential for explainable and low-resource harmful meme detection using small-scale LLMs. Codes and data are available at: https://anonymous.4open.science/r/HMC-AF2B/README.md.

[134] Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$

Chihiro Taguchi,Seiji Maekawa,Nikita Bhutani

Main category: cs.CL

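TL;DR: Adaptive-$k$ retrieval is a single-pass adaptive retrieval method that dynamically selects the number of context passages, improving the efficiency and accuracy of question answering.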

Details

Motivation: Address the problem that a fixed retrieval size either wastes resources or omits key evidence, and improve open-domain question answering. Method: Adaptively selects the number of passages from the distribution of similarity scores between the query and candidate passages, with no fine-tuning or extra LLM inference. Result: On factoid and aggregation QA benchmarks, it matches or outperforms fixed-$k$ baselines while using up to 10x fewer tokens than full-context input (a saving of up to 90%) and still retrieving 70% of relevant passages. Conclusion: Dynamically adjusting context size makes QA more efficient and accurate, and the approach works across multiple models. Abstract: Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address context limitations of LLMs in open-domain question answering (QA). However, the optimal external context to retrieve remains an open problem: fixing the retrieval size risks either wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. We present Adaptive-$k$ retrieval, a simple and effective single-pass method that adaptively selects the number of passages based on the distribution of the similarity scores between the query and the candidate passages. It does not require model fine-tuning, extra LLM inferences or changes to existing retriever-reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10x fewer tokens than full-context input, yet still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
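
One way to choose $k$ from the score distribution, as described above, is sketched below. It assumes the cut is placed at the largest drop between neighbouring scores after sorting; the abstract does not state the exact rule, so this is a plausible reading rather than the authors' algorithm. The function name `adaptive_k`, the `min_k` parameter, and the toy scores are hypothetical.

```python
def adaptive_k(scores, min_k=1):
    """Return how many top-ranked passages to keep.

    Sorts similarity scores in descending order and cuts at the largest
    drop between neighbours, keeping at least `min_k` passages.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= min_k:
        return len(ranked)
    # gaps[i] is the drop between the passage at rank i and the one at rank i + 1.
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Only consider cut points that keep at least min_k passages.
    best = max(range(min_k - 1, len(gaps)), key=lambda i: gaps[i])
    return best + 1


# Three passages score near 0.8; the rest fall off sharply.
toy_scores = [0.82, 0.79, 0.31, 0.77, 0.28, 0.25]
k = adaptive_k(toy_scores)
```

The rule needs only one pass over scores the retriever already produced, which matches the claim of no fine-tuning and no extra LLM calls. It can misfire when scores decay smoothly with no clear gap, so a real system would likely add a floor or cap on $k$.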

[135] Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models

Huixuan Zhang,Xiaojun Wan

Main category: cs.CL

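TL;DR: This paper points out shortcomings in existing evaluation frameworks for text-to-image generation and proposes recommendations for improvement.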

Details

Motivation: Existing evaluations focus mainly on agreement with human judgments and neglect other critical properties of an evaluation framework. Method: Identifies two key aspects of reliable evaluation and empirically analyzes where mainstream frameworks fall short. Result: Current mainstream evaluation frameworks do not fully satisfy these properties. Conclusion: Offers recommendations for improving image-text alignment evaluation. Abstract: Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.

[136] Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models

Sumanth Manduru,Carlotta Domeniconi

Main category: cs.CL

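TL;DR: The paper presents the first large-scale audit of instruction-tuned small language models (SLMs) with 0.5 to 5 billion parameters, revealing how fairness relates to performance and offering guidance for ethical deployment in resource-constrained settings.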

Details

Motivation: As small language models (SLMs) are rapidly deployed on resource-constrained devices, their ethical risks remain understudied; this paper aims to fill that gap. Method: Evaluates nine open-source models (from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families) on the BBQ benchmark under zero-shot prompting, analyzing utility and fairness in ambiguous and disambiguated contexts. Result: Performance and fairness can coexist (as with the Phi models); bias varies markedly by architecture (for example, the vacuous neutrality of Qwen 2.5 versus the stereotypical bias of LLaMA 3.2); and quantization introduces nuanced trade-offs. Conclusion: The study offers practical guidance for the responsible deployment of SLMs, particularly for small enterprises and resource-constrained environments. Abstract: The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters, an overlooked "middle tier" between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.

[137] EtiCor++: Towards Understanding Etiquettical Bias in LLMs

Ashutosh Dwivedi,Siddhant Shivdutt Singh,Ashutosh Modi

Main category: cs.CL

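TL;DR: The paper introduces the EtiCor++ corpus for evaluating LLMs' understanding of, and bias regarding, etiquettes worldwide, and reveals an inherent bias in LLMs towards certain regions.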

Details

Motivation: Because etiquettes are region-specific and an essential part of culture, studying the etiquette sensitivity of LLMs is important, yet evaluation resources are lacking. Method: Presents the EtiCor++ corpus, designs tasks for evaluating LLMs' knowledge of etiquettes, and introduces metrics for measuring bias. Result: Experiments show that LLMs have an inherent bias towards certain regions. Conclusion: EtiCor++ is an important resource for evaluating the etiquette sensitivity and bias of LLMs and points to directions for improvement. Abstract: In recent years, researchers have started analyzing the cultural sensitivity of LLMs. In this respect, etiquettes have been an active area of research. Etiquettes are region-specific and are an essential part of the culture of a region; hence, it is imperative to make LLMs sensitive to etiquettes. However, more resources are needed for evaluating LLMs on their understanding of, and bias with regard to, etiquettes. In this resource paper, we introduce EtiCor++, a corpus of etiquettes worldwide. We introduce different tasks for evaluating LLMs for knowledge about etiquettes across various regions. Further, we introduce various metrics for measuring bias in LLMs. Extensive experimentation with LLMs shows inherent bias towards certain regions.

[138] Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework

Xiao Wei,Xiaobao Wang,Ning Zhuang,Chenyang Wang,Longbiao Wang,Jianwu Dang

Main category: cs.CL

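TL;DR: The paper proposes a consistency-driven prototype-prompting framework for Generalized Intent Discovery (GID) that improves performance by integrating old and new knowledge.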

Details

Motivation: Existing GID methods focus only on clustering unsupervised data and neglect domain adaptation, which limits practical use. Method: Proposes a prototype-prompting framework (for transferring old knowledge from external sources) and a hierarchical consistency constraint (for learning new knowledge from target domains). Result: Experiments show that the method significantly outperforms baseline methods and achieves state-of-the-art results. Conclusion: The method is effective and generalizes well; the code is open source. Abstract: Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our methods. Our source code is publicly available at https://github.com/smileix/cpp.

[139] DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs

Arie Cattan,Alon Jacovi,Ori Ram,Jonathan Herzig,Roee Aharoni,Sasha Goldshtein,Eran Ofek,Idan Szpektor,Avi Caciularu

Main category: cs.CL

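TL;DR: This paper proposes a taxonomy of knowledge conflicts in RAG and builds the CONFLICTS benchmark; experiments show that LLMs handle conflicts poorly, but prompting them to reason about conflicts improves results.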

Details

Motivation: Address conflicting retrieved information in RAG and clarify how models should handle each type of conflict. Method: Proposes a taxonomy of knowledge conflicts, builds the CONFLICTS benchmark, and evaluates LLMs experimentally. Result: LLMs often struggle to resolve conflicts, but prompting them to reason explicitly about potential conflicts significantly improves response quality. Conclusion: Future research needs to further improve how LLMs handle knowledge conflicts. Abstract: Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models (LLMs) with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information and it remains unclear how models should address such discrepancies. In this work, we first propose a novel taxonomy of knowledge conflict types in RAG, along with the desired model behavior for each type. We then introduce CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting. CONFLICTS is the first benchmark that enables tracking progress on how models address a wide range of knowledge conflicts. We conduct extensive experiments on this benchmark, showing that LLMs often struggle to appropriately resolve conflicts between sources. While prompting LLMs to explicitly reason about the potential conflict in the retrieved documents significantly improves the quality and appropriateness of their responses, substantial room for improvement in future research remains.

[140] CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations

Divyaksh Shukla,Ritesh Baviskar,Dwijesh Gohil,Aniket Tiwari,Atul Shree,Ashutosh Modi

Main category: cs.CL

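TL;DR: The paper introduces the CoMuMDR corpus for discourse parsing of multi-modal, multi-domain, code-mixed conversations and tests existing models on it, finding that they perform poorly.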

Details

Motivation: Current discourse parsing datasets are limited to written English dialogues in a single domain; multi-modal, multi-domain, code-mixed corpora are lacking. Method: Builds CoMuMDR, a multi-modal corpus code-mixed in Hindi and English, annotated with nine discourse relations. Result: Existing models perform poorly on CoMuMDR, highlighting the challenges of multi-domain code-mixed corpora. Conclusion: Better models are needed for discourse parsing of multi-modal, multi-domain, code-mixed conversations. Abstract: Discourse parsing is an important task useful for NLU applications such as summarization, machine comprehension, and emotion recognition. The current discourse parsing datasets based on conversations consist of written English dialogues restricted to a single domain. In this resource paper, we introduce CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations. The corpus (code-mixed in Hindi and English) has both audio and transcribed text and is annotated with nine discourse relations. We experiment with various SoTA baseline models; the poor performance of SoTA models highlights the challenges of multi-domain code-mixed corpus, pointing towards the need for developing better models for such realistic settings.

[141] Efficient Post-Training Refinement of Latent Reasoning in Large Language Models

Xinyuan Wang,Dongjie Wang,Wangyang Ying,Haoyue Bai,Nanxu Gong,Sixun Dong,Kunpeng Liu,Yanjie Fu

Main category: cs.CL

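TL;DR: Proposes a lightweight post-training framework that refines latent reasoning trajectories through contrastive reasoning feedback and residual embedding refinement, significantly improving model reasoning.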

Details

Motivation: Address the fixed reasoning trajectory and explicit-output overhead of existing reasoning methods (such as Chain-of-Thought), and the challenge of efficiently updating reasoning embeddings in latent reasoning. Method: Uses contrastive reasoning feedback and residual embedding refinement to optimize latent reasoning trajectories. Result: Performs well on five reasoning benchmarks, with a 5% accuracy gain on MathQA. Conclusion: The proposed framework improves model reasoning without additional training. Abstract: Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from substantial token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model's latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies are conducted on five reasoning benchmarks to demonstrate the effectiveness of the proposed framework. Notably, the framework yields a 5% accuracy gain on MathQA without additional training.

[142] Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?

Tuukka Törö,Antti Suni,Juraj Šimko

Main category: cs.CL

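TL;DR: The study uses machine learning to analyze speech embeddings and explore relationships among 106 world languages, finding that embedding distances align closely with traditional language distances and open a new route to large-scale language analysis.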

Details

Motivation: Traditional analyses of language relationships depend on expert labor, whereas machine learning methods can analyze linguistic variation directly from speech embeddings, enabling large-scale data-driven research. Method: Extracts speech embeddings with a fine-tuned XLS-R self-supervised language identification model, clusters them with linear discriminant analysis (LDA), and compares the result with genealogical, lexical, and geographical distances. Result: Embedding distances align closely with traditional language distances and capture both global and local typological patterns, although visualizing language relationships remains challenging. Conclusion: The method offers a new perspective for studying low-resource languages and can be extended to underrepresented languages and to sociolinguistic variation. Abstract: Investigating linguistic relationships on a global scale requires analyzing diverse features such as syntax, phonology and prosody, which evolve at varying rates influenced by internal diversification, language contact, and sociolinguistic factors. Recent advances in machine learning (ML) offer complementary alternatives to traditional historical and typological approaches. Instead of relying on expert labor in analyzing specific linguistic features, these new methods enable the exploration of linguistic variation through embeddings derived directly from speech, opening new avenues for large-scale, data-driven analyses. This study employs embeddings from the fine-tuned XLS-R self-supervised language identification model voxlingua107-xls-r-300m-wav2vec, to analyze relationships between 106 world languages based on speech recordings. Using linear discriminant analysis (LDA), language embeddings are clustered and compared with genealogical, lexical, and geographical distances. The results demonstrate that embedding-based distances align closely with traditional measures, effectively capturing both global and local typological patterns. Challenges in visualizing relationships, particularly with hierarchical clustering and network-based methods, highlight the dynamic nature of language change. The findings show potential for scalable analyses of language variation based on speech embeddings, providing new perspectives on relationships among languages. By addressing methodological considerations such as corpus size and latent space dimensionality, this approach opens avenues for studying low-resource languages and bridging macro- and micro-level linguistic variation. Future work aims to extend these methods to underrepresented languages and integrate sociolinguistic variation for a more comprehensive understanding of linguistic diversity.
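
The comparison of embedding-based distances with reference distances described above can be sketched as a correlation between two sets of pairwise distances. This assumes each language is already summarized by one low-dimensional vector (for example after LDA) and uses a plain Pearson correlation; the abstract does not say which statistic the authors use. The language names, vectors, and reference distances below are made-up toy values.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of languages."""
    names = sorted(vectors)
    return {(a, b): math.dist(vectors[a], vectors[b]) for a, b in combinations(names, 2)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def distance_agreement(vectors, reference):
    """Correlate embedding distances with reference distances, pair by pair."""
    emb = pairwise_distances(vectors)
    pairs = sorted(emb)
    return pearson([emb[p] for p in pairs], [reference[p] for p in pairs])


# Hypothetical 2-D language vectors and hypothetical reference distances.
toy_vectors = {"lang_a": [0.0, 0.0], "lang_b": [1.0, 0.0], "lang_c": [4.0, 3.0]}
toy_reference = {("lang_a", "lang_b"): 0.2, ("lang_a", "lang_c"): 0.9, ("lang_b", "lang_c"): 0.8}
agreement = distance_agreement(toy_vectors, toy_reference)
```

Pairwise distances are not independent observations, so a permutation-based procedure such as a Mantel test is the usual way to attach significance to a correlation like this.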

[143] CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling

Yahan Li,Jifan Yao,John Bosco S. Bunyi,Adam C. Frank,Angel Hwang,Ruishan Liu

Main category: cs.CL

TL;DR: CounselBench是一个大规模基准测试，用于评估LLMs在心理健康支持中的表现，发现LLMs在质量上优于人类治疗师，但存在安全隐患。

Details

Motivation: Test how LLMs behave in realistic counseling scenarios and fill a gap in existing research. Method: Develops CounselBench, comprising expert evaluations and an adversarial dataset, to test LLMs along multiple dimensions. Result: LLMs often outperform human therapists in perceived quality but are frequently flagged by experts for safety concerns; LLM judges tend to overrate model responses. Conclusion: CounselBench provides a clinically grounded benchmark and improvement framework for high-stakes LLM mental health applications. Abstract: Large language models (LLMs) are increasingly proposed for use in mental health support, yet their behavior in realistic counseling scenarios remains largely untested. We introduce CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test LLMs in single-turn counseling. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of responses from GPT-4, LLaMA 3, Gemini, and online human therapists to real patient questions. Each response is rated along six clinically grounded dimensions, with written rationales and span-level annotations. We find that LLMs often outperform online human therapists in perceived quality, but experts frequently flag their outputs for safety concerns such as unauthorized medical advice. Follow-up experiments show that LLM judges consistently overrate model responses and overlook safety issues identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored counseling questions designed to trigger specific model issues. Evaluation across 2,880 responses from eight LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking and improving LLM behavior in high-stakes mental health settings.

[144] Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Liyan Xu,Zhenlin Su,Mo Yu,Jiangnan Li,Fandong Meng,Jie Zhou

Main category: cs.CL

TL;DR: 本文研究文本编码器在识别细粒度实体或事件时的局限性，提出中文评估数据集CapRetrieval，并通过数据生成策略优化编码器性能。

Details

Motivation: Existing text encoders perform poorly on fine-grained semantic matching, which causes dense retrieval to fail. Method: Introduces the CapRetrieval dataset and fine-tunes encoders with the proposed data generation strategies. Result: The fine-tuned encoders achieve the best performance on CapRetrieval, but a granularity dilemma is identified. Conclusion: The dataset, code, and models are publicly released as resources for research on fine-grained semantic matching. Abstract: This work focuses on an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within the semantics, resulting in failed dense retrieval on even simple cases. To examine such behaviors, we first introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions, and queries are phrases inquiring entities or events in various forms. Zero-shot evaluation suggests that encoders may fail on such fine-grained matching, regardless of training sources or model sizes. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, which obtains the best performance on CapRetrieval. Within this process, we further identify an issue of granularity dilemma, a challenge for embeddings to express fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.

[145] Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models

Shuzhou Yuan,Ercong Nie,Mario Tawfelis,Helmut Schmid,Hinrich Schütze,Michael Färber

Main category: cs.CL

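TL;DR: This paper studies how MBTI-based persona prompts affect hate speech detection by large language models (LLMs), finding that persona prompts cause substantial variation and must be designed carefully to ensure fairness.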

Details

Motivation: Hate speech detection is subjective, and the effect of personality traits on annotation behavior has not been studied sufficiently in LLMs; this paper aims to fill that gap. Method: Confirm through a human annotation survey that MBTI traits affect labeling, then prompt four open-source LLMs with MBTI personas and evaluate them on three hate speech datasets. Result: Persona prompts cause substantial variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. Conclusion: Persona prompts in LLM-based annotation workflows must be defined carefully to ensure fairness and alignment with human values. Abstract: Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.

[146] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

Minhae Oh,Jeonghye Kim,Nakyung Lee,Donggeon Seo,Taeuk Kim,Jungwoo Lee

Main category: cs.CL

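TL;DR: RAISE is a step-by-step retrieval-augmented framework for scientific reasoning that proceeds through problem decomposition, logical query generation, and logical retrieval, and it consistently outperforms baseline methods.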

Details

Motivation: Scientific reasoning requires long-chain reasoning, knowledge of domain-specific terminology, and adaptation to new findings; RAISE targets these challenges. Method: RAISE has three steps (problem decomposition, logical query generation, and logical retrieval) and retrieves logically relevant documents from an open corpus. Result: RAISE outperforms other baselines on scientific reasoning benchmarks, and the documents it retrieves are not only similar in domain but also more logically relevant. Conclusion: Retrieving logically relevant documents improves scientific reasoning. Abstract: Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from an in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. Our analysis shows that, unlike other baselines, RAISE retrieves documents that are not only similar in terms of the domain knowledge, but also logically more relevant.

[147] MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models

Son The Nguyen,Theja Tulabandhula

Main category: cs.CL

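TL;DR: MEMETRON recasts LLM decoding as a discrete black-box optimization problem and uses the hybrid metaheuristic algorithms GENETRON and ANNETRON to discover high-reward responses efficiently without retraining the model.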

Details

Motivation: Existing LLM decoding methods (such as greedy search, sampling, or reranking) do not explicitly optimize for task-specific objectives, which limits control. Method: MEMETRON treats LLM decoding as a discrete black-box optimization problem and searches the response space with the GENETRON and ANNETRON algorithms, guided by reward models and contextual operations performed by the LLM. Result: On the human preference alignment task, MEMETRON significantly outperforms standard decoding and reranking methods. Conclusion: MEMETRON shows the potential to improve alignment without model retraining, and it is modular and general. Abstract: Large language models (LLMs) are increasingly used for both open-ended and structured tasks, yet their inference-time behavior is still largely dictated by heuristic decoding strategies such as greedy search, sampling, or reranking. These methods provide limited control and do not explicitly optimize for task-specific objectives. We introduce MEMETRON, a task-agnostic framework that formulates LLM decoding as a discrete black-box optimization problem. MEMETRON leverages hybrid metaheuristic algorithms, GENETRON and ANNETRON, to search the response space, guided by reward models and contextual operations performed by the LLM itself. This approach enables efficient discovery of high-reward responses without requiring model retraining or gradient access. The framework is modular and generalizes across diverse tasks, requiring only a reward function and lightweight prompt templates. We evaluate our framework on the critical human preference alignment task and demonstrate that it significantly outperforms standard decoding and reranking methods, highlighting its potential to improve alignment without model retraining.
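To make the idea of decoding as black-box search concrete, the following is a minimal sketch of an elitist genetic search over candidate responses. It is not the GENETRON algorithm from the paper, whose details the abstract does not give. The reward function, the keyword set, and the string-based crossover and mutation operators are all hypothetical stand-ins for a reward model and for LLM-performed contextual operations.

```python
import random

KEYWORDS = {"polite", "concise", "accurate"}  # hypothetical reward targets


def toy_reward(text):
    """Hypothetical reward: keyword coverage minus a small length penalty."""
    words = text.split()
    return len(KEYWORDS & set(words)) - 0.01 * len(words)


def toy_crossover(a, b, rng):
    """Stand-in for an LLM merge step: first half of a plus second half of b."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def toy_mutate(text, rng):
    """Stand-in for an LLM rewrite step: replace one random word with a keyword."""
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(sorted(KEYWORDS))
    return " ".join(words)


def genetic_search(seeds, reward, crossover, mutate,
                   generations=5, population=6, seed=0):
    """Elitist genetic search; needs at least two seed responses."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(generations):
        # Keep the highest-reward candidates as parents, so the best
        # response found so far is never lost.
        pool.sort(key=reward, reverse=True)
        parents = pool[: max(2, population // 2)]
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pool = parents + children
    return max(pool, key=reward)


SEEDS = ["a polite reply", "short and concise", "an accurate answer here"]
best = genetic_search(SEEDS, toy_reward, toy_crossover, toy_mutate)
```

Because the parents are carried over unchanged, the returned response never scores below the best seed. Only the reward function and the two operators would need replacing to try a different task, which mirrors the modularity the abstract describes.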

[148] TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

Mingyu Zheng,Zhifan Feng,Jia Wang,Lanrui Wang,Zheng Lin,Yang Hao,Weiping Wang

Main category: cs.CL

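TL;DR: TableDreamer is a progressive, weakness-guided data synthesis framework for table instruction tuning that addresses the limited data diversity and data efficiency of existing LLM-based methods.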

Details

Motivation: When generating table instruction tuning data, existing LLM-based methods fail to explore the input space thoroughly and ignore the weaknesses of the target LLM, which limits data diversity and data efficiency. Method: Synthesize diverse tables and instructions as seed data, then iteratively explore the input space under the guidance of newly identified weakness data, which becomes the final training data for fine-tuning. Result: Across 10 tabular benchmarks, TableDreamer raises the average accuracy of Llama3.1-8B-instruct by 11.62 points (49.07% to 60.69%) and outperforms existing baselines. Conclusion: The progressive, weakness-guided approach improves both data quality for table instruction tuning and model performance. Abstract: Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62% (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer

[149] Summarization for Generative Relation Extraction in the Microbiome Domain

Oumaima El Khettari,Solen Quiniou,Samuel Chaffron

Main category: cs.CL

TL;DR: Generative relation extraction shows potential for intestinal microbiome research, but BERT-based models still perform better.

Details

Motivation: Study interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Method: Use large language models (LLMs) to summarize and refine the context, then extract relations via instruction-tuned generation. Result: Summarization reduces noise and improves generative relation extraction, but BERT-based models still perform better. Conclusion: Generative methods show potential for specialized low-resource domains but need further improvement. Abstract: We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resource settings.

[150] RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Yang Liu,Jiaqi Li,Zilong Zheng

Main category: cs.CL

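TL;DR: The paper proposes RuleReasoner, a reinforcement learning method that improves small reasoning models (SRMs) on rule-based reasoning tasks and uses domain-aware dynamic sampling to generalize across domains.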

Details

Motivation: Improve the generalization of small reasoning models on rule-based reasoning across diverse tasks and domains. Method: Introduce RuleReasoner, which combines reinforcement learning with domain-aware dynamic sampling that updates the sampling weights of each training batch. Result: RuleReasoner significantly outperforms frontier large reasoning models on in-distribution and out-of-distribution tasks and is more computationally efficient than prior dynamic sampling methods. Conclusion: RuleReasoner offers small reasoning models an efficient rule-based reasoning method that generalizes well. Abstract: Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% average points on eight ID tasks and $\Delta$10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
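The abstract says that sampling weights per domain are updated from historical rewards but does not give the update rule. The sketch below shows one plausible form under stated assumptions: domains with a lower mean historical reward receive a higher sampling weight through a softmax with a temperature. The function names, the softmax form, and the temperature value are illustrative choices, not the paper's formula.

```python
import math
import random


def update_weights(reward_history, temperature=0.5):
    """Map per-domain reward histories (rewards in [0, 1]) to sampling weights.

    Assumption: a domain's priority is 1 - mean reward, so domains the
    model currently handles poorly are sampled more often.
    """
    priority = {d: 1.0 - sum(r) / len(r) for d, r in reward_history.items()}
    norm = sum(math.exp(p / temperature) for p in priority.values())
    return {d: math.exp(p / temperature) / norm for d, p in priority.items()}


def sample_batch(domain_to_examples, weights, batch_size, rng):
    """Draw a training batch by first picking domains according to the weights."""
    domains = list(weights)
    picks = rng.choices(domains, weights=[weights[d] for d in domains], k=batch_size)
    return [rng.choice(domain_to_examples[d]) for d in picks]


# Hypothetical reward histories for two domains.
history = {"logic": [0.9, 0.8], "math": [0.2, 0.3]}
weights = update_weights(history)
batch = sample_batch(
    {"logic": ["l1", "l2"], "math": ["m1", "m2"]},
    weights,
    batch_size=8,
    rng=random.Random(0),
)
```

Recomputing the weights after each batch gives an online schedule that shifts toward weaker domains without a hand-tuned data mixture, which is the property the abstract highlights.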

[151] Brevity is the soul of sustainability: Characterizing LLM response lengths

Soham Poddar,Paramita Koley,Janardan Misra,Sanjay Podder,Navveen Balani,Niloy Ganguly,Saptarshi Ghosh

Main category: cs.CL

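TL;DR: The paper studies energy consumption during large language model (LLM) inference and identifies output compression as a relatively unexplored direction for improving energy efficiency. After benchmarking and quality assessment, it explores prompt-engineering strategies that shorten responses and achieve 25-60% energy savings.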

Details

Motivation: LLM inference consumes a large amount of energy, and output compression remains relatively unexplored as an optimization; the paper aims to improve energy efficiency by shortening responses. Method: Benchmark 12 decoder-only LLMs on 5 datasets, assess response quality, and define six information categories; then explore prompt-engineering strategies that reduce response length. Result: Appropriate prompts reduce energy consumption by 25-60% while preserving response quality. Conclusion: Prompt-engineering strategies effectively shorten LLM responses and improve energy efficiency, pointing to a direction for future work. Abstract: A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence developing energy-efficient methods for inference is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in LLM responses. We show that LLMs often tend to include redundant or additional information besides the minimal answer. To address this issue of long responses by LLMs, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy optimization between 25-60% by reducing the response length while preserving the quality of LLM responses.

[152] ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Ruiran Su,Jiasheng Si,Zhijiang Guo,Janet B. Pierrehumbert

Main category: cs.CL

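TL;DR: ClimateViz is the first large-scale benchmark for scientific fact-checking on charts, containing 49,862 claims linked to 2,896 visualizations, and it is used to evaluate multimodal language models on chart-based reasoning.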

Details

Motivation: Scientific fact-checking has focused mainly on text and tables and has overlooked scientific charts. Method: Introduce the ClimateViz dataset, evaluate multimodal language models in zero-shot and few-shot settings, and provide knowledge graph explanations. Result: Current models struggle with chart-based reasoning; the best models reach only 76.2 to 77.8 percent accuracy, far below human performance (89.3 and 92.7 percent). Conclusion: Chart-based reasoning still needs improvement, and ClimateViz provides data and code to support future research. Abstract: Scientific fact-checking has mostly focused on text and tables, overlooking scientific charts, which are key for presenting quantitative evidence and statistical reasoning. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking using expert-curated scientific charts. ClimateViz contains 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. To improve interpretability, each example includes structured knowledge graph explanations covering trends, comparisons, and causal relations. We evaluate state-of-the-art multimodal language models, including both proprietary and open-source systems, in zero-shot and few-shot settings. Results show that current models struggle with chart-based reasoning: even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to 77.8 percent accuracy in label-only settings, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs improve performance in some models. We released our dataset and code alongside the paper.

[153] ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Large Language Model Preference Optimization

Hee Suk Yoon,Eunseop Yoon,Mark A. Hasegawa-Johnson,Sungwoong Kim,Chang D. Yoo

Main category: cs.CL

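TL;DR: ConfPO is a preference learning method that focuses optimization on preference-critical tokens, requires no auxiliary models or extra compute, and outperforms methods that adjust all tokens uniformly.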

Details

Motivation: Existing methods such as DPO adjust all tokens uniformly, which can be inefficient or lead to overoptimization; ConfPO aims to improve alignment quality through targeted optimization. Method: ConfPO identifies and optimizes preference-critical tokens based on the training policy's confidence, without additional models or compute. Result: ConfPO outperforms uniform methods on multiple benchmarks with no additional computational overhead. Conclusion: ConfPO is an efficient, lightweight preference learning method that improves alignment. Abstract: We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to preference, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, raising concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs across various LLMs, delivering better alignment with zero additional computational overhead.
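The abstract does not state how confidence is turned into a token selection, so the following is only a sketch of the general idea. It assumes that a token counts as preference-critical when the policy's probability for it is at or below the sequence mean, and it restricts the average log-probability to those tokens. The threshold rule and the function names are assumptions made for illustration.

```python
import math


def select_low_confidence_tokens(token_probs):
    """Return indices of tokens whose policy probability is <= the sequence mean.

    Assumption: low policy confidence marks a token as preference-critical.
    At least one index is always returned, because the minimum of a
    sequence never exceeds its mean.
    """
    mean_p = sum(token_probs) / len(token_probs)
    return [i for i, p in enumerate(token_probs) if p <= mean_p]


def masked_avg_logprob(token_probs, selected):
    """Average log-probability over the selected tokens only."""
    return sum(math.log(token_probs[i]) for i in selected) / len(selected)


# Hypothetical per-token probabilities of one response under the policy.
probs = [0.9, 0.2, 0.8, 0.1]
selected = select_low_confidence_tokens(probs)
score = masked_avg_logprob(probs, selected)
```

A preference loss would then compare such masked scores for the chosen and rejected responses instead of scores over all tokens. The selection uses only probabilities the policy already computes, which is consistent with the abstract's claim of zero additional overhead.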

[154] Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Fariz Ikhwantri,Dusica Marijan

Main category: cs.CL

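TL;DR: The paper proposes EXCLAIM, a compliance detection approach based on natural language inference (NLI) that uses multi-hop inference for explainable and traceable compliance detection and generates assurance cases with large language models.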

Details

Motivation: Compliance detection for complex systems is hindered by complicated legal and technical texts, the need for model explanations, and limited assurance case data. Method: Formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference task, generate assurance cases with large language models, and introduce metrics for coverage and structural consistency. Result: A case study on GDPR requirements demonstrates the effectiveness of the generated assurance cases in a multi-hop inference task. Conclusion: NLI-based approaches have potential for automating compliance detection. Abstract: Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.

[155] Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition

Mehedi Hasan Bijoy,Dejan Porjazovski,Tamás Grósz,Mikko Kurimo

Main category: cs.CL

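TL;DR: The paper proposes a multilingual speech emotion recognition method based on multi-teacher knowledge distillation that advances emotion recognition in English, Finnish, and French.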

Details

Motivation: Despite progress in monolingual speech emotion recognition, building a multilingual system remains challenging; the goal is to train a single model that handles multilingual emotion recognition. Method: Use language-aware multi-teacher knowledge distillation, building monolingual teacher models on Wav2Vec2.0 and distilling their knowledge into one multilingual student model. Result: The student model reaches a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines. Conclusion: The method excels at improving recall for sad and neutral emotions but still struggles with anger and happiness. Abstract: Speech Emotion Recognition (SER) is crucial for improving human-computer interaction. Despite strides in monolingual SER, extending them to build a multilingual system remains challenging. Our goal is to train a single model capable of multilingual SER by distilling knowledge from multiple teacher models. To address this, we introduce a novel language-aware multi-teacher knowledge distillation method to advance SER in English, Finnish, and French. It leverages Wav2Vec2.0 as the foundation of monolingual teacher models and then distills their knowledge into a single multilingual student model. The student model demonstrates state-of-the-art performance, with a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines. Our method excels in improving recall for sad and neutral emotions, although it still faces challenges in recognizing anger and happiness.

[156] Improved LLM Agents for Financial Document Question Answering

Nelvin Tan,Zian Seng,Liang Zhang,Yu-Ching Shih,Dong Yang,Amol Salunkhe

Main category: cs.CL

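TL;DR: The paper shows that the traditional critic agent degrades when oracle labels are unavailable and proposes an improved critic agent together with a calculator agent, which outperform the previous state-of-the-art approach and are safer.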

Details

Motivation: Large language models struggle with numerical question answering over financial documents, and critic agents degrade when oracle labels are unavailable, so improvements are needed. Method: Propose an improved critic agent and a calculator agent, and analyze how interaction between the agents affects performance. Result: The improved approach outperforms the previous program-of-thought approach and is safer. Conclusion: The improved critic and calculator agents perform better without oracle labels, and the interaction between agents affects performance. Abstract: Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with the calculator agent which outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.

[157] Societal AI Research Has Become Less Interdisciplinary

Dror Kris Markus,Fabrizio Gilardi,Daria Stetsenko

Main category: cs.CL

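TL;DR: An analysis of over 100,000 AI papers finds that, although interdisciplinary teams remain more likely to address societal and ethical concerns, computer science-only teams account for a growing share of societally oriented research.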

Details

Motivation: Examine how ethical and societal values are integrated into AI research and what role interdisciplinary collaboration actually plays. Method: Develop a classifier to identify societal content in AI papers posted on arXiv between 2014 and 2024, and relate that content to team composition. Result: Interdisciplinary teams remain more likely to address societal concerns, but computer science-only teams account for a growing share of societally oriented research, spanning domains from fairness and safety to healthcare and misinformation. Conclusion: The findings challenge the assumption that interdisciplinary collaboration drives societal AI research and raise new questions about AI safety and governance and about the role of social science scholars. Abstract: As artificial intelligence (AI) systems become deeply embedded in everyday life, calls to align AI development with ethical and societal values have intensified. Interdisciplinary collaboration is often championed as a key pathway for fostering such engagement. Yet it remains unclear whether interdisciplinary research teams are actually leading this shift in practice. This study analyzes over 100,000 AI-related papers published on arXiv between 2014 and 2024 to examine how ethical values and societal concerns are integrated into technical AI research. We develop a classifier to identify societal content and measure the extent to which research papers express these considerations. We find a striking shift: while interdisciplinary teams remain more likely to produce societally-oriented research, computer science-only teams now account for a growing share of the field's overall societal output. These teams are increasingly integrating societal concerns into their papers and tackling a wide range of domains - from fairness and safety to healthcare and misinformation. These findings challenge common assumptions about the drivers of societal AI and raise important questions. First, what are the implications for emerging understandings of AI safety and governance if most societally-oriented research is being undertaken by exclusively technical teams? Second, for scholars in the social sciences and humanities: in a technical field increasingly responsive to societal demands, what distinctive perspectives can we still offer to help shape the future of AI?

[158] Towards Secure and Private Language Models for Nuclear Power Plants

Muhammad Anwar,Mishca de Costa,Issam Hammad,Daniel Lau

Main category: cs.CL

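TL;DR: The paper introduces a domain-specific large language model for nuclear applications, built from the publicly accessible Essential CANDU textbook with a compact Transformer architecture and trained on a single GPU to protect sensitive data. The model captures specialized nuclear vocabulary well, but its generated text sometimes lacks syntactic coherence.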

Details

Motivation: Develop an in-house large language model for the nuclear domain that meets rigorous cybersecurity and data confidentiality standards, addressing both sensitive-data protection and the need for specialized vocabulary. Method: Train a Transformer-based model on the Essential CANDU textbook using a single GPU, focusing exclusively on nuclear content. Result: The model captures specialized nuclear vocabulary, though its generated text is sometimes syntactically incoherent; this demonstrates the feasibility of in-house LLM solutions. Conclusion: Future work should extend the dataset, refine preprocessing, and apply instruction fine-tuning to improve domain accuracy, and should evaluate the model's readiness for real-world nuclear applications. Abstract: This paper introduces a domain-specific Large Language Model for nuclear applications, built from the publicly accessible Essential CANDU textbook. Drawing on a compact Transformer-based architecture, the model is trained on a single GPU to protect the sensitive data inherent in nuclear operations. Despite relying on a relatively small dataset, it shows encouraging signs of capturing specialized nuclear vocabulary, though the generated text sometimes lacks syntactic coherence. By focusing exclusively on nuclear content, this approach demonstrates the feasibility of in-house LLM solutions that align with rigorous cybersecurity and data confidentiality standards. Early successes in text generation underscore the model's utility for specialized tasks, while also revealing the need for richer corpora, more sophisticated preprocessing, and instruction fine-tuning to enhance domain accuracy. Future directions include extending the dataset to cover diverse nuclear subtopics, refining tokenization to reduce noise, and systematically evaluating the model's readiness for real-world applications in the nuclear domain.

[159] Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data

Muhammad Anwar,Daniel Lau,Mishca de Costa,Issam Hammad

Main category: cs.CL

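TL;DR: The paper explores how synthetic data generation can make unstructured text data in the nuclear industry usable for large language model applications.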

Details

Motivation: The nuclear industry holds large amounts of unstructured text data that cannot be used directly for LLM tasks requiring structured question-answer pairs; synthetic data can fill this gap. Method: Use LLMs to analyze text, extract key information, generate relevant questions, and evaluate the quality of the resulting synthetic dataset. Result: Synthetic data generation turns existing text into usable Q&A pairs that support LLM applications in the nuclear industry. Conclusion: Synthetic data opens new possibilities for information retrieval, knowledge sharing, and decision-making in the nuclear industry. Abstract: The nuclear industry possesses a wealth of valuable information locked away in unstructured text data. This data, however, is not readily usable for advanced Large Language Model (LLM) applications that require clean, structured question-answer pairs for tasks like model training, fine-tuning, and evaluation. This paper explores how synthetic data generation can bridge this gap, enabling the development of robust LLMs for the nuclear domain. We discuss the challenges of data scarcity and privacy concerns inherent in the nuclear industry and how synthetic data provides a solution by transforming existing text data into usable Q&A pairs. This approach leverages LLMs to analyze text, extract key information, generate relevant questions, and evaluate the quality of the resulting synthetic dataset. By unlocking the potential of LLMs in the nuclear industry, synthetic data can pave the way for improved information retrieval, enhanced knowledge sharing, and more informed decision-making in this critical sector.

[160] Factors affecting the in-context learning abilities of LLMs for dialogue state tracking

Pradyoth Hegde,Santosh Kesiraju,Jan Švec,Šimon Sedláček,Bolaji Yusuf,Oldřich Plchot,Deepak K T,Jan Černocký

Main category: cs.CL

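TL;DR: The study examines in-context learning (ICL) for dialogue state tracking (DST), analyzes the factors that influence its effectiveness, and selects demonstrations with a k-nearest-neighbour method over sentence embeddings.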

Details

Motivation: Explore how effective ICL is for DST and which factors influence it, in order to improve dialogue state tracking. Method: Select demonstrations with a k-nearest-neighbour method over sentence embeddings, combine them with the test sample in a template as LLM input, and systematically analyze how demonstration selection and prompt context affect DST. Result: Experiments on the MultiWoZ2.4 dataset with models such as OLMo-7B-instruct yield practical insights into the in-context learning abilities of LLMs. Conclusion: The study shows the potential of LLM in-context learning for dialogue state tracking and identifies key factors that affect its performance. Abstract: This study explores the application of in-context learning (ICL) to the dialogue state tracking (DST) problem and investigates the factors that influence its effectiveness. We use a sentence embedding based k-nearest neighbour method to retrieve the suitable demonstrations for ICL. The selected demonstrations, along with the test samples, are structured within a template as input to the LLM. We then conduct a systematic study to analyse the impact of factors related to demonstration selection and prompt context on DST performance. This work is conducted using the MultiWoZ2.4 dataset and focuses primarily on the OLMo-7B-instruct, Mistral-7B-Instruct-v0.3, and Llama3.2-3B-Instruct models. Our findings provide several useful insights on in-context learning abilities of LLMs for dialogue state tracking.
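The retrieval and templating steps described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the embeddings are hand-written toy vectors standing in for the output of a sentence embedding model, and the prompt template, field names, and example dialogues are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_demonstrations(query_vec, pool, k=2):
    """Return the k pool items whose embedding is most similar to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(demos, test_dialogue):
    """Place retrieved demonstrations before the test dialogue (hypothetical template)."""
    parts = [f"Dialogue: {d['dialogue']}\nState: {d['state']}" for d in demos]
    parts.append(f"Dialogue: {test_dialogue}\nState:")
    return "\n\n".join(parts)


# Toy demonstration pool; "vec" would come from a sentence embedding model.
POOL = [
    {"id": "d1", "vec": [1.0, 0.0], "dialogue": "I need a cheap hotel.", "state": "hotel-price=cheap"},
    {"id": "d2", "vec": [0.0, 1.0], "dialogue": "Book a taxi at 5pm.", "state": "taxi-leave=17:00"},
    {"id": "d3", "vec": [0.9, 0.1], "dialogue": "A hotel in the north.", "state": "hotel-area=north"},
]
demos = retrieve_demonstrations([1.0, 0.05], POOL, k=2)
prompt = build_prompt(demos, "Find me an expensive hotel.")
```

The number of demonstrations `k`, their ordering, and the template wording are the kind of factors the study varies, so each is exposed as a separate argument or function here.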

[161] Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL

Mishca de Costa,Muhammad Anwar,Dave Mercier,Mark Randall,Issam Hammad

Main category: cs.CL

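TL;DR: The paper proposes function-calling large language models (LLMs) as an alternative to traditional natural language to SQL (NL-to-SQL) translation, improving the accuracy and safety of nuclear power plant data retrieval.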

Details

Motivation: In critical systems such as nuclear power plants, traditional NL-to-SQL approaches suffer from hard-to-validate queries and complex databases, which leads to inaccurate queries and loss of trust. Method: Define a set of pre-approved, purpose-specific functions that encapsulate validated SQL logic, and avoid generating SQL queries directly. Result: Compared with direct NL-to-SQL generation, the function-calling approach improves query accuracy and maintainability. Conclusion: The framework preserves ease of use while improving the safety and reliability of data retrieval in critical systems. Abstract: Retrieving operational data from nuclear power plants requires exceptional accuracy and transparency due to the criticality of the decisions it supports. Traditionally, natural language to SQL (NL-to-SQL) approaches have been explored for querying such data. While NL-to-SQL promises ease of use, it poses significant risks: end-users cannot easily validate generated SQL queries, and legacy nuclear plant databases -- often complex and poorly structured -- complicate query generation due to decades of incremental modifications. These challenges increase the likelihood of inaccuracies and reduce trust in the approach. In this work, we propose an alternative paradigm: leveraging function-calling large language models (LLMs) to address these challenges. Instead of directly generating SQL queries, we define a set of pre-approved, purpose-specific functions representing common use cases. Queries are processed by invoking these functions, which encapsulate validated SQL logic. This hybrid approach mitigates the risks associated with direct NL-to-SQL translations by ensuring that SQL queries are reviewed and optimized by experts before deployment. While this strategy introduces the upfront cost of developing and maintaining the function library, we demonstrate how NL-to-SQL tools can assist in the initial generation of function code, allowing experts to focus on validation rather than creation. Our study includes a performance comparison between direct NL-to-SQL generation and the proposed function-based approach, highlighting improvements in accuracy and maintainability. This work underscores the importance of balancing user accessibility with operational safety and provides a novel, actionable framework for robust data retrieval in critical systems.
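The contrast between generating SQL and invoking pre-approved functions can be sketched with Python's built-in `sqlite3` module. Everything specific here is hypothetical: the `readings` table, the `get_max_reading` function, and the shape of the function-call dictionary are invented for illustration, and the paper's actual function library is not described in the abstract.

```python
import sqlite3


def make_demo_db():
    """Create a small in-memory database with a hypothetical sensor table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("T-101", 5.0), ("T-101", 7.5), ("P-200", 3.2)],
    )
    return conn


def get_max_reading(conn, sensor_id):
    """Pre-approved function: the SQL is fixed, only the parameter varies."""
    row = conn.execute(
        "SELECT MAX(value) FROM readings WHERE sensor_id = ?", (sensor_id,)
    ).fetchone()
    return row[0]


# Registry of expert-reviewed functions the model is allowed to invoke.
APPROVED_FUNCTIONS = {"get_max_reading": get_max_reading}


def dispatch(conn, call):
    """Execute a function call of the form {"name": ..., "arguments": {...}}."""
    name = call["name"]
    if name not in APPROVED_FUNCTIONS:
        raise ValueError(f"function {name!r} is not pre-approved")
    return APPROVED_FUNCTIONS[name](conn, **call["arguments"])


conn = make_demo_db()
# A function-calling LLM would emit a structure like this instead of raw SQL.
result = dispatch(conn, {"name": "get_max_reading", "arguments": {"sensor_id": "T-101"}})
```

In this sketch the model only chooses a function name and arguments, so any SQL that runs has already been reviewed, and a request outside the registry fails loudly rather than producing an unvalidated query.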

[162] AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

Ahmed Hasanaath,Aisha Alansari,Ahmed Ashraf,Chafik Salmane,Hamzah Luqman,Saad Ezzini

Main category: cs.CL

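TL;DR: The paper benchmarks the reasoning abilities of large language models (LLMs) on Arabic data, with emphasis on DeepSeek models across fifteen Arabic NLP tasks, and finds that in-context example selection and fine-tuning strategy substantially improve performance.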

Details

Motivation: Arabic has rich morphology, diverse dialects, and a complex script, and LLM performance on it remains underexplored; the paper aims to fill this gap. Method: Systematically evaluate multiple reasoning-focused LLMs on Arabic tasks under zero-shot, few-shot, and fine-tuning (e.g., LoRA) strategies. Result: Careful selection of a few in-context examples substantially improves classification performance; DeepSeek models outperform GPT o4-mini in the zero-shot setting; LoRA fine-tuning is more effective than scaling up the model. Conclusion: Strategy choice and model architecture are crucial for LLM performance on Arabic tasks, and the findings offer practical guidance for future research. Abstract: Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299

[163] The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation

Francisco Vargas,Alejandro González Coene,Gaston Escalante,Exequiel Lobón,Manuel Pulido

Main category: cs.CL

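TL;DR: The paper proposes a two-step method, text segmentation followed by entity extraction, for extracting traffic accident information from legal documents; it compares a classic approach with a modern semantic-search approach and shows that fine-tuning reduces hallucinations.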

Details

Motivation: Extracting traffic accident information from legal documents is crucial for quantifying insurance company costs, but the subtle arguments and reasoning in these documents make the task hard even for experts. Method: Use a two-step procedure that first segments the document to identify relevant passages and then extracts entities; compare a classic method based on regular expressions with a modern method based on semantic search, and test several LLMs (LLaMA-2, LLaMA-3, and GPT-4 Turbo). Result: The approach based on semantic search and LLMs clearly surpasses the classic method (39.5% accuracy); fine-tuned LLaMA-2 70B reaches 79.4%, the base LLaMA-3 8B model comes close (76.6%), and GPT-4 Turbo performs best (86.1%). Conclusion: Modern methods, especially semantic search combined with fine-tuned LLMs, improve extraction accuracy, and LLaMA-3 and GPT-4 Turbo illustrate how quickly models are progressing. Abstract: The extraction of information about traffic accidents from legal documents is crucial for quantifying insurance company costs. Extracting entities such as percentages of physical and/or psychological disability and the involved compensation amounts is a challenging process, even for experts, due to the subtle arguments and reasoning in the court decision. A two-step procedure is proposed: first, segmenting the document and identifying the most relevant segments, and then extracting the entities. For text segmentation, two methodologies are compared: a classic method based on regular expressions and a second approach that divides the document into blocks of n-tokens, which are then vectorized using multilingual models for semantic searches (text-embedding-ada-002/MiniLM-L12-v2). Subsequently, large language models (LLaMA-2 7b, 70b, LLaMA-3 8b, and GPT-4 Turbo) are applied with prompting to the selected segments for entity extraction. For the LLaMA models, fine-tuning is performed using LoRA. LLaMA-2 7b, even with zero temperature, shows a significant number of hallucinations in extractions, which are an important contention point for named entity extraction. This work shows that these hallucinations are substantially reduced after finetuning the model. The performance of the methodology based on segment vectorization and subsequent use of LLMs significantly surpasses the classic method, which achieves an accuracy of 39.5%. Among open-source models, LLaMA-2 70B with finetuning achieves the highest accuracy, 79.4%, surpassing its base version at 61.7%. Notably, the base LLaMA-3 8B model already performs comparably to the finetuned LLaMA-2 70B model, achieving 76.6%, highlighting the rapid progress in model development. Meanwhile, GPT-4 Turbo achieves the highest accuracy at 86.1%.

[164] Advancing STT for Low-Resource Real-World Speech

Flavio D'Intino,Hans-Peter Hutter

Main category: cs.CL

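TL;DR: The paper introduces SRB-300, a 300-hour Swiss German speech corpus for improving speech recognition in low-resource languages. Fine-tuning Whisper models on it yields notable performance gains.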

Details

Motivation: Swiss German lacks a standardized written form, and models trained on existing datasets struggle with spontaneous conversational speech, so a more realistic dataset is needed. Method: Fine-tune multiple Whisper models on the SRB-300 dataset, which contains long-audio recordings from real-world environments. Result: After fine-tuning, word error rate (WER) improves by 19% to 33% and BLEU scores increase by 8% to 40%; the best model reaches a WER of 17.1% and a BLEU score of 74.8. Conclusion: The SRB-300 dataset and fine-tuning improve Swiss German speech recognition, which matters for low-resource languages more broadly. Abstract: Swiss German is a low-resource language represented by diverse dialects that differ significantly from Standard German and from each other, lacking a standardized written form. As a result, transcribing Swiss German involves translating into Standard German. Existing datasets have been collected in controlled environments, yielding effective speech-to-text (STT) models, but these models struggle with spontaneous conversational speech. This paper, therefore, introduces the new SRB-300 dataset, a 300-hour annotated speech corpus featuring real-world long-audio recordings from 39 Swiss German radio and TV stations. It captures spontaneous speech across all major Swiss dialects recorded in various realistic environments and overcomes the limitation of prior sentence-level corpora. We fine-tuned multiple OpenAI Whisper models on the SRB-300 dataset, achieving notable enhancements over previous zero-shot performance metrics. Improvements in word error rate (WER) ranged from 19% to 33%, while BLEU scores increased between 8% and 40%. The best fine-tuned model, large-v3, achieved a WER of 17.1% and a BLEU score of 74.8. This advancement is crucial for developing effective and robust STT systems for Swiss German and other low-resource languages in real-world contexts.
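Word error rate, the main metric reported above, is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain textbook implementation and assumes whitespace tokenization with no text normalization; the paper's exact scoring setup is not described in the abstract.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
example = wer("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.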

[165] AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)

Danush Khanna,Krishna Kumar,Basab Ghosh,Vinija Jain,Vasu Sharma,Aman Chadha,Amitava Das

Main category: cs.CL

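TL;DR: The paper exposes a geometric blind spot in LLM alignment and proposes the ALKALI benchmark and the GRACE framework to counter latent camouflage attacks, substantially reducing attack success rates.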

Details

Motivation: Current defenses cannot adapt quickly enough to adversarial threats against LLMs, latent camouflage attacks in particular. Method: Introduces the ALKALI benchmark to evaluate 21 LLMs and proposes the GRACE framework, which couples preference learning with latent-space regularization. Result: GRACE reduces the attack success rate (ASR) by up to 39%, and the AVQI metric is introduced to quantify latent alignment failure. Conclusion: GRACE and AVQI provide new tools for LLM safety and expose the vulnerability of models to latent camouflage attacks. Abstract: Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent, thereby evading surface-level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date, spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE (Geometric Representation Aware Contrastive Enhancement), an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry-aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.

[166] PlantBert: An Open Source Language Model for Plant Science

Hiba Khey,Amine Lakhder,Salma Rouichi,Imane El Ghabi,Kamal Hejjaoui,Younes En-nahli,Fahd Kalloubi,Moez Amri

Main category: cs.CL

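TL;DR: PlantBert is a domain-specific language model for plant science built on the DeBERTa architecture, focused on extracting structured knowledge from plant stress-response literature.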

Details

Motivation: Plant science lacks dedicated language-model tools, and PlantBert aims to fill this gap. Method: Built on the DeBERTa architecture, combined with rule-enhanced post-processing and ontology-grounded entity normalization, and fine-tuned on expert-annotated plant stress-response literature. Result: PlantBert performs well on entity recognition and on capturing biological relationships, and is suited to low-resource scientific domains. Conclusion: PlantBert offers a scalable solution for agricultural NLP and advances data-driven research in plant science. Abstract: The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantBert, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture, known for its disentangled attention and robust contextual encoding, PlantBert is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantBert to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantBert exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, PlantBert bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.

[167] From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis

Elias Horner,Cristinel Mateis,Guido Governatori,Agata Ciabattoni

Main category: cs.CL

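TL;DR: Proposes a new approach that uses large language models (LLMs) to automatically analyze the semantics of legal texts and transform them into formal representations in Defeasible Deontic Logic (DDL). A structured pipeline extracts deontic rules and evaluates their coherence; experiments show that effectively prompted LLMs can contribute significantly to scalable legal informatics.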

Details

Motivation: Automate the semantic analysis of legal texts to improve the efficiency and scalability of legal informatics. Method: A structured pipeline covering text segmentation, deontic rule extraction, and coherence evaluation, tested across several LLM configurations (prompt engineering, fine-tuned models, and multi-stage pipelines). Result: Machine-generated formalizations align well with expert-crafted ones, and LLMs perform well when prompted effectively. Conclusion: LLMs show promise for automated legal text analysis, especially with effective prompting strategies, and can contribute significantly to scalable legal informatics. Abstract: We present a novel approach to the automated semantic analysis of legal texts using large language models (LLMs), targeting their transformation into formal representations in Defeasible Deontic Logic (DDL). We propose a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. Our methodology is evaluated across various LLM configurations, including prompt engineering strategies, fine-tuned models, and multi-stage pipelines, focusing on legal norms from the Australian Telecommunications Consumer Protections Code. Empirical results demonstrate promising alignment between machine-generated and expert-crafted formalizations, showing that LLMs, particularly when prompted effectively, can significantly contribute to scalable legal informatics.

[168] Dialect Normalization using Large Language Models and Morphological Rules

Antonios Dimakis,John Pavlopoulos,Antonios Anastasopoulos

Main category: cs.CL

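TL;DR: Proposes a method combining rules and large language models to convert Greek dialects into the standard language without parallel data, with outputs validated through human evaluation.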

Details

Motivation: Natural language understanding systems struggle with low-resource languages, including dialects of high-resource languages; normalizing dialects to the standard language makes downstream tools usable. Method: Combine rule-based, linguistically informed transformations with few-shot prompting of large language models (LLMs), without requiring parallel data. Result: The method is applied to a dataset of Greek dialect proverbs and evaluated by human annotators; downstream experiments show that earlier results on these proverbs relied solely on superficial linguistic information, while new observations can still be made from the remaining semantics. Conclusion: The normalization method is effective and reveals what earlier studies on this data actually relied on. Abstract: Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it on a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.

[169] PropMEND: Hypernetworks for Knowledge Propagation in LLMs

Zeyu Leo Liu,Greg Durrett,Eunsol Choi

Main category: cs.CL

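TL;DR: PropMEND is a hypernetwork-based knowledge propagation method that modifies the gradients of a language modeling loss to encourage propagation, substantially improving multi-hop question answering.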

Details

Motivation: Existing knowledge editing techniques cannot support questions that require reasoning over the injected knowledge, so a method that propagates knowledge is needed. Method: A hypernetwork approach that extends the meta-objective of MEND, transforming gradient updates so that the model can answer multi-hop questions. Result: Strong performance on the RippleEdit dataset, with almost 2x accuracy on multi-hop questions; on the Controlled RippleEdit dataset it still outperforms existing approaches. Conclusion: PropMEND performs well at knowledge propagation, but there is room for improvement on unseen entity-relation pairs, and future work should extend propagation to a wider range of relations. Abstract: Knowledge editing techniques for large language models (LLMs) can inject knowledge that is later reproducible verbatim, but they fall short on propagating that knowledge: models cannot answer questions that require reasoning with the injected knowledge. We present a hypernetwork-based approach for knowledge propagation, named PropMEND, where we meta-learn how to modify gradients of a language modeling loss to encourage injected information to propagate. Our approach extends the meta-objective of MEND [29] so that gradient updates on knowledge are transformed to enable answering multi-hop questions involving that knowledge. We show improved performance on the RippleEdit dataset, showing almost 2x accuracy on challenging multi-hop questions whose answers are not explicitly stated in the injected fact. We further introduce a new dataset, Controlled RippleEdit, to evaluate the generalization of our hypernetwork, testing knowledge propagation along relations and entities unseen during hypernetwork training. PropMEND still outperforms existing approaches in unseen entity-relation pairs, yet the performance gap decreases substantially, suggesting future work in propagating knowledge to a wide range of relations.

[170] Can A Gamer Train A Mathematical Reasoning Model?

Andrew Shin

Main category: cs.CL

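TL;DR: The paper shows how to train a capable mathematical reasoning model on a single consumer gaming GPU by combining reinforcement learning with memory optimization techniques, enabling efficient training in resource-constrained settings.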

Details

Motivation: Although large language models excel at tasks such as mathematical reasoning, training them usually demands expensive compute. The paper aims to lower the barrier to high-performance AI research. Method: Combine reinforcement learning with memory optimization techniques to train a 1.5B-parameter mathematical reasoning model on an RTX 3080 Ti (16GB memory). Result: The model matches or outperforms models several times larger on mathematical reasoning benchmarks, with far lower resource requirements. Conclusion: The study challenges the assumption that strong mathematical reasoning requires massive infrastructure and opens new possibilities for resource-constrained researchers. Abstract: While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model, by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B-parameter mathematical reasoning model on an RTX 3080 Ti with 16GB of memory that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. https://github.com/shinandrew/YouronMath.

[171] FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

Qinggang Zhang,Zhishang Xiang,Yilin Xiao,Le Wang,Junhui Li,Xinrun Wang,Jinsong Su

Main category: cs.CL

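TL;DR: FaithfulRAG is a new framework that resolves knowledge conflicts by explicitly modeling the discrepancies between an LLM's parametric knowledge and the retrieved context, outperforming existing methods.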

Details

Motivation: Retrieval-augmented LLMs are unfaithful when knowledge conflicts arise, and existing methods achieve faithfulness by forcibly suppressing the model's parametric knowledge, which damages its internal knowledge structure. Method: FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process that lets the LLM reason about and integrate conflicting facts before generating a response. Result: Experiments show that FaithfulRAG outperforms state-of-the-art methods. Conclusion: By explicitly modeling knowledge conflicts, FaithfulRAG improves the faithfulness and performance of LLMs on knowledge-intensive tasks. Abstract: Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG

[172] Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions

Clara Lachenmaier,Judith Sieker,Sina Zarrieß

Main category: cs.CL

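TL;DR: The study examines how large language models (LLMs) manage common ground in the political domain, particularly when confronted with misinformation.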

Details

Motivation: Investigate how LLMs reach mutual understanding in dialogue when their knowledge is incomplete or misinformation is present, especially in the high-risk political domain. Method: Use direct knowledge questions and loaded questions that presuppose misinformation to assess whether LLMs correct false user beliefs. Result: LLMs face significant challenges in correcting false beliefs and establishing common ground. Conclusion: LLMs have limited ability to mitigate misinformation in political discourse and need further improvement. Abstract: Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other's beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don't) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine the ability of LLMs to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs' ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.

[173] Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Marek Kadlčík,Michal Štefánik,Timothee Mickus,Michal Spiegel,Josef Kuchař

Main category: cs.CL

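TL;DR: Pre-trained language models are prone to arithmetic errors, and existing methods struggle to probe numeric embeddings accurately. The paper proposes a new probing technique that decodes numeric values from embeddings with near-perfect accuracy, showing that models represent numbers precisely after pre-training.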

Details

Motivation: Existing probing methods fail to capture the sinusoidal structure of number embeddings in language models, so arithmetic errors have been poorly explained. Method: Propose a new probing technique that decodes numeric values from input embeddings, testing how precisely models represent numbers. Result: The probe decodes numeric values with near-perfect accuracy, and the precision of the embeddings explains a large portion of the models' arithmetic errors; aligning the embeddings with the discovered pattern reduces these errors. Conclusion: Language models represent numbers precisely after pre-training alone; embedding precision accounts for much of the arithmetic error, and aligning the embedding pattern can improve performance. Abstract: Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models' representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the embeddings' precision, as judged by our probe's accuracy, explains a large portion of LMs' errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.

[174] Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

Yuan Guo,Tingjia Miao,Zheng Wu,Pengzhou Cheng,Ming Zhou,Zhuosheng Zhang

Main category: cs.CL

TL;DR: The paper introduces the UI-NEXUS benchmark and the AGENT-NEXUS scheduling system for evaluating and improving mobile agents on compositional tasks.

Details

Motivation: Existing mobile agents focus mainly on atomic tasks and overlook the compositional tasks that real applications require, and they struggle to balance performance and efficiency. Method: Propose the UI-NEXUS benchmark, covering three categories of compositional operations, and develop the AGENT-NEXUS scheduling system, which dynamically decomposes tasks into atomic subtasks. Result: Existing agents perform poorly on compositional tasks, while AGENT-NEXUS improves task success rates by 24% to 40%. Conclusion: AGENT-NEXUS narrows the performance gap between atomic and compositional tasks without significantly increasing inference overhead. Abstract: Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks (such as shot-chain execution tasks and single-screen grounding tasks) while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks into a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly increasing inference overhead. The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io.

[175] FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents

Satu Hopponen,Tomi Kinnunen,Alexandre Nikolaev,Rosa González Hautamäki,Lauri Tavi,Einar Meister

Main category: cs.CL

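TL;DR: Introduces the FROST-EMA corpus, containing speech data from 18 bilingual speakers, for studying language variability and its effects on speech technology and articulatory patterns.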

Details

Motivation: Study speech variability in bilingual speakers across their native language, second language, and imitated second language (fake accent). Method: Build the FROST-EMA corpus with speech data from 18 bilingual speakers and run two preliminary case studies. Result: The first case study shows how L2 and fake accents affect the performance of a speaker verification system; the second illustrates one speaker's articulatory patterns across the language modes. Conclusion: The FROST-EMA corpus is a new resource for speech technology and articulation research. Abstract: We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography) corpus. It consists of 18 bilingual speakers, who produced speech in their native language (L1), second language (L2), and imitated L2 (fake foreign accent). The new corpus enables research into language variability from phonetic and technological points of view. Accordingly, we include two preliminary case studies to demonstrate both perspectives. The first case study explores the impact of L2 and imitated L2 on the performance of an automatic speaker verification system, while the second illustrates the articulatory patterns of one speaker in L1, L2, and a fake accent.

Yuejiao Wang,Xianmin Gong,Xixin Wu,Patrick Wong,Hoi-lam Helene Fung,Man Wai Mak,Helen Meng

Main category: cs.CL

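TL;DR: The paper proposes a naturalistic language-related fMRI task for early detection of aging-related cognitive decline and neurocognitive disorder (NCD), and validates it on 97 non-demented Chinese older adults.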

Details

Motivation: Early detection is crucial for preventing and slowing the progression of neurocognitive disorder (NCD) in the aging population, and language-related fMRI may be a promising approach. Method: Design a naturalistic language-related fMRI task and classify participants with machine learning on the fMRI features combined with demographics (age, gender, and years of education). Result: The classification models achieve an average AUC of 0.86, and feature localization shows that the key brain regions are associated with language processing (such as the superior temporal gyrus, middle temporal gyrus, and right cerebellum). Conclusion: The naturalistic language-related fMRI task shows potential for early detection of aging-related cognitive decline and NCD. Abstract: Early detection is crucial for timely intervention aimed at preventing and slowing the progression of neurocognitive disorder (NCD), a common and significant health problem among the aging population. Recent evidence has suggested that language-related functional magnetic resonance imaging (fMRI) may be a promising approach for detecting cognitive decline and early NCD. In this paper, we proposed a novel, naturalistic language-related fMRI task for this purpose. We examined the effectiveness of this task among 97 non-demented Chinese older adults from Hong Kong. The results showed that machine-learning classification models based on fMRI features extracted from the task and demographics (age, gender, and education year) achieved an average area under the curve of 0.86 when classifying participants' cognitive status (labeled as NORMAL vs DECLINE based on their scores on a standard neurocognitive test). Feature localization revealed that the fMRI features most frequently selected by the data-driven approach came primarily from brain regions associated with language processing, such as the superior temporal gyrus, middle temporal gyrus, and right cerebellum. The study demonstrated the potential of the naturalistic language-related fMRI task for early detection of aging-related cognitive decline and NCD.

[177] Employing self-supervised learning models for cross-linguistic child speech maturity classification

Theo Zhang,Madurya Suresh,Anne S. Warlaumont,Kasia Hitczenko,Alejandrina Cristia,Margaret Cychosz

Main category: cs.CL

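TL;DR: The paper presents SpeechMaturity, a new dataset for child vocalization classification; its large-scale, multilingual, ecologically valid data substantially improves model classification performance.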

Details

Motivation: Speech technology systems perform poorly on child speech because training data is scarce and child speech is inherently difficult, so better datasets and methods are needed. Method: Train transformer models on the SpeechMaturity dataset (242,004 labeled child vocalizations) to classify vocalization types (cry, laughter, mature speech, and immature speech). Result: Models trained on the new dataset outperform the prior state of the art, reach classification accuracy comparable to humans, and are robust across rural and urban settings. Conclusion: Combining the SpeechMaturity dataset with transformer models provides an effective solution for child vocalization classification. Abstract: Speech technology systems struggle with many downstream tasks for child speech due to small training corpora and the difficulties that child speech poses. We apply a novel dataset, SpeechMaturity, to state-of-the-art transformer models to address a fundamental classification task: identifying child vocalizations. Unlike previous corpora, our dataset captures maximally ecologically valid child vocalizations across an unprecedented sample, comprising children acquiring 25+ languages in the U.S., Bolivia, Vanuatu, Papua New Guinea, Solomon Islands, and France. The dataset contains 242,004 labeled vocalizations, magnitudes larger than previous work. Models were trained to distinguish between cry, laughter, mature (consonant+vowel), and immature speech (just consonant or vowel). Models trained on the dataset outperform state-of-the-art models trained on previous datasets, achieve classification accuracy comparable to humans, and are robust across rural and urban settings.

[178] SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner

Lei Zhang,Jiaxi Yang,Min Yang,Jian Yang,Mouxiang Chen,Jiajun Zhang,Zeyu Cui,Binyuan Hui,Junyang Lin

Main category: cs.CL

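TL;DR: SWE-Flow is a new data synthesis framework grounded in Test-Driven Development (TDD) that automatically infers incremental development steps from unit tests, generates a structured development schedule, and creates verifiable TDD tasks.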

Details

Motivation: Existing software engineering data relies on human-submitted issues, whereas SWE-Flow infers development steps directly from unit tests, which is more efficient and automated. Method: Build a Runtime Dependency Graph (RDG) that captures function interactions and generate a step-by-step development schedule; each step includes a partial codebase, unit tests, and code modifications. Result: 16,061 training instances and 2,020 test instances are generated from real-world GitHub projects, and experiments show that fine-tuning on the dataset significantly improves TDD-based coding performance. Conclusion: SWE-Flow provides an efficient tool for TDD research, and the code, datasets, and models are released to support further work. Abstract: We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images on GitHub at https://github.com/Hambaobao/SWE-Flow.
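
One plausible way to turn a dependency graph into a step-by-step schedule is a topological ordering, so that each function is scheduled only after the functions it calls. The sketch below illustrates that idea only; it is not SWE-Flow's algorithm. The hand-written `dependencies` mapping stands in for a graph that the paper derives from runtime tracing, and the function names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def development_schedule(dependencies):
    """Order functions so that every callee precedes its callers.

    `dependencies` maps each function name to the set of functions it calls.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical call graph for a tiny expression evaluator.
example = {
    "tokenize": set(),
    "parse": {"tokenize"},
    "evaluate": {"parse"},
}
schedule = development_schedule(example)
```

Each position in the schedule could then correspond to one incremental task: implement the next function given the partial codebase built so far and the unit tests that exercise it.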

[179] UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags

Hakyung Sung,Gyu-Ho Shin,Chanyoung Lee,You Kyung Sung,Boo Kyung Jung

Main category: cs.CL

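TL;DR: The study extends Universal Dependencies annotation for second-language (L2) Korean with a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns them with UPOS tags, and it expands the L2-Korean corpus. Experiments show that the aligned data improves annotation consistency and analysis accuracy.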

Details

Motivation: Extend Universal Dependencies annotation for L2 Korean and address the limited size and inconsistent annotation of existing corpora. Method: Introduce a semi-automated XPOS-UPOS alignment framework, expand the corpus, and run fine-tuning experiments with two NLP toolkits. Result: The aligned dataset improves annotation consistency and morphosyntactic analysis accuracy, with the largest gains when annotated data is limited. Conclusion: The XPOS-UPOS alignment framework improves morphosyntactic analysis for L2 Korean. Abstract: The present study extends recent work on Universal Dependencies annotations for second-language (L2) Korean by introducing a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns those constructions with corresponding UPOS categories. We also broaden the existing L2-Korean corpus by annotating 2,998 new sentences from argumentative essays. To evaluate the impact of XPOS-UPOS alignments, we fine-tune L2-Korean morphosyntactic analysis models on datasets both with and without these alignments, using two NLP toolkits. Our results indicate that the aligned dataset not only improves consistency across annotation layers but also enhances morphosyntactic tagging and dependency-parsing accuracy, particularly in cases of limited annotated data.

[180] Learning to Reason Across Parallel Samples for LLM Reasoning

Jianing Qi,Xi Ye,Hao Tang,Zhigang Zhu,Eunsol Choi

Main category: cs.CL

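TL;DR: Proposes SSA, a compact LLM that aggregates multiple sampled answers and is optimized for accuracy with reinforcement learning, outperforming other test-time scaling methods.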

Details

Motivation: Make better use of multiple sampled answers to improve large language model performance in math domains. Method: Train a compact LLM (SSA) with reinforcement learning to aggregate multiple sampled answers into a final answer. Result: On multiple reasoning datasets, SSA outperforms other methods and generalizes well. Conclusion: SSA can efficiently use the outputs of black-box models to improve answer accuracy. Abstract: Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such multiple-sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows a promising generalization ability across sample set sizes, base model families and scales, and tasks. By separating LLMs that generate answers from LLMs that analyze and aggregate sampled answers, our approach can work with the outputs from premier black-box models easily and efficiently.
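
To make the contrast in this abstract concrete, the sketch below shows the majority-voting baseline and, next to it, how a set of samples might be concatenated into a single input for a learned aggregator. The prompt layout in `concatenate_samples` is purely an assumption for illustration; the abstract does not specify SSA's input format.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def concatenate_samples(question, samples):
    """Build one text input holding the question and every sample (hypothetical format)."""
    parts = [f"Question: {question}"]
    for index, sample in enumerate(samples, start=1):
        parts.append(f"Sample {index}: {sample}")
    return "\n".join(parts)
```

Majority voting only looks at final answers, whereas an aggregator that reads the concatenated samples can, in principle, weigh the reasoning inside each one.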

[181] Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features

Hakyung Sung,Karla Csuros,Min-Chang Sung

Main category: cs.CL

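TL;DR: The study compares human and LLM (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b) proofreading of second-language writing and finds that both improve lexical coherence, while LLMs make more generative revisions.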

Details

Motivation: Examine the interventions made by human and LLM proofreading of second-language writing and how consistent their outcomes are. Method: Analyze the proofreading output of humans and three LLMs on identical L2 texts, evaluating lexical and syntactic features. Result: Both human and LLM proofreading enhance bigram lexical features; LLM revisions are more generative, and the three models produce highly consistent outcomes. Conclusion: LLM proofreading is consistent across models and reworks vocabulary and sentence structure more extensively than human proofreading. Abstract: This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.

[182] Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Haozhen Zhang,Tao Feng,Jiaxuan You

Main category: cs.CL

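TL;DR: Router-R1 is a reinforcement learning-based multi-LLM routing framework that dynamically invokes and integrates multiple LLMs to optimize the trade-off between performance and cost.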

Details

Motivation: Existing LLM routers typically support only a single-round, one-to-one mapping and cannot exploit the complementary strengths of multiple LLMs on complex tasks. Method: Router-R1 models routing and aggregation as a sequential decision process, uses the router LLM's reasoning ability to interleave internal deliberation with dynamic model invocation, and guides learning with a lightweight rule-based reward. Result: On seven general and multi-hop QA benchmarks, Router-R1 outperforms several baselines while balancing performance and cost. Conclusion: Router-R1 demonstrates the potential of reinforcement learning for multi-LLM routing, with good generalization and cost management. Abstract: The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for performance and cost trade-off optimization, opening a pathway toward optimizing performance-cost tradeoffs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
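
The abstract names three reward components (format, final outcome, and cost) but gives no formula. The sketch below shows one simple way such a rule-based reward could be combined; the gating on format, the linear cost penalty, and the `cost_weight` parameter are all assumptions made for illustration, not Router-R1's actual reward.

```python
def routing_reward(well_formatted, answer_correct, cost, max_cost, cost_weight=0.5):
    """Combine format, outcome, and cost signals into one scalar (hypothetical rule).

    - Malformed outputs get a fixed penalty regardless of correctness.
    - Correct answers earn 1.0, incorrect ones 0.0.
    - Cost is normalized by max_cost, clipped to [0, 1], and subtracted
      after scaling by cost_weight.
    """
    if max_cost <= 0:
        raise ValueError("max_cost must be positive")
    if not well_formatted:
        return -1.0
    outcome = 1.0 if answer_correct else 0.0
    normalized_cost = min(max(cost / max_cost, 0.0), 1.0)
    return outcome - cost_weight * normalized_cost
```

Raising `cost_weight` in a rule like this pushes a router toward cheaper models at the expense of accuracy, which is the kind of trade-off the abstract describes.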

[183] Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Yaniv Nikankin,Dana Arad,Yossi Gandelsman,Yonatan Belinkov

Main category: cs.CL

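TL;DR: The study finds a performance gap between visual and textual tasks in vision-language models (VLMs); by analyzing computational sub-graphs (circuits) and proposing a training-free representation patching method, it closes part of that gap.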

Details

Motivation: Investigate why vision-language models perform differently on visual and textual tasks, and propose a way to reduce the gap. Method: Compare the circuits used in each modality, analyze differences in data representations, and patch visual token representations from later layers back into earlier layers. Result: Experiments show that the method closes a third of the performance gap between modalities on average. Conclusion: The work explains the multi-modal performance gap and offers an effective training-free method for reducing it. Abstract: Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits (the task-specific computational sub-graphs) in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.

cs.AI [Back]

[184] A Survey on Large Language Models for Mathematical Reasoning

Peng-Yuan Wang,Tian-Shuo Liu,Chenyang Wang,Yi-Di Wang,Shu Yan,Cheng-Xing Jia,Xu-Hui Liu,Xin-Wei Chen,Jia-Cheng Xu,Ziniu Li,Yang Yu

Main category: cs.AI

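TL;DR: The paper surveys the development of mathematical reasoning in large language models (LLMs) through two cognitive phases, comprehension and answer generation, and discusses methods for improvement and future research directions.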

Details

Motivation: Mathematical reasoning is a core challenge in AI research; LLMs have made significant progress in recent years, but fundamental problems of capacity, efficiency, and generalization remain. Method: Review approaches that improve mathematical reasoning through pretraining strategies and answer generation techniques (such as Chain-of-Thought reasoning), ranging from training-free prompting to fine-tuning. Result: Despite progress, LLMs still face challenges in mathematical reasoning; advanced pretraining, knowledge augmentation, and formal reasoning frameworks are promising directions. Conclusion: The survey offers insights for improving LLM reasoning and for applying these techniques to other domains. Abstract: Mathematical reasoning has long represented one of the most fundamental and challenging frontiers in artificial intelligence research. In recent years, large language models (LLMs) have achieved significant advances in this area. This survey examines the development of mathematical reasoning abilities in LLMs through two high-level cognitive phases: comprehension, where models gain mathematical understanding via diverse pretraining strategies, and answer generation, which has progressed from direct prediction to step-by-step Chain-of-Thought (CoT) reasoning. We review methods for enhancing mathematical reasoning, ranging from training-free prompting to fine-tuning approaches such as supervised fine-tuning and reinforcement learning, and discuss recent work on extended CoT and "test-time scaling". Despite notable progress, fundamental challenges remain in terms of capacity, efficiency, and generalization. To address these issues, we highlight promising research directions, including advanced pretraining and knowledge augmentation techniques, formal reasoning frameworks, and meta-generalization through principled learning paradigms. This survey tries to provide some insights for researchers interested in enhancing reasoning capabilities of LLMs and for those seeking to apply these techniques to other domains.

[185] Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning

Kongcheng Zhang,Qi Yao,Shunyu Liu,Yingjie Wang,Baisheng Lai,Jieping Ye,Mingli Song,Dacheng Tao

Main category: cs.AI

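TL;DR: Proposes CoVo, a self-rewarding reinforcement learning framework that strengthens LLM reasoning by exploiting the consistency of intermediate reasoning states, without external supervision.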

Details

Motivation: Reinforcement learning shows great potential on complex reasoning tasks, but its reliance on external supervision limits broader use. Method: Introduces CoVo, an intrinsic reward mechanism that combines consistency and volatility, complemented by a curiosity bonus that encourages exploration. Result: On multiple reasoning benchmarks, CoVo performs comparably to, or even better than, supervised RL. Conclusion: CoVo offers a scalable route to learning to reason without external supervision. Abstract: Recent advances in Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits the broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.

[186] Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery

Yuni Susanti,Michael Färber

Main category: cs.AI

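TL;DR: Proposes a new approach that combines knowledge graphs (KGs) with large language models (LLMs) to improve knowledge-based causal discovery, outperforming most existing baselines.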

Details

Motivation: Traditional causal discovery methods based on observational data have limitations, while existing LLM-based methods produce unstable and unreliable results. Method: Identifies metapath-based subgraphs in a KG, refines the subgraph selection with learning-to-rank models, and incorporates the top-ranked subgraphs into zero-shot prompts to strengthen the LLM's causal reasoning. Result: On biomedical and open-domain datasets, the method outperforms most baselines by up to 44.4 F1 points. Conclusion: Combining KGs with LLMs improves the stability and accuracy of causal discovery and suggests a new approach to causal reasoning in complex systems. Abstract: Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery -- which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) -- offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: https://github.com/susantiyuni/path-to-causality

[187] Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Irene Testini,José Hernández-Orallo,Lorenzo Pacchiardi

Main category: cs.AI

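TL;DR: This paper surveys how large language models (LLMs) are evaluated as assistants and agents for data science. It finds that research concentrates on a small set of goal-oriented activities, neglects data management and exploratory activities, and rarely considers intermediate levels of human-AI collaboration.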

Details

Motivation: To examine the current state and limitations of LLM use in data science, in order to encourage more comprehensive evaluation and more effective human-AI collaboration. Method: A literature survey of how LLM assistants and agents for data science are evaluated. Result: Research concentrates on goal-oriented activities, neglects data management and exploratory activities, and lacks consideration of intermediate levels of human-AI collaboration. Conclusion: Future work should evaluate LLMs in data science more comprehensively and explore the higher levels of automation that task transformation could enable. Abstract: Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, by suggesting ideas, techniques and small code snippets, or for the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances--such as code execution and knowledge bases--that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.

[188] FEDTAIL: Federated Long-Tailed Domain Generalization with Sharpness-Guided Gradient Matching

Sunny Gupta,Nikita Jangid,Shounak Das,Amit Sethi

Main category: cs.AI

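TL;DR: FedTAIL is a federated domain generalization framework that uses sharpness-guided, gradient-aligned optimization to handle long-tailed distributions and conflicting objectives, achieving stable performance on unseen target domains.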

Details

Motivation: Existing methods perform poorly under long-tailed class distributions and conflicting optimization objectives; FedTAIL aims to address both through gradient alignment and class-aware regularization. Method: Introduces a gradient coherence regularizer to reduce conflicts between classification and adversarial objectives, proposes a curvature-aware dynamic weighting scheme for class imbalance, and integrates sharpness-aware perturbations to improve conditional distribution alignment. Result: On standard domain generalization benchmarks, FedTAIL achieves state-of-the-art performance, particularly under domain shift and label imbalance. Conclusion: By unifying optimization harmonization, class-aware regularization, and conditional alignment, FedTAIL proves effective in both centralized and federated settings. Abstract: Domain Generalization (DG) seeks to train models that perform reliably on unseen target domains without access to target data during training. While recent progress in smoothing the loss landscape has improved generalization, existing methods often falter under long-tailed class distributions and conflicting optimization objectives. We introduce FedTAIL, a federated domain generalization framework that explicitly addresses these challenges through sharpness-guided, gradient-aligned optimization. Our method incorporates a gradient coherence regularizer to mitigate conflicts between classification and adversarial objectives, leading to more stable convergence. To combat class imbalance, we perform class-wise sharpness minimization and propose a curvature-aware dynamic weighting scheme that adaptively emphasizes underrepresented tail classes. Furthermore, we enhance conditional distribution alignment by integrating sharpness-aware perturbations into entropy regularization, improving robustness under domain shift. FedTAIL unifies optimization harmonization, class-aware regularization, and conditional alignment into a scalable, federated-compatible framework. Extensive evaluations across standard domain generalization benchmarks demonstrate that FedTAIL achieves state-of-the-art performance, particularly in the presence of domain shifts and label imbalance, validating its effectiveness in both centralized and federated settings. Code: https://github.com/sunnyinAI/FedTail

[189] VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang,Xiufeng Song,Heng Zhou,Yiran Qin,Jie Yang,Xiaohong Liu,Philip Torr,Lei Bai,Zhenfei Yin

Main category: cs.AI

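TL;DR: VIKI-Bench is the first hierarchical benchmark for embodied multi-agent cooperation, combining visual inputs with diverse robot embodiments. The accompanying VIKI-R framework uses two-stage training and significantly improves performance.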

Details

Motivation: To address the challenge of coordinating multiple agents in dynamic environments, in particular vision-driven reasoning and support for diverse embodiments. Method: Proposes the VIKI-Bench benchmark and the VIKI-R framework, which fine-tunes a vision-language model and then applies reinforcement learning. Result: VIKI-R significantly outperforms baseline methods at every task level and exhibits compositional cooperation patterns among heterogeneous agents. Conclusion: VIKI-Bench and VIKI-R provide a unified testbed and method for vision-driven multi-agent cooperation. Abstract: Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

eess.AS [Back]

[190] Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs

Šimon Sedláček,Bolaji Yusuf,Ján Švec,Pradyoth Hegde,Santosh Kesiraju,Oldřich Plchot,Jan Černocký

Main category: eess.AS

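TL;DR: The paper bridges the representation spaces of a speech encoder and an LLM to build a dialogue state tracking approach from open-source components, achieving state-of-the-art results on the SpokenWOZ dataset.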

Details

Motivation: To tackle spoken dialogue state tracking while relying on fully open-source, open-data components. Method: Uses a small connector module to bridge a speech encoder and an LLM, and ablates fine-tuning variants, the effect of the dialogue history, and fuzzy-matching post-processing. Result: The best model reaches 34.66% JGA on the SpokenWOZ test set; using Gemma-2-9B-instruct raises this to 42.17% JGA. Conclusion: The approach performs strongly on spoken dialogue state tracking, and the open-source components show practical potential. Abstract: In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fuzzy matching-based output post-processing, which greatly improves performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
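
The fuzzy-matching post-processing and the JGA metric mentioned in this entry can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ontology contents, the function names (snap_to_ontology, postprocess, joint_goal_accuracy), and the 0.6 similarity cutoff are assumptions made for illustration, and string similarity comes from the standard-library difflib module.

```python
from difflib import get_close_matches

# Hypothetical ontology of valid slot values (illustrative, not from SpokenWOZ).
ONTOLOGY = {
    "restaurant-name": ["pizza hut city centre", "curry garden", "the missing sock"],
    "hotel-area": ["north", "south", "east", "west", "centre"],
}


def snap_to_ontology(slot, value, cutoff=0.6):
    """Return the closest known value for `slot`, or `value` unchanged if none is close."""
    candidates = ONTOLOGY.get(slot, [])
    matches = get_close_matches(value.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else value


def postprocess(state):
    """Apply fuzzy matching to every slot value of one predicted dialogue state."""
    return {slot: snap_to_ontology(slot, value) for slot, value in state.items()}


def joint_goal_accuracy(predicted, reference):
    """Fraction of turns whose complete predicted state equals the reference state."""
    if not reference:
        return 0.0
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


if __name__ == "__main__":
    reference = [{"restaurant-name": "curry garden"}, {"hotel-area": "north"}]
    raw = [{"restaurant-name": "curry gardon"}, {"hotel-area": "north"}]
    fixed = [postprocess(state) for state in raw]
    print(joint_goal_accuracy(raw, reference), joint_goal_accuracy(fixed, reference))
```

Because JGA counts a turn as correct only when every slot matches exactly, a single misrecognized named entity fails the whole turn, which is one plausible reason why snapping values to known entities helps.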

cs.DB [Back]

[191] RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu,Zhihan Zhang,Kate Lin,Yuwei Zhang,Akshay Paruchuri,Hong Yu,Mehran Kazemi,Kumar Ayush,A. Ali Heydari,Maxwell A. Xu,Girish Narayanswamy,Yun Liu,Ming-Zher Poh,Yuzhe Yang,Mark Malhotra,Shwetak Patel,Hamid Palangi,Xuhai Xu,Daniel McDuff,Tim Althoff,Xin Liu

Main category: cs.DB

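TL;DR: RADAR is a benchmark for evaluating how well language models handle data artifacts in tabular data; it shows that frontier models degrade significantly when such artifacts are present.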

Details

Motivation: Language models are increasingly deployed for autonomous data analysis, but their ability to recognize and handle data artifacts (such as missing values and outliers) is underexplored, and these artifacts can seriously compromise the validity of analytical conclusions. Method: Simulates data artifacts through programmatic perturbations and builds the RADAR benchmark of 2980 table-query pairs covering 9 domains and 5 artifact types, systematically varying table size to evaluate model performance. Result: Frontier models perform well on tables without data artifacts but degrade significantly once artifacts are introduced, exposing gaps in their data-aware analysis. Conclusion: RADAR is a flexible, extensible resource for advancing tabular reasoning, supporting diverse perturbation types and controllable table sizes. Abstract: Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
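
As a rough illustration of what simulating data artifacts via programmatic perturbations could look like, the Python sketch below injects missing values and a single outlier into a small table. It is an assumption-laden toy, not RADAR's framework: the table layout (a list of dicts), the function names, and the column names "day" and "steps" are all hypothetical.

```python
import copy
import random


def inject_missing(rows, column, fraction, rng):
    """Return a copy of `rows` with about `fraction` of the `column` values set to None."""
    out = copy.deepcopy(rows)
    k = round(fraction * len(out))
    for i in rng.sample(range(len(out)), k):
        out[i][column] = None
    return out


def inject_outlier(rows, column, scale, rng):
    """Return a copy of `rows` with one `column` value multiplied by `scale`, plus its index."""
    out = copy.deepcopy(rows)
    i = rng.randrange(len(out))
    out[i][column] = out[i][column] * scale
    return out, i


def mean_ignoring_missing(rows, column):
    """A data-aware aggregate: skip missing cells instead of failing or counting them as 0."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(values) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbation is reproducible
    clean = [{"day": d, "steps": 1000 + 10 * d} for d in range(10)]
    with_gaps = inject_missing(clean, "steps", 0.3, rng)
    with_outlier, where = inject_outlier(clean, "steps", 100, rng)
    print(mean_ignoring_missing(clean, "steps"), mean_ignoring_missing(with_gaps, "steps"))
    print("outlier at row", where, "->", with_outlier[where]["steps"])
```

Pairing each perturbed table with its clean original is what makes a targeted evaluation possible: the same query can be asked of both, and any change in a model's answer can be attributed to the injected artifact.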

cs.CR [Back]

[192] GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors

Wenlong Meng,Shuguo Fan,Chengkun Wei,Min Chen,Yuwei Li,Yuanchao Zhang,Zhikun Zhang,Wenzhi Chen

Main category: cs.CR

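TL;DR: GradEscape is a gradient-based attack on AI-generated text detectors that uses weighted embeddings and model parameter updates to evade detection efficiently, outperforming existing evaders in experiments.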

Details

Motivation: To overcome the non-differentiable computation and tokenizer mismatch problems that arise when attacking AI-generated text detectors, and thereby improve attack effectiveness. Method: Introduces weighted embeddings and a warm-started evader, combined with tokenizer inference and model extraction techniques, to optimize the evader model's parameters. Result: Performs strongly across multiple datasets and language models with high parameter efficiency, and was successfully applied to commercial detectors. Conclusion: Disparity in text expression styles within the training data is identified as the primary vulnerability; the authors propose a defense strategy and open-source GradEscape. Abstract: In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the undifferentiable computation problem, caused by the discrete nature of text, by introducing a novel approach to construct weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address the issue of tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors across any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, facilitating effective evasion even in query-only access. We evaluate GradEscape on four datasets and three widely-used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including with an 11B paraphrase model, while utilizing only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source our GradEscape for developing more robust AIGT detectors.

q-bio.BM [Back]

[193] Aligning Proteins and Language: A Foundation Model for Protein Retrieval

Qifeng Wu,Zhengzhe Liu,Han Zhu,Yizhou Zhao,Daisuke Kihara,Min Xu

Main category: q-bio.BM

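TL;DR: Proposes a CLIP-style approach that uses contrastive learning to align 3D protein structures with functional annotations, enabling similarity retrieval in large-scale protein datasets.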

Details

Motivation: To use recent advances in vision-language models (VLMs) to aid the functional interpretation of protein structures obtained from structure determination methods such as cryo-EM. Method: Adopts a CLIP-style framework that aligns 3D protein structures with functional annotations through contrastive learning, trained on a dataset of about 200,000 protein-caption pairs. Result: Shows promising zero-shot retrieval performance on the PDB and EMDB datasets, supporting the potential of multimodal foundation models in protein biology. Conclusion: The approach offers a new multimodal solution for understanding protein structure and function, with broad potential applications. Abstract: This paper aims to retrieve proteins with similar structures and semantics from large-scale protein datasets, facilitating the functional interpretation of protein structures derived by structural determination methods like cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of vision-language models (VLMs), we propose a CLIP-style framework for aligning 3D protein structures with functional annotations using contrastive learning. For model training, we propose a large-scale dataset of approximately 200,000 protein-caption pairs with rich functional descriptors. We evaluate our model in both in-domain and more challenging cross-database retrieval on Protein Data Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In both cases, our approach demonstrates promising zero-shot retrieval performance, highlighting the potential of multimodal foundation models for structure-function understanding in protein biology.
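
For readers unfamiliar with CLIP-style training, the following pure-Python sketch shows the generic symmetric contrastive (InfoNCE) objective on a batch of paired embeddings. It is a minimal sketch under stated assumptions: the paper's encoders, batch construction, and hyperparameters are not given in the abstract, so the function names and the temperature of 0.07 (a value commonly used with CLIP) are illustrative choices, and the toy 2-D vectors stand in for structure and caption embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_style_loss(structure_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (structure, caption) pairs."""
    s = [normalize(v) for v in structure_embs]
    t = [normalize(v) for v in text_embs]
    # logits[i][j]: scaled similarity between structure i and caption j
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t] for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-structure direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    structures = [[1.0, 0.0], [0.0, 1.0]]
    aligned = clip_style_loss(structures, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = clip_style_loss(structures, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned, shuffled)  # the aligned pairing gives the smaller loss
```

At retrieval time, the same normalized dot product serves as the similarity score, so a query embedding from one modality can be ranked against stored embeddings from the other.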

eess.IV [Back]

[194] A System for Accurate Tracking and Video Recordings of Rodent Eye Movements using Convolutional Neural Networks for Biomedical Image Segmentation

Isha Puri,David Cox

Main category: eess.IV

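TL;DR: Proposes a flexible, robust, and highly accurate model for identifying the pupil and corneal reflection in rodents, addressing the fact that existing techniques ignore the distinctive characteristics of rodent eyes.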

Details

Motivation: Existing eye-tracking techniques target human eyes and do not account for the distinctive characteristics of rodent eyes (variability in eye parameters, abundant surrounding hair, small size), so a dedicated method is needed. Method: Uses a convolutional neural network architecture for biomedical image segmentation that can be trained incrementally to accommodate variability in eye parameters, combined with an automated infrared video eye-recording system. Result: A highly accurate and practical method that, to the authors' knowledge, is the first to demonstrate segmentation-based identification of the pupil and corneal reflection in eye images, providing advanced technology for neuroscience and vision science research. Conclusion: The method fills a gap in rodent eye tracking and gives researchers in these fields a more reliable tool. Abstract: Research in neuroscience and vision science relies heavily on careful measurements of animal subjects' gaze direction. Rodents are the most widely studied animal subjects for such research because of their economic advantage and hardiness. Recently, video-based eye trackers that use image processing techniques have become a popular option for gaze tracking because they are easy to use and are completely noninvasive. Although significant progress has been made in improving the accuracy and robustness of eye tracking algorithms, unfortunately, almost all of the techniques have focused on human eyes, which does not account for the unique characteristics of the rodent eye images, e.g., variability in eye parameters, abundance of surrounding hair, and their small size. To overcome these unique challenges, this work presents a flexible, robust, and highly accurate model for pupil and corneal reflection identification in rodent gaze determination that can be incrementally trained to account for variability in eye parameters encountered in the field. To the best of our knowledge, this is the first paper that demonstrates a highly accurate and practical biomedical image segmentation based convolutional neural network architecture for pupil and corneal reflection identification in eye images. This new method, in conjunction with our automated infrared video-based eye recording system, offers the state of the art technology in eye tracking for neuroscience and vision science research for rodents.

[195] Snap-and-tune: combining deep learning and test-time optimization for high-fidelity cardiovascular volumetric meshing

Daniel H. Pak,Shubh Thaker,Kyle Baylous,Xiaoran Zhang,Danny Bluestein,James S. Duncan

Main category: eess.IV

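TL;DR: This paper proposes a new strategy that combines deep learning with test-time optimization to generate high-quality volumetric meshes from medical images, significantly improving spatial accuracy and mesh quality.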

Details

Motivation: Physics-based simulation in personalized medicine requires high-quality volumetric meshes, but existing deep learning methods are limited in high-curvature regions and in the distances between parts. Method: A "snap-and-tune" strategy that first fits an initial shape quickly with deep learning and then applies sample-specific mesh corrections through test-time optimization. Result: Significant improvements in both spatial accuracy and mesh quality, with a fully automated pipeline that needs no additional training labels. Conclusion: The generated meshes demonstrate their versatility and usefulness on two different software platforms, and the code is open source. Abstract: High-quality volumetric meshing from medical images is a key bottleneck for physics-based simulations in personalized medicine. For volumetric meshing of complex medical structures, recent studies have often utilized deep learning (DL)-based template deformation approaches to enable fast test-time generation with high spatial accuracy. However, these approaches still exhibit limitations, such as limited flexibility at high-curvature areas and unrealistic inter-part distances. In this study, we introduce a simple yet effective snap-and-tune strategy that sequentially applies DL and test-time optimization, which combines fast initial shape fitting with more detailed sample-specific mesh corrections. Our method provides significant improvements in both spatial accuracy and mesh quality, while being fully automated and requiring no additional training labels. Finally, we demonstrate the versatility and usefulness of our newly generated meshes via solid mechanics simulations in two different software platforms. Our code is available at https://github.com/danpak94/Deep-Cardiac-Volumetric-Mesh.

[196] Plug-and-Play Linear Attention for Pre-trained Image and Video Restoration Models

Srinivasan Kidambi,Pravin Nair

Main category: eess.IV

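TL;DR: PnP-Nystra is a Nyström-based linear approximation of self-attention. As a plug-and-play module, it can be integrated into pre-trained vision models without retraining and significantly improves computational efficiency.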

Details

Motivation: Multi-head self-attention (MHSA) is computationally expensive in vision models and becomes a bottleneck in real-time and resource-constrained settings. Method: Proposes PnP-Nystra, which replaces MHSA with a linear approximation and applies to a range of window-based transformer architectures. Result: On image and video restoration tasks, PnP-Nystra achieves a 2-4x speed-up on GPU and a 2-5x speed-up on CPU, with a maximum PSNR drop of only 1.5 dB. Conclusion: According to the authors, PnP-Nystra is the first training-free linear attention substitute for MHSA in restoration models, and it significantly improves computational efficiency. Abstract: Multi-head self-attention (MHSA) has become a core component in modern computer vision models. However, its quadratic complexity with respect to input length poses a significant computational bottleneck in real-time and resource-constrained environments. We propose PnP-Nystra, a Nyström-based linear approximation of self-attention, developed as a plug-and-play (PnP) module that can be integrated into pre-trained image and video restoration models without retraining. As a drop-in replacement for MHSA, PnP-Nystra enables efficient acceleration in various window-based transformer architectures, including SwinIR, Uformer, and RVRT. Our experiments across diverse image and video restoration tasks, including denoising, deblurring, and super-resolution, demonstrate that PnP-Nystra achieves a 2-4x speed-up on an NVIDIA RTX 4090 GPU and a 2-5x speed-up on CPU inference. Despite these significant gains, the method incurs a maximum PSNR drop of only 1.5 dB across all evaluated tasks. To the best of our knowledge, we are the first to demonstrate a linear attention functioning as a training-free substitute for MHSA in restoration models.

[197] DCD: A Semantic Segmentation Model for Fetal Ultrasound Four-Chamber View

Donglian Li,Hui Guo,Minglang Chen,Huizhen Chen,Jialing Chen,Bocheng Liang,Pengchen Liang,Ying Tan

Main category: eess.IV

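TL;DR: Proposes DCD, a deep learning model for automatic segmentation of the fetal apical four-chamber view, aiming to improve segmentation accuracy and reduce sonographers' workload.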

Details

Motivation: Accurate segmentation of the fetal apical four-chamber view is essential for early diagnosis of congenital heart disease, but it remains challenging because of ultrasound artifacts, noise, and anatomical variability. Method: DCD combines a Dense ASPP module with a CBAM module to achieve multi-scale feature extraction and adaptive feature representation. Result: DCD effectively captures both local and global contextual information, yielding precise and robust segmentation. Conclusion: DCD can help improve the accuracy of prenatal cardiac assessment. Abstract: Accurate segmentation of anatomical structures in the apical four-chamber (A4C) view of fetal echocardiography is essential for early diagnosis and prenatal evaluation of congenital heart disease (CHD). However, precise segmentation remains challenging due to ultrasound artifacts, speckle noise, anatomical variability, and boundary ambiguity across different gestational stages. To reduce the workload of sonographers and enhance segmentation accuracy, we propose DCD, an advanced deep learning-based model for automatic segmentation of key anatomical structures in the fetal A4C view. Our model incorporates a Dense Atrous Spatial Pyramid Pooling (Dense ASPP) module, enabling superior multi-scale feature extraction, and a Convolutional Block Attention Module (CBAM) to enhance adaptive feature representation. By effectively capturing both local and global contextual information, DCD achieves precise and robust segmentation, contributing to improved prenatal cardiac assessment.

[198] Biologically Inspired Deep Learning Approaches for Fetal Ultrasound Image Classification

Rinat Prochii,Elizaveta Dakhova,Pavel Birulin,Maxim Sharaev

Main category: eess.IV

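TL;DR: Proposes a biologically inspired deep learning ensemble framework that classifies 16 fetal structures in ultrasound images simultaneously, with performance competitive with more elaborate models applied to far fewer categories.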

Details

Motivation: To address the challenges that low image quality, high intra-class variability, and class imbalance pose for fetal ultrasound image classification. Method: A modular two-branch architecture (a shallow path and a detailed path) that combines EfficientNet-B0 and EfficientNet-B6 and is trained with the LDAM-Focal loss. Result: Evaluated on 5,298 clinical images, the model identifies 90% of organs with accuracy > 0.75 and 75% of organs with accuracy > 0.85. Conclusion: Biologically inspired modular stacking is robust and scalable in challenging clinical settings. Abstract: Accurate classification of second-trimester fetal ultrasound images remains challenging due to low image quality, high intra-class variability, and significant class imbalance. In this work, we introduce a simple yet powerful, biologically inspired deep learning ensemble framework that, unlike prior studies focused on only a handful of anatomical targets, simultaneously distinguishes 16 fetal structures. Drawing on the hierarchical, modular organization of biological vision systems, our model stacks two complementary branches (a "shallow" path for coarse, low-resolution cues and a "detailed" path for fine, high-resolution features), concatenating their outputs for final prediction. To our knowledge, no existing method has addressed such a large number of classes with a comparably lightweight architecture. We trained and evaluated on 5,298 routinely acquired clinical images (annotated by three experts and reconciled via Dawid-Skene), reflecting real-world noise and variability rather than a "cleaned" dataset. Despite this complexity, our ensemble (EfficientNet-B0 + EfficientNet-B6 with LDAM-Focal loss) identifies 90% of organs with accuracy > 0.75 and 75% of organs with accuracy > 0.85, performance competitive with more elaborate models applied to far fewer categories. These results demonstrate that biologically inspired modular stacking can yield robust, scalable fetal anatomy recognition in challenging clinical settings.

[199] MAMBO: High-Resolution Generative Approach for Mammography Images

Milica Škipina,Nikola Jovišić,Nicola Dall'Asen,Vanja Švenda,Anil Osman Tur,Slobodan Ilić,Elisa Ricci,Dubravko Ćulibrk

Main category: eess.IV

TL;DR: The paper proposes MAMBO, a diffusion-based method for generating high-resolution mammograms that can be used to enhance AI training and anomaly detection.

Details

Motivation: Privacy and ethical constraints make large, diverse mammography datasets hard to obtain, which limits the training of AI systems. Method: MAMBO uses a patch-based diffusion model that combines local and global context to generate high-resolution mammograms of up to 3840x3840 pixels. Result: Experiments show that MAMBO generates highly realistic images that can be used for classifier training and anomaly detection. Conclusion: MAMBO has potential value for mammography analysis and earlier lesion detection. Abstract: Mammography is the gold standard for the detection and diagnosis of breast cancer. This procedure can be significantly enhanced with Artificial Intelligence (AI)-based software, which assists radiologists in identifying abnormalities. However, training AI systems requires large and diverse datasets, which are often difficult to obtain due to privacy and ethical constraints. To address this issue, the paper introduces MAMmography ensemBle mOdel (MAMBO), a novel patch-based diffusion approach designed to generate full-resolution mammograms. Diffusion models have shown breakthrough results in realistic image generation, yet few studies have focused on mammograms, and none have successfully generated high-resolution outputs required to capture fine-grained features of small lesions. To achieve this, MAMBO integrates separate diffusion models to capture both local and global (image-level) contexts. The contextual information is then fed into the final patch-based model, significantly aiding the noise removal process. This thoughtful design enables MAMBO to generate highly realistic mammograms of up to 3840x3840 pixels. Importantly, this approach can be used to enhance the training of classification models and extended to anomaly detection. Experiments, both numerical and radiologist validation, assess MAMBO's capabilities in image generation, super-resolution, and anomaly detection, highlighting its potential to enhance mammography analysis for more accurate diagnoses and earlier lesion detection.

[200] Enhancing Synthetic CT from CBCT via Multimodal Fusion: A Study on the Impact of CBCT Quality and Alignment

Maximilian Tschuchnig,Lukas Lamminger,Philipp Steininger,Michael Gadermayr

Main category: eess.IV

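TL;DR: Enhances synthetic CT (sCT) generation from CBCT through multimodal learning, with the largest image-quality gains for low-quality CBCT that is well aligned with the preoperative CT.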

Details

Motivation: Although CBCT offers high resolution, it suffers from pronounced artifacts and lower visual quality than conventional CT; multimodal learning that combines preoperative CT with intraoperative CBCT aims to improve sCT generation. Method: Integrates intraoperative CBCT with preoperative CT through multimodal learning and uses a synthetic dataset to analyze how CBCT-CT alignment and CBCT quality affect sCT quality. Result: Multimodal sCT consistently outperforms unimodal baselines, with the largest gains in well-aligned, low-quality CBCT-CT cases. Conclusion: The findings are highly reproducible on real-world clinical datasets, and multimodal learning effectively improves sCT generation. Abstract: Cone-Beam Computed Tomography (CBCT) is widely used for real-time intraoperative imaging due to its low radiation dose and high acquisition speed. However, despite its high resolution, CBCT suffers from significant artifacts and thereby lower visual quality, compared to conventional Computed Tomography (CT). A recent approach to mitigate these artifacts is synthetic CT (sCT) generation, translating CBCT volumes into the CT domain. In this work, we enhance sCT generation through multimodal learning, integrating intraoperative CBCT with preoperative CT. Beyond validation on two real-world datasets, we use a versatile synthetic dataset to analyze how CBCT-CT alignment and CBCT quality affect sCT quality. The results demonstrate that multimodal sCT consistently outperforms unimodal baselines, with the most significant gains observed in well-aligned, low-quality CBCT-CT cases. Finally, we demonstrate that these findings are highly reproducible in real-world clinical datasets.

q-bio.NC [Back]

[201] Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain

Subba Reddy Oota,Khushbu Pahwa,Prachi Jindal,Satya Sai Srinath Namburi,Maneesh Singh,Tanmoy Chakraborty,Bapi S. Raju,Manish Gupta

Main category: q-bio.NC

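TL;DR: Instruction-tuned multimodal large language models (MLLMs) predict brain activity recorded during naturalistic movie viewing significantly better than non-instruction-tuned models, and their layers align hierarchically with the brain.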

Details

Motivation: To fill a gap in prior work, which evaluated the brain alignment of MLLMs without considering multimodal stimuli together with instruction-tuned models. Method: Uses instruction-tuned MLLMs to generate task-specific representations and measures how well they predict neural activity recorded during naturalistic movie viewing. Result: Instruction-tuned video MLLMs outperform non-instruction-tuned multimodal models by 15% and unimodal models by 20%, and their layer hierarchy aligns with brain regions. Conclusion: Task-specific instructions significantly improve the alignment between MLLMs and brain activity and open new avenues for studying multimodal information processing. Abstract: Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, measuring the degree of predictivity of neural activity recorded while participants were watching naturalistic movies (video along with audio) with representations derived from MLLMs. We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both the systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].

cs.HC [Back]

[202] SakugaFlow: A Stagewise Illustration Framework Emulating the Human Drawing Process and Providing Interactive Tutoring for Novice Drawing Skills

Kazuki Kawamura,Jun Rekimoto

Main category: cs.HC

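TL;DR: SakugaFlow is a four-stage pipeline that pairs a diffusion model with a large language model to give novices real-time feedback, supports non-linear revision and branching versions, and turns a black-box generator into a learning tool.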

Details

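Motivation: Current AI illustration tools can generate high-quality images but do not expose the step-by-step process of human artists, so they do not help novices learn. Method: A four-stage pipeline that combines diffusion-based image generation with a language-model tutor providing real-time feedback, and supports non-linear revision and branching into alternative versions. Result: By exposing intermediate outputs and embedding pedagogical dialogue, SakugaFlow becomes a learning environment that supports both creative exploration and skill acquisition. Conclusion: SakugaFlow turns a black-box generator into a tool that supports learning and creation. Abstract: While current AI illustration tools can generate high-quality images from text prompts, they rarely reveal the step-by-step procedure that human artists follow. We present SakugaFlow, a four-stage pipeline that pairs diffusion-based image generation with a large-language-model tutor. At each stage, novices receive real-time feedback on anatomy, perspective, and composition, revise any step non-linearly, and branch alternative versions. By exposing intermediate outputs and embedding pedagogical dialogue, SakugaFlow turns a black-box generator into a scaffolded learning environment that supports both creative exploration and skills acquisition.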

[203] MOSAIC-F: A Framework for Enhancing Students' Oral Presentation Skills through Personalized Feedback

Alvaro Becerra,Daniel Andres,Pablo Villegas,Roberto Daza,Ruth Cobos

Main category: cs.HC

TL;DR: MOSAIC-F is a multimodal feedback framework that combines human assessment with multimodal data analysis to give students personalized feedback.

Details

Motivation: Provide more accurate, personalized learning feedback by integrating multimodal data with human assessment. Method: The framework has four steps: standardized assessment, multimodal data collection, AI-generated feedback, and student self-assessment with feedback visualization. Result: The framework was tested in the context of improving oral presentation skills; the abstract reports no quantitative results. Conclusion: By combining human-based and data-based evaluation, MOSAIC-F enables more accurate, personalized, and actionable feedback. Abstract: In this article, we present a novel multimodal feedback framework called MOSAIC-F, an acronym for a data-driven Framework that integrates Multimodal Learning Analytics (MMLA), Observations, Sensors, Artificial Intelligence (AI), and Collaborative assessments for generating personalized feedback on student learning activities. This framework consists of four key steps. First, peers' and professors' assessments are conducted through standardized rubrics (that include both quantitative and qualitative evaluations). Second, multimodal data are collected during learning activities, including video recordings, audio capture, gaze tracking, physiological signals (heart rate, motion data), and behavioral interactions. Third, personalized feedback is generated using AI, synthesizing human-based evaluations and data-based multimodal insights such as posture, speech patterns, stress levels, and cognitive load, among others. Finally, students review their own performance through video recordings and engage in self-assessment and feedback visualization, comparing their own evaluations with peers' and professors' assessments, class averages, and AI-generated recommendations. By combining human-based and data-based evaluation techniques, this framework enables more accurate, personalized and actionable feedback. We tested MOSAIC-F in the context of improving oral presentation skills.

cs.RO [Back]

[204] PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

Liang Ma,Jiajun Wen,Min Lin,Rongtao Xu,Xiwen Liang,Bingqian Lin,Jun Ma,Yongxin Wang,Ziming Wei,Haokun Lin,Mingfei Han,Meng Cao,Bokui Chen,Ivan Laptev,Xiaodan Liang

Main category: cs.RO

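TL;DR: PhyBlock is a progressive benchmark for evaluating the physical understanding and planning abilities of vision-language models (VLMs), testing spatial reasoning and physical comprehension through 3D block assembly and VQA tasks. Experiments show that VLMs are limited in high-level planning and reasoning, with performance dropping markedly as task complexity grows.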

Details

Motivation: Existing VLMs have limited ability to understand physical phenomena in structured 3D environments, and a new evaluation method is needed to fill this gap. Method: The PhyBlock benchmark comprises 2600 tasks (400 assembly tasks and 2200 VQA tasks) and evaluates models along three dimensions: partial completion, failure diagnosis, and planning robustness. Result: An evaluation of 21 state-of-the-art VLMs shows weak high-level planning and reasoning, with performance dropping markedly on complex tasks. Error analysis reveals difficulties with spatial orientation and dependency reasoning. Conclusion: PhyBlock offers a unified testbed for advancing embodied reasoning, exposes the limitations of VLMs in physical problem-solving, and underlines the need for further improvement. Abstract: While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining notably as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting spatial tasks heavily rely on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.

math.NA [Back]

[205] Normalized Radon Cumulative Distribution Transforms for Invariance and Robustness in Optimal Transport Based Image Classification

Matthias Beckmann,Robert Beinert,Jonas Bresch

Main category: math.NA

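TL;DR: This paper studies the robustness of the max-normalized R-CDT under non-affine image deformations and proposes a mean-normalized R-CDT that handles a broader range of image deformations as well as impulsive noise.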

Details

Motivation: In real-world applications, images are subject to general affine transformations introduced by the measurement process; the paper addresses this and further studies the robustness of the R-CDT under non-affine deformations. Method: The max-normalized and mean-normalized R-CDT are analyzed, with robustness established under the condition that a Wasserstein distance between samples can be controlled. Result: Theoretical analysis and numerical experiments show that the new feature extractors are robust to local non-affine deformations and impulsive noise. Conclusion: The max-normalized and mean-normalized R-CDT are effective for image classification, particularly in the small data regime. Abstract: The Radon cumulative distribution transform (R-CDT) is an easy-to-compute feature extractor that facilitates image classification tasks, especially in the small data regime. It is closely related to the sliced Wasserstein distance and provably guarantees the linear separability of image classes that emerge from translations or scalings. In many real-world applications, like the recognition of watermarks in filigranology, however, the data is subject to general affine transformations originating from the measurement process. To overcome this issue, we recently introduced the so-called max-normalized R-CDT that only requires elementary operations and guarantees the separability under arbitrary affine transformations. The aim of this paper is to continue our study of the max-normalized R-CDT, especially with respect to its robustness against non-affine image deformations. Our sensitivity analysis shows that its separability properties are stable provided the Wasserstein-infinity distance between the samples can be controlled. Since the Wasserstein-infinity distance only allows small local image deformations, we moreover introduce a mean-normalized version of the R-CDT. In this case, robustness relates to the Wasserstein-2 distance and also covers image deformations caused by impulsive noise, for instance. Our theoretical results are supported by numerical experiments showing the effectiveness of our novel feature extractors as well as their robustness against local non-affine deformations and impulsive noise.

cs.LG [Back]

[206] Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining

Chenxi Liu,Tianyi Xiong,Ruibo Chen,Yihan Wu,Junfeng Guo,Tianyi Zhou,Heng Huang

Main category: cs.LG

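TL;DR: MBPO is a new preference learning framework that addresses modality imbalance in LMMs by generating hard negatives and using online verified rewards, improving performance and reducing hallucinations.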

Details

Motivation: Existing LMMs suffer from modality imbalance during reasoning: language priors dominate visual inputs, which hurts generalization to downstream tasks and causes hallucinations. Existing preference optimization methods do not effectively restrain the internal biases of the LLM backbone and rely on offline data. Method: MBPO builds an offline preference dataset by generating hard negatives through adversarial perturbation of input images, generates online data with verified rewards from close-ended tasks, and trains on the combined data with GRPO. Result: Experiments show that MBPO improves LMM performance on vision-language tasks and effectively reduces hallucinations. Conclusion: Through modality-balancing preference optimization, MBPO addresses modality imbalance in LMMs and suggests a new direction for future research. Abstract: The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., outweighing language prior biases over visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. However, existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.

[207] Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning

Hanbing Liu,Lang Cao,Yuanyi Ren,Mengyu Zhou,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang

Main category: cs.LG

TL;DR: Bingo is an RL framework that refines length-based reward design to make language model reasoning more efficient while preserving accuracy.

Details

Motivation: Existing methods for improving reasoning efficiency often lose accuracy because they rely on simple length-based rewards; a finer-grained reward design is needed. Method: Bingo introduces a significance-aware length reward and a dynamic length reward, gradually removing insignificant tokens and adjusting reasoning depth over the course of training. Result: Experiments show that Bingo outperforms baseline methods on multiple benchmarks and achieves a favorable balance between accuracy and efficiency. Conclusion: Bingo demonstrates the potential of explicitly training language models for efficient reasoning. Abstract: Large language models have demonstrated impressive reasoning capabilities, yet they often suffer from inefficiencies due to unnecessarily verbose or redundant outputs. While many works have explored reinforcement learning (RL) to enhance reasoning abilities, most primarily focus on improving accuracy, with limited attention to reasoning efficiency. Some existing approaches introduce direct length-based rewards to encourage brevity, but this often leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL framework that advances length-based reward design to boost efficient reasoning. Bingo incorporates two key mechanisms: a significance-aware length reward, which gradually guides the model to reduce only insignificant tokens, and a dynamic length reward, which initially encourages elaborate reasoning for hard questions but decays over time to improve overall efficiency. Experiments across multiple reasoning benchmarks show that Bingo improves both accuracy and efficiency. It outperforms the vanilla reward and several other length-based reward baselines in RL, achieving a favorable trade-off between accuracy and efficiency. These results underscore the potential of training LLMs explicitly for efficient reasoning.

[208] AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists

Yifei Li,Hanane Nour Moussa,Ziru Chen,Shijie Chen,Botao Yu,Mingyi Xue,Benjamin Burns,Tzu-Yao Chiu,Vishal Dey,Zitong Lu,Chen Wei,Qianheng Zhang,Tianyu Zhang,Song Gao,Xuhui Huang,Xia Ning,Nesreen K. Ahmed,Ali Payani,Huan Sun

Main category: cs.LG

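TL;DR: AutoSDT is a pipeline that automatically collects high-quality coding tasks for scientific discovery; it is used to build the AutoSDT-5K dataset and to train AutoSDT-Coder models whose performance approaches that of GPT-4o.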

Details

Motivation: Address the scarcity of high-quality data for scientific discovery and accelerate AI-assisted science. Method: The pipeline uses the coding capabilities and parametric knowledge of LLMs to automatically search for sources, select tasks, and synthesize task instructions and code solutions. Result: The resulting AutoSDT-5K dataset contains 5,404 tasks, and the trained models improve substantially on two data-driven discovery benchmarks, approaching GPT-4o. Conclusion: AutoSDT provides a high-quality dataset and effective models for scientific discovery, strengthening the prospects of AI-assisted science. Abstract: Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of 256 tasks shows the effectiveness of AutoSDT: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. Trained on AutoSDT-5K, the Qwen2.5-Coder-Instruct LLM series, dubbed AutoSDT-Coder, show substantial improvement on two challenging data-driven discovery benchmarks, ScienceAgentBench and DiscoveryBench. Most notably, AutoSDT-Coder-32B reaches the same level of performance as GPT-4o on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model. On DiscoveryBench, it lifts the hypothesis matching score to 8.1, bringing a 17.4% relative improvement and closing the gap between open-weight models and GPT-4o.

[209] Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints

Yaswanth Chittepu,Blossom Metevier,Will Schwarzer,Austin Hoag,Scott Niekum,Philip S. Thomas

Main category: cs.LG

TL;DR: HC-RLHF uses high-confidence safe reinforcement learning to keep language models both safe and helpful in sensitive domains.

Details

Motivation: Existing methods often treat safety as a tradeoff against helpfulness, which can lead to unacceptable responses in sensitive domains. Method: HC-RLHF decouples human preferences into helpfulness and safety, trains a reward model and a cost model respectively, and then optimizes and verifies safety in a two-step process. Result: Experiments show that HC-RLHF produces safe models with high probability and can improve both harmlessness and helpfulness over previous methods. Conclusion: HC-RLHF offers a reliable approach to language model alignment that accounts for both safety and helpfulness. Abstract: Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable responses in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences into helpfulness and harmlessness (safety), which are learned by training a reward model and a cost model, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function under an intentionally pessimistic version of the cost constraint. In the second step, the trained model undergoes a safety test to verify whether its performance stays within an upper-confidence bound of the actual cost constraint. We provide a theoretical analysis of HC-RLHF, including proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability and can improve harmlessness and helpfulness compared to previous methods.
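Illustrative note on the second step: the abstract does not say which confidence bound the safety test uses, so the sketch below assumes a one-sided Hoeffding bound on held-out costs that lie in [0, 1]. The function names, the sample costs, and the threshold are hypothetical and are not taken from the paper.

```python
import math


def hoeffding_upper_bound(costs, delta, cost_range=1.0):
    """One-sided (1 - delta) upper confidence bound on the mean cost.

    Assumption: costs are i.i.d. and lie in an interval of width `cost_range`.
    The Hoeffding bound is a stand-in; the paper may use a different bound.
    """
    n = len(costs)
    mean_cost = sum(costs) / n
    return mean_cost + cost_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def passes_safety_test(costs, threshold, delta):
    """Accept the model only if the upper bound stays at or below the threshold."""
    return hoeffding_upper_bound(costs, delta) <= threshold


# Hypothetical held-out costs: same empirical mean, different sample sizes.
many_samples = [0.1] * 200
few_samples = [0.1] * 20
print(passes_safety_test(many_samples, threshold=0.2, delta=0.05))
print(passes_safety_test(few_samples, threshold=0.2, delta=0.05))
```

With the same empirical mean cost, the larger sample passes and the smaller one fails, because the bound tightens as the number of held-out samples grows. A test of this kind can decline to return a model even when its average cost looks acceptable.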

[210] From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium

Xie Yi,Zhanke Zhou,Chentao Cao,Qiyu Niu,Tongliang Liu,Bo Han

Main category: cs.LG

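TL;DR: ECON models multi-LLM coordination as an incomplete-information game and seeks a Bayesian Nash equilibrium, substantially improving reasoning while reducing computational cost.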

Details

Motivation: Multi-agent frameworks can strengthen LLM reasoning, but they are usually computationally expensive and lack convergence guarantees. Method: Multi-LLM coordination is treated as an incomplete-information game; the ECON framework combines distributed reasoning with a centralized final output and avoids costly inter-agent exchanges. Result: ECON outperforms existing methods by 11.2% on average across six benchmarks and is shown to scale to additional models. Conclusion: ECON provides an efficient and scalable approach to building more powerful multi-LLM ensembles. Abstract: Multi-agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi-LLM coordination as an incomplete-information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co-agents, without requiring costly inter-agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non-equilibrium multi-agent schemes. Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON's ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: https://github.com/tmlr-group/ECON.

[211] From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

Zhanke Zhou,Xiao Feng,Zhaocheng Zhu,Jiangchao Yao,Sanmi Koyejo,Bo Han

Main category: cs.LG

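TL;DR: AR-Bench is a new benchmark for evaluating the active reasoning abilities of large language models (LLMs); it finds that current models perform poorly at active reasoning.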

Details

Motivation: Existing benchmarks mainly assess passive reasoning, while active reasoning, which requires interacting with external systems to obtain information, has received little systematic study. Method: AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that simulate real-world scenarios and test commonsense, logical, and symbolic reasoning. Result: Contemporary LLMs perform poorly at active reasoning, struggle to acquire or use the information they need, and gain little from improvement strategies. Conclusion: New methods, such as interactive learning, real-time feedback, and environment-aware objectives, are needed to improve active reasoning. Abstract: While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning, where an LLM must interact with external systems to acquire missing evidence or data, has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.

[212] Reinforce LLM Reasoning through Multi-Agent Reflection

Yurun Yuan,Tengyang Xie

Main category: cs.LG

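TL;DR: The paper proposes DPSDP, a reinforcement learning algorithm that trains an LLM system to refine its answers iteratively, improving reasoning ability.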

Details

Motivation: Existing approaches suffer from restricted feedback spaces and a lack of coordinated training, which leads to suboptimal performance. Method: Multi-turn refinement is modeled as a Markov Decision Process, and the DPSDP algorithm trains an actor-critic LLM system via direct preference learning on self-generated data. Result: On MATH 500 with Ministral-based models, majority voting over five refinement steps raises accuracy from 58.2% at the first turn to 63.2%. Conclusion: An ablation study confirms the benefits of multi-agent collaboration and out-of-distribution generalization. Abstract: Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi-turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on benchmark MATH 500, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization.
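Illustrative note on the majority-voting step: a minimal sketch of voting over the final answers produced across refinement turns. How the paper extracts and normalizes answers is not stated in the abstract, so the string answers and the tie-breaking rule here are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers collected from successive refinement turns.
turn_answers = ["12", "15", "12", "12", "15"]
print(majority_vote(turn_answers))
```

Voting across turns means a late refinement that drifts to a wrong answer does not automatically override earlier turns that agreed with each other.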

[213] Reinforcement Learning Teachers of Test Time Scaling

Edoardo Cetin,Tianyu Zhao,Yujin Tang

Main category: cs.LG

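TL;DR: The paper proposes Reinforcement-Learned Teachers (RLTs), a framework that sidesteps RL's exploration challenge and focuses on producing the most effective downstream distillation. By writing detailed explanations that connect each question to its solution, RLTs substantially improve student model performance.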

Details

Motivation: Training reasoning language models with conventional RL depends on the model's initial ability to explore, and such models are mainly used to distill new students rather than being deployed directly. RLTs aim to sidestep the exploration challenge and optimize for distillation quality. Method: RLTs are prompted with both the question and its solution and asked to produce detailed explanations; they are trained with dense rewards based on how well the student model understands the solution from each explanation. Result: A 7B RLT yields higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines, remains effective when training larger students, and transfers zero-shot to out-of-distribution tasks. Conclusion: RLTs offer an efficient and reusable approach within the RL reasoning framework, improving both performance and range of application. Abstract: Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.

[214] The Geometries of Truth Are Orthogonal Across Tasks

Waiss Azizian,Michael Kirchhof,Eugene Ndiaye,Louis Bethune,Michal Klein,Pierre Ablin,Marco Cuturi

Main category: cs.LG

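TL;DR: The study finds that methods which use activation vectors to tell correct LLM answers from incorrect ones are task-dependent and do not generalize across tasks.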

Details

Motivation: Examine whether inference-time activation vectors can serve as a general signal of answer correctness, and expose the limitations of existing approaches. Method: Linear classifiers are trained on different tasks and compared for similarity, and more sophisticated approaches that mix probes and tasks are also tested. Result: Activation vectors differ markedly across tasks, the supports of sparse linear classifiers are almost disjoint, and more sophisticated approaches do not overcome the problem. Conclusion: The "geometry of truth" in LLMs is task-dependent, and its ability to generalize across tasks is limited. Abstract: Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still mired by concerns about their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a "geometry of truth" can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these "geometries of truth" are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained across distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.
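Illustrative note on comparing probes: one simple way to quantify "little similarity" and "almost disjoint supports" is the cosine similarity of two probe weight vectors and the Jaccard index of their nonzero coordinates. The abstract does not name the exact measures used, so both metrics, the tolerance, and the toy weight vectors below are assumptions.

```python
import math


def support(weights, tol=1e-8):
    """Indices of the coordinates a sparse probe actually uses."""
    return {i for i, w in enumerate(weights) if abs(w) > tol}


def jaccard(a, b):
    """Overlap of two index sets; 0.0 means fully disjoint supports."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sparse probe weights learned on two different tasks.
probe_task_a = [0.8, 0.0, 0.0, -1.2, 0.0]
probe_task_b = [0.0, 0.5, 0.0, 0.0, 0.9]
print(jaccard(support(probe_task_a), support(probe_task_b)))
print(cosine_similarity(probe_task_a, probe_task_b))
```

In this toy case the two probes use disjoint coordinates, so both measures are zero, which is the pattern the abstract describes for probes trained on distinct tasks.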

[215] Gridding Forced Displacement using Semi-Supervised Learning

Andrew Wells,Geraldine Henningsen,Brice Bolane Tchinde Kengne

Main category: cs.LG

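TL;DR: A semi-supervised method disaggregates refugee statistics from administrative boundaries to 0.5-degree grid cells, combining several data sources to map refugee distribution at high granularity.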

Details

Motivation: Conventional refugee statistics are too coarse at the regional and national level; the goal is to reveal finer-grained displacement patterns. Method: UNHCR registration data, building footprints from Google Open Buildings, and place coordinates from OpenStreetMap are integrated, and a label spreading algorithm produces spatially explicit grid-level data. Result: The method achieves 92.9% average accuracy in assigning more than 10 million refugee observations to grid cells and reveals localized displacement patterns. Conclusion: The high-resolution dataset provides a foundation for studying the drivers of displacement in more depth. Abstract: We present a semi-supervised approach that disaggregates refugee statistics from administrative boundaries to 0.5-degree grid cells across 25 Sub-Saharan African countries. By integrating UNHCR's ProGres registration data with satellite-derived building footprints from Google Open Buildings and location coordinates from OpenStreetMap Populated Places, our label spreading algorithm creates spatially explicit refugee statistics at high granularity. This methodology achieves 92.9% average accuracy in placing over 10 million refugee observations into appropriate grid cells, enabling the identification of localized displacement patterns previously obscured in broader regional and national statistics. The resulting high-resolution dataset provides a foundation for a deeper understanding of displacement drivers.
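Illustrative note on the target grid: the sketch below shows only how a coordinate maps to a 0.5-degree cell. It does not implement the label spreading step. The grid origin at latitude 0, longitude 0 and the (row, column) indexing are assumptions, since the abstract does not describe the grid convention.

```python
import math

CELL_SIZE_DEG = 0.5  # grid resolution stated in the abstract


def grid_cell(lat, lon, cell_size=CELL_SIZE_DEG):
    """Map a coordinate to integer (row, col) indices of its grid cell.

    Assumption: cells are aligned to latitude 0 and longitude 0.
    """
    return math.floor(lat / cell_size), math.floor(lon / cell_size)


def cell_southwest_corner(row, col, cell_size=CELL_SIZE_DEG):
    """Latitude and longitude of the cell's south-west corner."""
    return row * cell_size, col * cell_size


# Hypothetical point in decimal degrees.
row, col = grid_cell(9.3, 13.4)
print(row, col, cell_southwest_corner(row, col))
```

Using `math.floor` rather than integer truncation keeps cells consistent on both sides of the equator and the prime meridian, where coordinates change sign.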

[216] Bi-level Unbalanced Optimal Transport for Partial Domain Adaptation

Zi-Ying Chen,Chuan-Xian Ren,Hong Yan

Main category: cs.LG

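TL;DR: BUOT uses a bi-level unbalanced optimal transport framework to characterize sample-wise and class-wise relations jointly, addressing outlier class identification and knowledge transfer in partial domain adaptation.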

Details

Motivation: Partial domain adaptation requires aligning cross-domain samples while identifying outlier classes. Existing weighting frameworks characterize only sample-wise relations, identify outlier classes poorly, and are sensitive to inaccurate predictions. Method: The Bi-level Unbalanced Optimal Transport (BUOT) model combines sample-level and class-level transport through a cooperation mechanism that supplies both structural and discriminative information. Result: Experiments on benchmark datasets validate the competitiveness of BUOT. Conclusion: BUOT addresses partial domain adaptation effectively and improves outlier class identification and knowledge transfer. Abstract: The partial domain adaptation (PDA) problem requires aligning cross-domain samples while distinguishing the outlier classes for accurate knowledge transfer. The widely used weighting framework tries to address the outlier classes by introducing the reweighted source domain with a similar label distribution to the target domain. However, the empirical modeling of weights can only characterize the sample-wise relations, which leads to insufficient exploration of cluster structures, and the weights could be sensitive to the inaccurate prediction and cause confusion on the outlier classes. To tackle these issues, we propose a Bi-level Unbalanced Optimal Transport (BUOT) model to simultaneously characterize the sample-wise and class-wise relations in a unified transport framework. Specifically, a cooperation mechanism between sample-level and class-level transport is introduced, where the sample-level transport provides essential structure information for the class-level knowledge transfer, while the class-level transport supplies discriminative information for the outlier identification. The bi-level transport plan provides guidance for the alignment process. By incorporating the label-aware transport cost, the local transport structure is ensured and a fast computation formulation is derived to improve the efficiency. Extensive experiments on benchmark datasets validate the competitiveness of BUOT.

[217] UniVarFL: Uniformity and Variance Regularized Federated Learning for Heterogeneous Data

Sunny Gupta,Nikita Jangid,Amit Sethi

Main category: cs.LG

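TL;DR: UniVarFL is a new federated learning framework that uses two regularization strategies to counter the performance degradation caused by non-IID data, outperforming existing methods.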

Details

Motivation: Federated learning degrades severely on non-IID data, and traditional remedies are costly or adapt poorly. Method: UniVarFL applies Classifier Variance Regularization and Hyperspherical Uniformity Regularization during local training to emulate IID-like training dynamics. Result: UniVarFL outperforms existing methods in accuracy on multiple benchmark datasets. Conclusion: UniVarFL is an efficient, scalable solution suited to resource-constrained real-world deployments. Abstract: Federated Learning (FL) often suffers from severe performance degradation when faced with non-IID data, largely due to local classifier bias. Traditional remedies such as global model regularization or layer freezing either incur high computational costs or struggle to adapt to feature shifts. In this work, we propose UniVarFL, a novel FL framework that emulates IID-like training dynamics directly at the client level, eliminating the need for global model dependency. UniVarFL leverages two complementary regularization strategies during local training: Classifier Variance Regularization, which aligns class-wise probability distributions with those expected under IID conditions, effectively mitigating local classifier bias; and Hyperspherical Uniformity Regularization, which encourages a uniform distribution of feature representations across the hypersphere, thereby enhancing the model's ability to generalize under diverse data distributions. Extensive experiments on multiple benchmark datasets demonstrate that UniVarFL outperforms existing methods in accuracy, highlighting its potential as a highly scalable and efficient solution for real-world FL deployments, especially in resource-constrained settings. Code: https://github.com/sunnyinAI/UniVarFL
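Illustrative note on hyperspherical uniformity: the abstract does not give the exact form of the regularizer. As a stand-in, the sketch below uses the Gaussian-potential uniformity loss known from the contrastive learning literature, log of the mean of exp(-t * squared distance) over pairs of unit-normalized features. Treat the formula choice, the function names, and the temperature `t=2.0` as assumptions; this is not UniVarFL's implementation.

```python
import math
from itertools import combinations


def normalize(vector):
    """Project a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def uniformity_loss(features, t=2.0):
    """Log of the mean pairwise Gaussian potential; lower means more uniform."""
    points = [normalize(f) for f in features]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(p, q)))
        for p, q in combinations(points, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Hypothetical 2-D features: one collapsed set, one spread over the circle.
collapsed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(uniformity_loss(collapsed), uniformity_loss(spread))
```

Features that collapse onto a single direction give the maximum loss of zero, while features spread over the sphere give a negative value, so minimizing this term pushes representations apart.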

[218] SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

Xiao Liang,Zhong-Zhi Li,Yeyun Gong,Yang Wang,Hengyuan Zhang,Yelong Shen,Ying Nian Wu,Weizhu Chen

Main category: cs.LG

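TL;DR: The paper proposes SwS, a Self-aware Weakness-driven problem Synthesis framework that identifies a model's weaknesses during reinforcement learning and generates targeted problems, substantially improving language model performance on complex reasoning tasks.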

Details

Motivation: Existing problem sets are of insufficient quality and poorly targeted, which limits the effectiveness of reinforcement learning for language model training. Method: The framework identifies the model's failure cases during training, extracts their core concepts, and synthesizes new problems that target the model's weak areas. Result: Across eight mainstream reasoning benchmarks, the 7B and 32B models gain 10.0% and 7.7% on average, respectively. Conclusion: Without external knowledge distillation, SwS achieves robust generalization by having the model identify and address its own weaknesses. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.

[219] e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Amrith Setlur,Matthew Y. R. Yang,Charlie Snell,Jeremy Greer,Ian Wu,Virginia Smith,Max Simchowitz,Aviral Kumar

Main category: cs.LG

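TL;DR: The paper proposes e3, a recipe that trains LLMs to perform in-context exploration so that reasoning improves and performance keeps scaling beyond the training compute budget.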

Details

Motivation: Existing reasoning models extrapolate poorly beyond their training compute budget; a method is needed that lets performance improve as the model "thinks" for longer. Method: The e3 recipe has three ingredients: (1) chaining operations such as generation and verification; (2) using "negative" gradients from incorrect traces to amplify exploration; and (3) a curriculum that couples task difficulty with the training token budget. Result: e3 produces the best known 1.7B model on AIME'25 and HMMT'25 and extrapolates to 2x the training token budget. Conclusion: The e3-1.7B model attains high pass@1 scores and also improves pass@k over the base model. Abstract: Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically-designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
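Illustrative note on pass@1 and pass@k: the abstract reports both metrics without saying how they are estimated. A common choice is the unbiased estimator 1 - C(n - c, k) / C(n, k), computed from n sampled solutions of which c are correct. Whether e3 uses this estimator is an assumption, and the sample counts below are hypothetical.

```python
import math


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated for a problem
    c: number of those samples that are correct
    k: number of samples the metric is allowed to draw (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every draw of k samples must include a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical problem: 10 samples, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why pass@1 reflects single-attempt accuracy while larger k rewards diversity among samples.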

[220] An Adaptive Method Stabilizing Activations for Enhanced Generalization

Hyunseok Seung,Jaewoo Lee,Hyunsuk Ko

Main category: cs.LG

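TL;DR: AdaAct is a new optimization algorithm that adjusts learning rates according to activation variance, stabilizing neuron outputs and performing competitively on standard image classification benchmarks.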

Details

Motivation: Conventional activation regularization methods lack neuron-level adaptivity during training; AdaAct aims to fill this gap while combining the fast convergence of Adam with the strong generalization of SGD. Method: AdaAct adjusts learning rates dynamically according to activation variance, adding neuron-wise adaptivity during training to stabilize neuron outputs. Result: On standard image classification benchmarks such as CIFAR and ImageNet, AdaAct performs competitively while bridging the convergence speed of Adam and the generalization of SGD. Conclusion: AdaAct is an effective optimizer that balances convergence speed and generalization on the image classification benchmarks evaluated. Abstract: We introduce AdaAct, a novel optimization algorithm that adjusts learning rates according to activation variance. Our method enhances the stability of neuron outputs by incorporating neuron-wise adaptivity during the training process, which subsequently leads to better generalization -- a complementary approach to conventional activation regularization methods. Experimental results demonstrate AdaAct's competitive performance across standard image classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing it with other state-of-the-art methods. Importantly, AdaAct effectively bridges the gap between the convergence speed of Adam and the strong generalization capabilities of SGD, all while maintaining competitive execution times. Code is available at https://github.com/hseung88/adaact.
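
To make "learning rates adjusted according to activation variance" concrete, here is a minimal sketch of one possible neuron-wise rule: neurons whose outputs vary more across a batch take smaller steps. This is an assumed illustration, not the update rule from the paper or from the linked repository; the function name, the `1 + var` damping, and `base_lr` are all hypothetical.

```python
def variance_scaled_lrs(activations, base_lr=0.1):
    """Return one learning rate per neuron from a batch of activations.

    activations: list of rows, one per example, each a list of
    per-neuron outputs. Hypothetical rule, NOT AdaAct's actual update.
    """
    n = len(activations)
    num_neurons = len(activations[0])
    lrs = []
    for j in range(num_neurons):
        col = [row[j] for row in activations]
        mean = sum(col) / n
        var = sum((a - mean) ** 2 for a in col) / n
        # Assumed scaling: zero variance keeps base_lr, and higher
        # variance shrinks the step for that neuron.
        lrs.append(base_lr / (1.0 + var) ** 0.5)
    return lrs
```

A real optimizer would track a running estimate of the variance across steps rather than recomputing it from a single batch.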

[221] Boosting Gradient Leakage Attacks: Data Reconstruction in Realistic FL Settings

Mingyuan Fan,Fuyi Wang,Cen Chen,Jianying Zhou

Main category: cs.LG

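TL;DR: This paper shows empirically that client data can still be effectively reconstructed even in realistic federated learning settings, and proposes FedLeak to address the performance bottlenecks of gradient leakage attacks.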

Details

Motivation: Examine privacy risk in federated learning, in particular whether gradient leakage attacks remain effective in realistic settings, a question on which prior work disagrees. Method: Proposes FedLeak, which introduces two new techniques, partial gradient matching and gradient regularization, and formulates a practical evaluation protocol grounded in the FL literature and industry practice. Result: Under this protocol, FedLeak still achieves high-fidelity data reconstruction, exposing a significant vulnerability in FL systems. Conclusion: Federated learning systems carry significant privacy risk and urgently need more effective defenses. Abstract: Federated learning (FL) enables collaborative model training among multiple clients without the need to expose raw data. Its ability to safeguard privacy, at the heart of FL, has recently been a hot-button debate topic. To elaborate, several studies have introduced a type of attack known as gradient leakage attacks (GLAs), which exploit the gradients shared during training to reconstruct clients' raw data. On the flip side, some literature contends that there is no substantial privacy risk in practical FL environments because the effectiveness of such GLAs is limited to overly relaxed conditions, such as small batch sizes and knowledge of clients' data distributions. This paper bridges this critical gap by empirically demonstrating that clients' data can still be effectively reconstructed, even within realistic FL environments. Upon revisiting GLAs, we recognize that their performance failures stem from their inability to handle the gradient matching problem. To alleviate the performance bottlenecks identified above, we develop FedLeak, which introduces two novel techniques, partial gradient matching and gradient regularization. Moreover, to evaluate the performance of FedLeak in real-world FL environments, we formulate a practical evaluation protocol grounded in a thorough review of extensive FL literature and industry practices. Under this protocol, FedLeak can still achieve high-fidelity data reconstruction, thereby underscoring the significant vulnerability in FL systems and the urgent need for more effective defense methods.
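
Why shared gradients can leak raw data is easiest to see in a toy case. The sketch below is a textbook single-sample example for a linear model with a bias term, not FedLeak: it shows none of partial gradient matching, gradient regularization, or the realistic batch sizes the paper targets. All names and numbers are illustrative assumptions.

```python
def linear_grads(w, b, x, y):
    """Gradients of 0.5 * (w.x + b - y)**2 with respect to w and b."""
    err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
    return [err * xi for xi in x], err


def reconstruct_input(grad_w, grad_b):
    """Recover x from shared gradients, using grad_w = err * x and grad_b = err.

    Toy single-sample case only; real attacks must optimize a dummy
    input so that its gradients match the observed ones.
    """
    if grad_b == 0:
        raise ValueError("zero error: the gradients carry no information about x")
    return [g / grad_b for g in grad_w]
```

A client that sends `grad_w` and `grad_b` for one private example reveals that example exactly, which is the risk that batching and other realistic conditions were assumed to remove.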

[222] HSG-12M: A Large-Scale Spatial Multigraph Dataset

Xianquan Yan,Hakan Akgün,Kenji Kawaguchi,N. Duane Loh,Ching Hua Lee

Main category: cs.LG

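TL;DR: HSG-12M is the first large-scale spatial multigraph dataset. It retains multiple geometrically distinct paths between nodes, is generated from physics data, and lays the groundwork for geometry-aware graph learning.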

Details

Motivation: Existing graph benchmarks assume non-spatial, simple edges and so ignore physically distinct paths; a dataset that preserves this geometric diversity is needed. Method: The Poly2Graph pipeline converts spectral data of 1-D crystal Hamiltonians into spatial multigraphs, producing static and dynamic Hamiltonian spectral graphs. Result: HSG-12M contains 11.6 million static and 5.1 million dynamic graphs across 1401 characteristic-polynomial classes, and GNN benchmarks on it expose new challenges in learning from multi-edge geometry. Conclusion: HSG-12M provides a new resource for geometry-aware graph learning and shows the potential of spectral graphs as universal topological fingerprints of polynomials, vectors, and matrices. Abstract: Existing graph benchmarks assume non-spatial, simple edges, collapsing physically distinct paths into a single link. We introduce HSG-12M, the first large-scale dataset of $\textbf{spatial multigraphs}$, graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. HSG-12M contains 11.6 million static and 5.1 million dynamic $\textit{Hamiltonian spectral graphs}$ across 1401 characteristic-polynomial classes, derived from 177 TB of spectral potential data. Each graph encodes the full geometry of a 1-D crystal's energy spectrum on the complex plane, producing diverse, physics-grounded topologies that transcend conventional node-coordinate datasets. To enable future extensions, we release $\texttt{Poly2Graph}$: a high-performance, open-source pipeline that maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with popular GNNs expose new challenges in learning from multi-edge geometry at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for geometry-aware graph learning and new opportunities for data-driven scientific discovery in condensed matter physics and beyond.
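
The defining property of a spatial multigraph, parallel edges that each keep their own geometry, can be sketched with a small container. This is a hypothetical illustration of the data structure only; it is not the HSG-12M file format or the Poly2Graph API, and the class and method names are assumptions.

```python
from collections import defaultdict
from math import dist


class SpatialMultigraph:
    """Undirected multigraph whose parallel edges each keep a polyline."""

    def __init__(self):
        # (u, v) with u <= v  ->  list of polylines, one per parallel edge
        self.edges = defaultdict(list)

    @staticmethod
    def _key(u, v):
        return (u, v) if u <= v else (v, u)

    def add_edge(self, u, v, polyline):
        """Store one edge between u and v as a list of (x, y) points."""
        self.edges[self._key(u, v)].append(list(polyline))

    def edge_lengths(self, u, v):
        """Geometric length of every distinct edge between u and v."""
        return [
            sum(dist(p, q) for p, q in zip(line, line[1:]))
            for line in self.edges.get(self._key(u, v), [])
        ]
```

A simple-graph benchmark would collapse a straight edge and a detour between the same two nodes into one link; here `edge_lengths` returns one entry per trajectory.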

[223] Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers

Simon Roschmann,Quentin Bouniot,Vasilii Feofanov,Ievgen Redko,Zeynep Akata

Main category: cs.LG

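TL;DR: TiViT converts time series into images and uses pretrained Vision Transformers (ViTs) to improve time series classification, achieving state-of-the-art performance on standard benchmarks.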

Details

Motivation: Time series classification matters in healthcare and industry, but the scarcity of public datasets limits the development of time series foundation models (TSFMs). Method: Proposes Time Vision Transformer (TiViT), which converts time series into images to exploit the representations of frozen, pretrained ViTs. Result: TiViT achieves state-of-the-art performance on standard time series classification benchmarks, and intermediate layers with high intrinsic dimension prove the most effective. Conclusion: TiViT shows that vision representations can be reused in a non-visual domain and that they complement TSFM representation spaces. Abstract: Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal yet another direction for reusing vision representations in a non-visual domain.
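
One simple way to turn a 1-D series into a 2-D grid that an image model can patch is to normalize it and stack consecutive segments as rows. The sketch below assumes exactly that; the abstract does not specify TiViT's actual transform (segment length, overlap, resizing, channels), so `row_len`, the min-max scaling, and the zero padding are illustrative assumptions.

```python
def series_to_image(series, row_len):
    """Stack consecutive segments of a 1-D series into a 2-D grid in [0, 1].

    Hypothetical preprocessing, not the transform used by TiViT.
    """
    if row_len <= 0 or not series:
        raise ValueError("need a non-empty series and row_len > 0")
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # a constant series maps to all zeros
    scaled = [(v - lo) / span for v in series]
    # Zero-pad the tail so that the last row is complete.
    scaled += [0.0] * ((-len(scaled)) % row_len)
    return [scaled[i:i + row_len] for i in range(0, len(scaled), row_len)]
```

Each row holds one segment, so a square ViT patch covers several time steps from several neighboring segments at once.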

cs.CY [Back]

[224] Surgeons Awareness, Expectations, and Involvement with Artificial Intelligence: a Survey Pre and Post the GPT Era

Lorenzo Arboit,Dennis N. Schneider,Toby Collins,Daniel A. Hashimoto,Silvana Perretta,Bernard Dallemagne,Jacques Marescaux,EAES Working Group,Nicolas Padoy,Pietro Mascagni

Main category: cs.CY

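TL;DR: Using global surveys from 2021 and 2024, the study analyzes surgeons' awareness of, expectations for, and involvement with AI in surgery. Awareness of AI courses and course attendance rose, but foundational knowledge remained limited, ethical concerns gained prominence, and infrastructure was the main obstacle.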

Details

Motivation: Explore the potential of AI in surgery and how it shapes surgeons' perceptions and practice. Method: Two global cross-sectional surveys, in 2021 and 2024, assessed surgeons' AI awareness, expectations, involvement, and (in 2024 only) ethical concerns. Result: Awareness of AI courses rose from 14.5% to 44.6% and course attendance from 12.9% to 23%, but foundational knowledge remained limited; ethical concerns gained prominence; infrastructure was the main obstacle; surgeons remained optimistic about AI. Conclusion: Although surgeons view AI positively, knowledge gaps and implementation barriers still need to be addressed through education, ethical frameworks, and infrastructure development. Abstract: Artificial Intelligence (AI) is transforming medicine, with generative AI models like ChatGPT reshaping perceptions of its potential. This study examines surgeons' awareness, expectations, and involvement with AI in surgery through comparative surveys conducted in 2021 and 2024. Two cross-sectional surveys were distributed globally in 2021 and 2024, the first before an IRCAD webinar and the second during the annual EAES meeting. The surveys assessed demographics, AI awareness, expectations, involvement, and ethics (2024 only). The surveys collected a total of 671 responses from 98 countries, 522 in 2021 and 149 in 2024. Awareness of AI courses rose from 14.5% in 2021 to 44.6% in 2024, while course attendance increased from 12.9% to 23%. Despite this, familiarity with foundational AI concepts remained limited. Expectations for AI's role shifted in 2024, with hospital management gaining relevance. Ethical concerns gained prominence, with 87.2% of 2024 participants emphasizing accountability and transparency. Infrastructure limitations remained the primary obstacle to implementation. Interdisciplinary collaboration and structured training were identified as critical for successful AI adoption. Optimism about AI's transformative potential remained high, with 79.9% of respondents believing AI would positively impact surgery and 96.6% willing to integrate AI into their clinical practice. Surgeons' perceptions of AI are evolving, driven by the rise of generative AI and advancements in surgical data science. While enthusiasm for integration is strong, knowledge gaps and infrastructural challenges persist. Addressing these through education, ethical frameworks, and infrastructure development is essential.

cs.SD [Back]

[225] Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Ailin Huang,Bingxin Li,Bruce Wang,Boyong Wu,Chao Yan,Chengli Feng,Heng Wang,Hongyu Zhou,Hongyuan Wang,Jingbei Li,Jianjian Sun,Joanna Wang,Mingrui Chen,Peng Liu,Ruihang Miao,Shilei Jiang,Tian Fei,Wang You,Xi Chen,Xuerui Yang,Yechang Huang,Yuxiang Zhang,Zheng Ge,Zheng Gong,Zhewei Huang,Zixin Zhang,Bin Wang,Bo Li,Buyun Ma,Changxin Miao,Changyi Wan,Chen Xu,Dapeng Shi,Dingyuan Hu,Enle Liu,Guanzhe Huang,Gulin Yan,Hanpeng Hu,Haonan Jia,Jiahao Gong,Jiaoren Wu,Jie Wu,Jie Yang,Junzhe Lin,Kaixiang Li,Lei Xia,Longlong Gu,Ming Li,Nie Hao,Ranchen Ming,Shaoliang Pang,Siqi Liu,Song Yuan,Tiancheng Cao,Wen Li,Wenqing He,Xu Zhao,Xuelin Zhang,Yanbo Yu,Yinmin Zhong,Yu Zhou,Yuanwei Liang,Yuanwei Lu,Yuxiang Yang,Zidong Yang,Zili Zhang,Binxing Jiao,Heung-Yeung Shum,Jiansheng Chen,Jing Li,Xiangyu Zhang,Xinhao Zhang,Yibo Zhu,Daxin Jiang,Shuchang Zhou,Chen Hu

Main category: cs.SD

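TL;DR: Step-Audio-AQAA is an end-to-end large audio-language model that combines a dual-codebook audio tokenizer, a 130-billion-parameter LLM, and a neural vocoder to produce high-fidelity speech, making audio interaction more natural.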

Details

Motivation: Existing large audio-language models rely on text outputs, which limits their ability to generate natural speech responses directly and hinders seamless audio interaction. Method: The model uses a dual-codebook audio tokenizer for feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder; post-training uses interleaved text and audio token output and combines DPO with model merging. Result: On the StepEval-Audio-360 benchmark, Step-Audio-AQAA excels in key areas, especially speech control, outperforming state-of-the-art LALMs. Conclusion: The work offers a viable approach to end-to-end audio-language models and highlights the critical role of the token-based vocoder in AQAA task performance. Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.
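
For readers unfamiliar with Direct Preference Optimization, the per-pair objective can be written in a few lines. The sketch below is the standard DPO loss on sequence log-probabilities; how Step-Audio-AQAA applies it to interleaved text and audio tokens, and which hyperparameters it uses, is not stated in the abstract, so the argument names and `beta=0.1` are assumptions.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy (pol_*) and
    the frozen reference model (ref_*). beta is an illustrative value.
    """
    # How much more the policy favors the chosen response than the
    # reference does, relative to the rejected response.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree, the loss is log 2; it falls as the policy shifts probability toward the preferred response.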

q-fin.ST [Back]

[226] EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Issa Sugiura,Takashi Ishida,Taro Makino,Chieko Tazuke,Takanori Nakagawa,Kosuke Nakago,David Ha

Main category: q-fin.ST

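TL;DR: The paper introduces EDINET-Bench, an open-source Japanese financial benchmark for evaluating large language models on financial tasks; current models perform poorly, pointing to the need for domain-specific adaptation.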

Details

Motivation: Financial analysis lacks challenging datasets, especially for Japanese financial data, which hinders the development and evaluation of large language models in this domain. Method: EDINET-Bench is built by downloading annual reports from the past 10 years from Japan's EDINET system and automatically assigning labels for each task. Result: Even state-of-the-art LLMs perform only slightly better than logistic regression on fraud detection and earnings forecasting. Conclusion: The study highlights the challenges LLMs face in financial applications, calls for domain-specific adaptation, and releases the dataset and code to support future research. Abstract: Financial analysis presents complex challenges that could leverage large language model (LLM) capabilities. However, the scarcity of challenging financial datasets, particularly for Japanese financial data, impedes academic innovation in financial analytics. As LLMs advance, this lack of accessible research resources increasingly hinders their development and evaluation in this specialized domain. To address this gap, we introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate the performance of LLMs on challenging financial tasks including accounting fraud detection, earnings forecasting, and industry prediction. EDINET-Bench is constructed by downloading annual reports from the past 10 years from Japan's Electronic Disclosure for Investors' NETwork (EDINET) and automatically assigning labels corresponding to each evaluation task. Our experiments reveal that even state-of-the-art LLMs struggle, performing only slightly better than logistic regression in binary classification for fraud detection and earnings forecasting. These results highlight significant challenges in applying LLMs to real-world financial applications and underscore the need for domain-specific adaptation. Our dataset, benchmark construction code, and evaluation code are publicly available to facilitate future research in finance with LLMs.

cs.IR [Back]

[227] Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval

Abdellah Ghassel,Ian Robinson,Gabriel Tanase,Hal Cooper,Bryan Thompson,Zhen Han,Vassilis N. Ioannidis,Soji Adeshina,Huzefa Rangwala

Main category: cs.IR

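TL;DR: The paper proposes the Hierarchical Lexical Graph (HLG) and two retrievers, StatementGraphRAG and TopicGraphRAG, to address the weakness of RAG in retrieval across semantically distant documents, and introduces a synthetic dataset generation pipeline for evaluating multi-hop retrieval.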

Details

Motivation: Existing RAG methods falter when answers must be pieced together across semantically distant documents; a more effective way to integrate scattered information is needed. Method: Proposes HLG, a three-tier index, and builds two retrievers on it: StatementGraphRAG (fine-grained retrieval) and TopicGraphRAG (coarse-grained retrieval). A synthetic dataset generation pipeline is also designed to evaluate multi-hop retrieval systems. Result: Across five datasets, the methods achieve an average relative improvement of 23.1% in retrieval recall and correctness over naive chunk-based RAG. Conclusion: HLG and its retrievers substantially improve cross-document retrieval, and the open-source library supports follow-up research. Abstract: Retrieval-Augmented Generation (RAG) grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG, achieving an average relative improvement of 23.1% in retrieval recall and correctness. An open-source Python library is available at https://github.com/awslabs/graphrag-toolkit.
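
The three tiers described in the abstract (provenance, topics, entity links) can be pictured with a small in-memory structure. This is a hypothetical sketch, not the data model or API of the graphrag-toolkit library: all class, field, and method names are assumptions, topics are supplied by hand rather than clustered, and a plain breadth-first expansion stands in for the entity-aware beam search.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    text: str
    source: str          # tier (i): where the proposition came from
    topic: str           # tier (ii): topic label (hand-assigned here)
    entities: frozenset  # tier (iii): entities used to link statements


@dataclass
class LexicalGraph:
    statements: list = field(default_factory=list)

    def add(self, text, source, topic, entities):
        self.statements.append(Statement(text, source, topic, frozenset(entities)))

    def neighbors(self, idx):
        """Indices of statements sharing at least one entity with idx."""
        ents = self.statements[idx].entities
        return [i for i, s in enumerate(self.statements)
                if i != idx and ents & s.entities]

    def multi_hop_sources(self, idx, hops=2):
        """Sources reachable from statement idx within `hops` entity links."""
        seen, frontier = {idx}, {idx}
        for _ in range(hops):
            frontier = {n for i in frontier for n in self.neighbors(i)} - seen
            seen |= frontier
        return sorted({self.statements[i].source for i in seen})
```

Two statements from different documents that share no vocabulary can still be connected through a shared entity, which is the kind of cross-document path a chunk-similarity retriever misses.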

Table of Contents

cs.CV [Back]

[1] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

[2] CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

[3] IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation

[4] Spectral Domain Neural Reconstruction for Passband FMCW Radars

[5] Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion Models in a Vision-Language-Action Framework

[6] Open World Scene Graph Generation using Vision Language Models

[7] Generative Learning of Differentiable Object Models for Compositional Interpretation of Complex Scenes

[8] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

[9] A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

[10] Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation

[11] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence

[12] A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

[13] Highly Compressed Tokenizer Can Generate Without Training

[14] Seeing Voices: Generating A-Roll Video from Audio with Mirage

[15] SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

[16] OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal

[17] Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive Gating

[18] Locating Tennis Ball Impact on the Racket in Real Time Using an Event Camera

[19] How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models

[20] MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding

[21] Image Demoiréing Using Dual Camera Fusion on Mobile Phones

[22] SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

[23] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation

[24] Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring

[25] Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance

[26] MARMOT: Masked Autoencoder for Modeling Transient Imaging

[27] Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization

[28] MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding

[29] Robust Visual Localization via Semantic-Guided Multi-Scale Transformer

[30] LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4$\times$RTX 4090s

[31] TrajFlow: Multi-modal Motion Prediction via Flow Matching

[32] Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs

[33] From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

[34] Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement

[35] Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection

[36] Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations

[37] Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

[38] Transformers Meet Hyperspectral Imaging: A Comprehensive Study of Models, Challenges and Open Problems

[39] Towards Class-wise Fair Adversarial Training via Anti-Bias Soft Label Distillation

[40] Data-Efficient Challenges in Visual Inductive Priors: A Retrospective

[41] SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything

[42] A Probability-guided Sampler for Neural Implicit Surface Rendering

[43] ECMNet:Lightweight Semantic Segmentation with Efficient CNN-Mamba Network

[44] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

[45] SurfR: Surface Reconstruction with Multi-scale Attention

[46] Orientation Matters: Making 3D Generative Models Orientation-Aligned

[47] Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization

[48] Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping

[49] LLaVA-c: Continual Improved Visual Instruction Tuning

[50] ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

[51] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities

[52] VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism

[53] MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning

[54] ArrowPose: Segmentation, Detection, and 5 DoF Pose Estimation Network for Colorless Point Clouds

[55] TraGraph-GS: Trajectory Graph-based Gaussian Splatting for Arbitrary Large-Scale Scene Rendering

[56] SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

[57] Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces

[58] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba

[59] RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation

[60] Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting

[61] Landsat-Bench: Datasets and Benchmarks for Landsat Foundation Models

[62] HomographyAD: Deep Anomaly Detection Using Self Homography Learning

[63] A PDE-Based Image Dehazing Method via Atmospheric Scattering Theory

[64] Flow Diverse and Efficient: Learning Momentum Flow Matching via Stochastic Velocity Field Sampling

[65] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation

[66] HiSin: Efficient High-Resolution Sinogram Inpainting via Resolution-Guided Progressive Inference

[67] Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought

[68] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

[69] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

[70] Spatial Transcriptomics Expression Prediction from Histopathology Based on Cross-Modal Mask Reconstruction and Contrastive Learning

[71] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

[72] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

[73] Product of Experts for Visual Generation

[74] WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos

[75] MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis

[76] Hyperbolic Dual Feature Augmentation for Open-Environment

[77] SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping

[78] Inherently Faithful Attention Maps for Vision Transformers