Motivation: Large language models perform well in general domains but underperform in specialized domains such as law because they lack domain-specific pretraining. Legal texts are also typically long and complex, which makes them hard to process effectively.

Method: In a zero-shot setting, experiments on three Indian legal judgment prediction datasets analyze how document reorganization, rhetorical role definitions, and emulation of the courts' step-by-step reasoning affect model performance.

Result: Organizing the data or explaining key legal terms significantly improves model performance, with F1 gains over the baseline ranging from about 1.5% at minimum to 4.36% at maximum.

Conclusion: Presenting information in a structured way, explaining terminology, and emulating human reasoning can effectively strengthen LLM understanding and reasoning in the legal domain without full in-domain training.

Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.
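To make experiment (i) concrete, here is a minimal sketch of reorganizing a judgment by rhetorical role before it is placed in a prompt. The role names, their ordering, and the `reorganize_by_role` helper are all hypothetical; the abstract does not specify the role inventory or the prompt layout the authors used.

```python
# Hypothetical role labels and ordering; the abstract does not list them.
ROLE_ORDER = ["facts", "issues", "arguments", "reasoning"]

def reorganize_by_role(segments):
    """Group (role, text) pairs by role and emit them in ROLE_ORDER.

    Roles outside ROLE_ORDER are kept, after the known ones, in first-seen order.
    """
    grouped = {}
    for role, text in segments:
        grouped.setdefault(role, []).append(text)
    ordered = [r for r in ROLE_ORDER if r in grouped]
    ordered += [r for r in grouped if r not in ROLE_ORDER]
    blocks = []
    for role in ordered:
        # One labeled block per role, sentences joined in original order.
        blocks.append(role.upper() + ":\n" + " ".join(grouped[role]))
    return "\n\n".join(blocks)

# Toy, invented sentences in their original (interleaved) document order.
segments = [
    ("reasoning", "The court finds the notice was defective."),
    ("facts", "The appellant was served a notice in 2010."),
    ("issues", "Whether the notice was valid."),
    ("facts", "He challenged it before the tribunal."),
]
prompt_body = reorganize_by_role(segments)
```

The resulting `prompt_body` groups both fact sentences under one heading, so the model sees each role as a contiguous block rather than scattered through a long document.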

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle, Reid Zaffino

Main category: cs.CV

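TL;DR: This paper proposes a novel operator called "warping" that unifies the various dynamic sampling methods in deep learning and analyzes it theoretically. The analysis reveals an asymmetry between the forward and backward passes and a fundamental difference from traditional convolution operators, and it examines the conditions for stable training of dynamic sampling networks as well as discretization effects.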

Details

Motivation: Existing dynamic sampling mechanisms perform well across many computer vision models but lack a unified framework for theoretical analysis. Establishing a unified view and understanding their behavior in depth requires a general form that subsumes existing methods.

Method: Proposes the "warping" operator as a general form of dynamic sampling, analyzes it statistically by modeling the inputs as IID variables and as homogeneous random fields, and introduces a novel loss landscape visualization based on gradient updates to study learning behavior.

Result: Shows that warping can reconstruct deformable convolutions, active convolutional units, and spatial transformer networks; finds a unique asymmetry between the forward and backward passes; shows that these operators form a new class orthogonal to traditional translation-invariant convolutions; gives conditions that ensure stable training; and analyzes the statistical effects of discretization.

Conclusion: Dynamic sampling mechanisms represent an entirely new class of operations, and warping offers a concise, generalizable theoretical framework for analyzing such models, which helps in designing more stable and efficient dynamic network architectures.

Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, statistical analysis of discretization effects are studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.
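As an illustration of what dynamic sampling means in practice, the sketch below resamples a 1D signal at positions shifted by per-element offsets, using linear interpolation. It is not the paper's formal definition of warping: the function name `warp_1d`, the 1D setting, and the clamping boundary rule are all assumptions made to keep the example small.

```python
import math

def warp_1d(signal, offsets):
    """Sample `signal` at positions i + offsets[i] using linear interpolation.

    Positions are clamped to the valid index range (an assumed boundary rule).
    """
    n = len(signal)
    out = []
    for i, off in enumerate(offsets):
        pos = min(max(i + off, 0.0), n - 1.0)  # clamp to [0, n - 1]
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Blend the two neighbouring samples by the fractional position.
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out

signal = [0.0, 10.0, 20.0, 30.0]
identity = warp_1d(signal, [0.0, 0.0, 0.0, 0.0])  # zero offsets leave the signal unchanged
shifted = warp_1d(signal, [0.5, 0.5, 0.5, 0.5])   # half-step shift; the last sample is clamped
```

In a network, the offsets would themselves be predicted from the input, which is what makes the sampling dynamic. Here they are fixed constants so the behavior is easy to check by hand.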

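[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh, Devjyoti Chakraborty, Lakshmish Ramaswamy, Suchendra M. Bhandarkar, In Kee Kim, Nancy O'Hare, Deepak Mishra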

Main category: cs.CV

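TL;DR: This paper proposes Δ-NeRF, a modular residual framework for incremental NeRF refinement, suited to settings where data arrives as a stream (such as satellite remote sensing). Using a residual controller, an uncertainty-aware gating mechanism, and a view selection strategy, it updates efficiently without retraining or storing historical data, and it uses knowledge distillation to compress the model, substantially improving training efficiency and performance.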

Details

Motivation: Existing NeRF methods must be retrained when new views are added, which makes them a poor fit for real settings where data keeps arriving (such as satellite Earth observation), and they are prone to catastrophic forgetting. This calls for a method that supports incremental learning, avoids retraining, and retains prior knowledge.

Method: Proposes Δ-NeRF: (1) a residual controller that injects per-layer corrections into a frozen base NeRF; (2) an uncertainty-aware gating mechanism that adaptively fuses base and corrected predictions to prevent overcorrection; (3) a view selection strategy that reduces the amount of training data; and (4) knowledge distillation that compresses the enhanced model into a student network 20% of the original size.

Result: Experiments on satellite imagery show that Δ-NeRF performs comparably to joint training while cutting training time by 30-42%. PSNR improves by up to 43.5% over naive fine-tuning, and some metrics exceed joint training. View selection reduces training data by up to 47% without hurting performance.

Conclusion: Δ-NeRF achieves efficient incremental NeRF refinement, addresses catastrophic forgetting, and balances performance, efficiency, and model size. It is particularly suited to long-term, continuous-observation applications such as satellite terrain analysis.

Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47\% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20\% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42\%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5\% in PSNR over naive fine-tuning and surpassing joint training on some metrics.
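The interplay between the residual correction and the gate can be sketched in a few lines. This is only an illustration of the general idea of gating a residual by uncertainty. The gate formula `1 / (1 + uncertainty)`, the scalar predictions, and both function names are assumptions, not the formulation used in Δ-NeRF.

```python
def gate_from_uncertainty(uncertainty):
    """Map a non-negative uncertainty to a gate in (0, 1]. Assumed form."""
    return 1.0 / (1.0 + uncertainty)

def gated_refinement(base_pred, correction, uncertainty):
    """Add a residual correction to a frozen base prediction, scaled by the gate.

    A gate near 1 trusts the correction; a gate near 0 falls back to the base.
    """
    g = gate_from_uncertainty(uncertainty)
    return base_pred + g * correction

confident = gated_refinement(1.0, 2.0, 0.0)  # gate = 1.0, full correction applied
hedged = gated_refinement(1.0, 2.0, 1.0)     # gate = 0.5, half the correction applied
```

Because the base prediction is never modified, a large uncertainty simply recovers the original model's output, which is one way to read "preventing overcorrection".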

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran

Main category: cs.CV

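TL;DR: Proposes the Split-then-Merge (StM) framework, which improves control and data efficiency in generative video composition by self-decomposing and recombining unlabeled videos.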

Details

Motivation: Addresses the dependence on annotated data or handcrafted rules in generative video composition, along with its data scarcity problem.

Method: Splits a large corpus of unlabeled videos into dynamic foreground and background layers and learns by self-composing them; introduces a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss.

Result: Outperforms current state-of-the-art methods on quantitative benchmarks and in human and VLLM-based qualitative evaluations.

Conclusion: StM effectively learns complex compositional dynamics, yielding more realistic video generation and better controllability.

Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in humans/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi

Main category: cs.CV

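TL;DR: Sphinx is a synthetic environment for visual perception and reasoning that covers 25 task types. Evaluation shows that current state-of-the-art large models fall well below human performance, while reinforcement learning with verifiable rewards improves performance substantially.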

Details

Motivation: Advancing visual and multimodal reasoning requires a controllable, systematic benchmark environment with verifiable ground-truth solutions.

Method: Proposes the Sphinx environment, which procedurally generates puzzles from a variety of visual elements, and applies reinforcement learning with verifiable rewards (RLVR) to improve model performance.

Result: State-of-the-art LVLMs such as GPT-5 reach only 51.1% accuracy on Sphinx, well below human performance; RLVR substantially improves model performance on this benchmark and on external benchmarks.

Conclusion: Sphinx provides a scalable, evaluable platform for visual reasoning, and RLVR is a promising way to improve the reasoning ability of multimodal models.

Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
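The pairing of procedural generation with a verifiable answer can be shown with a toy symmetry-detection task. This sketch is not Sphinx's generator: the binary-grid puzzle, the function names, and the 0/1 reward are hypothetical stand-ins chosen to show why a programmatically checkable ground truth makes both evaluation and reward computation trivial.

```python
import random

def make_symmetry_puzzle(rng, size=4):
    """Generate a binary grid and its ground-truth left-right symmetry label."""
    half = size // 2
    want_symmetric = rng.random() < 0.5
    grid = []
    for _ in range(size):
        left = [rng.randint(0, 1) for _ in range(half)]
        if want_symmetric:
            right = left[::-1]  # mirror the left half
        else:
            right = [rng.randint(0, 1) for _ in range(half)]
        grid.append(left + right)
    # Recompute the label from the grid itself: a random right half can
    # still be symmetric by chance, so the grid is the source of truth.
    answer = all(row == row[::-1] for row in grid)
    return grid, answer

def verifiable_reward(model_answer, ground_truth):
    """Binary reward: 1.0 for an exact match with the ground truth, else 0.0."""
    return 1.0 if model_answer == ground_truth else 0.0

rng = random.Random(0)
grid, answer = make_symmetry_puzzle(rng)
reward_correct = verifiable_reward(answer, answer)
reward_wrong = verifiable_reward(not answer, answer)
```

Because the label is derived from the generated grid rather than annotated by hand, any number of puzzles can be produced with guaranteed-correct answers.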

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba, Andrew D. Bagdanov

Main category: cs.CV

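TL;DR: This paper proposes OVI, a training-free visual inversion method that replaces the expensive text-to-image prior network in diffusion models, and introduces two new constraints to improve generated image quality. Experiments show the method is comparable to the current best methods on several metrics.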

Details

Motivation: Existing text-to-image diffusion models rely on prior networks that are computationally expensive and need large amounts of training data. This work asks whether such trained priors can be avoided entirely.

Method: Uses Optimization-based Visual Inversion (OVI), which initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it by maximizing cosine similarity with the text prompt embedding; proposes two new constraints, a Mahalanobis-based regularizer and a Nearest-Neighbor loss, to guide the optimization.

Result: Experiments on Kandinsky 2.2 show that using only the text embedding as a prior yields inflated scores on T2I-CompBench++, whereas OVI with the Nearest-Neighbor constraint markedly improves visual fidelity, with quantitative scores matching or exceeding the state-of-the-art data-efficient prior.

Conclusion: OVI is a viable training-free, data-free alternative to a prior, with competitive performance. The work also exposes problems with current evaluation benchmarks and indicates that the direction merits further study.

Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
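The core optimization loop, a random start followed by iterative maximization of cosine similarity, can be sketched without any deep learning library. This is a simplified stand-in, not the authors' implementation: it optimizes one small vector directly against the target (in the real system the similarity would be computed through an encoder), uses plain gradient ascent, renormalizes after every step (an assumed detail that keeps the step size meaningful), and omits both proposed regularizers. All names and hyperparameters are hypothetical.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def cosine_grad(z, t):
    """Analytic gradient of cosine(z, t) with respect to z."""
    nz, nt, c = norm(z), norm(t), cosine(z, t)
    return [ti / (nz * nt) - c * zi / (nz * nz) for zi, ti in zip(z, t)]

def optimize_latent(text_emb, steps=200, lr=0.5, seed=0):
    """Gradient ascent on cosine similarity from a random initialization."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in text_emb]  # random starting latent
    for _ in range(steps):
        g = cosine_grad(z, text_emb)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
        n = norm(z)
        z = [zi / n for zi in z]  # assumed: project back to unit norm
    return z

text_emb = [0.2, -1.0, 0.5, 0.3]  # toy stand-in for a text prompt embedding
z = optimize_latent(text_emb)
final_similarity = cosine(z, text_emb)
```

Unconstrained, this loop simply drives the latent toward the text embedding itself, which echoes the abstract's point that text-embedding-like priors can score well while lacking perceptual quality, and motivates the regularizers the paper adds.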

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

Main category: cs.CV

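TL;DR: Proposes the Canvas-to-Image framework, which consolidates multiple heterogeneous control signals into a unified canvas interface to achieve high-fidelity image generation.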

Details

Motivation: Existing diffusion models struggle to follow user intent precisely under multimodal, compositional control (text, reference images, pose, layout, and so on).

Method: Encodes the various control signals into a single composite canvas image and uses a Multi-Task Canvas Training strategy to jointly optimize the model's understanding and integration of heterogeneous controls within one unified paradigm.

Result: Significantly outperforms existing methods on multi-person composition, pose control, layout-constrained generation, and multi-control generation, with particularly strong identity preservation and control adherence.

Conclusion: Canvas-to-Image reproduces complex user intent with high fidelity and supports flexible, unified multimodal control of image generation.

Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

Table of Contents

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

[3] A centroid based framework for text classification in itsm environments

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

[7] Semantics Meet Signals: Dual Codebook Representationl Learning for Generative Recommendation

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

[15] Length-MAX Tokenizer for Language Models

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

[28] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

[29] Developing an Open Conversational Speech Corpus for the Isan Language

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

[36] A Systematic Study of Model Merging Techniques in Large Language Models

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

cs.CV

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

[51] Foundry: Distilling 3D Foundation Models for the Edge

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

[54] Text-Guided Semantic Image Encoder

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

[58] Intriguing Properties of Dynamic Sampling Networks

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

[60] Layer-Aware Video Composition via Split-then-Merge

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

[66] Estimating Fog Parameters from a Sequence of Stereo Images

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

[69] GaINeR: Geometry-Aware Implicit Network Representation

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

[71] Smooth regularization for efficient video recognition

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI