Motivation: Large language models perform well in general domains but underperform in specialized domains such as law because they lack domain-specific pretraining; legal texts are also typically long and complex, which makes them hard to process effectively. Method: In a zero-shot setting, the model is improved in three ways: (i) reorganizing documents by rhetorical role; (ii) defining rhetorical roles to introduce legal terminology; (iii) emulating the court's step-by-step reasoning process. Experiments are run on three Indian legal judgment prediction datasets. Result: Organizing the data or explaining key legal terms significantly improves model performance, with F1 gains over the baseline ranging from about 1.5% at minimum to 4.36% at maximum. Conclusion: Presenting information in a structured way, defining terminology, and emulating human reasoning can effectively strengthen LLM understanding and reasoning on complex legal tasks. Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.
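As a rough illustration of interventions (i) and (ii) from the abstract above, the sketch below groups pre-labelled sentences by rhetorical role and prepends short role definitions to a zero-shot prompt. The role names, the definitions, the example sentences, and the `build_prompt` helper are all hypothetical placeholders, not the paper's label set, data, or prompt.

```python
from collections import OrderedDict

# Hypothetical role inventory and definitions (assumptions for illustration only).
ROLE_DEFINITIONS = OrderedDict([
    ("Facts", "Events and circumstances that led to the case."),
    ("Arguments", "Claims made by the petitioner and the respondent."),
    ("Reasoning", "The court's analysis of the arguments."),
])


def reorganize_by_role(labelled_sentences):
    """Group (role, sentence) pairs by role, keeping the order within each role."""
    grouped = OrderedDict((role, []) for role in ROLE_DEFINITIONS)
    for role, sentence in labelled_sentences:
        if role in grouped:  # sentences with unknown roles are dropped
            grouped[role].append(sentence)
    return grouped


def build_prompt(labelled_sentences):
    """Build a zero-shot prompt: role definitions first, then the regrouped text."""
    grouped = reorganize_by_role(labelled_sentences)
    lines = ["Definitions:"]
    lines += [f"- {role}: {text}" for role, text in ROLE_DEFINITIONS.items()]
    for role, sentences in grouped.items():
        if sentences:
            lines.append(f"[{role}]")
            lines.extend(sentences)
    lines.append("Question: Should the appeal be accepted or rejected?")
    return "\n".join(lines)


# Made-up example document; the sentences arrive in their original order.
document = [
    ("Arguments", "The appellant argues the notice was invalid."),
    ("Facts", "The appellant was served a notice in 2015."),
    ("Reasoning", "The notice complied with the statute."),
]
prompt = build_prompt(document)
```

Keeping the definitions and the regrouping in separate functions makes it easy to switch either one off, which mirrors how the abstract treats them as separate experiments.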

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle,Reid Zaffino

Main category: cs.CV

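TL;DR: This paper proposes a new operator called "warping" that unifies the various dynamic sampling methods in deep learning and analyzes it theoretically, revealing an asymmetry between its forward and backward passes and its fundamental difference from conventional convolution operators, and also examining the conditions for stable training of dynamic sampling networks and the effects of discretization.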

Details

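Motivation: Existing dynamic sampling mechanisms perform well in many computer vision models but lack a unified framework for theoretical analysis. Establishing a unified view and a deeper understanding of these methods requires a more fundamental, analyzable general operator. Method: The "warping" operator is proposed as a general form of dynamic sampling. It is analyzed theoretically by modeling the inputs statistically as IID variables and as homogeneous random fields, and numerical experiments are used to study forward and backward pass behavior, discretization effects, and training stability. A novel loss landscape visualization based on gradient updates is also introduced. Result: Warping is shown to reconstruct architectures such as deformable convolutions, active convolutional units, and spatial transformer networks; a distinctive asymmetry between the forward and backward passes of dynamic sampling mechanisms is identified; these mechanisms are shown to form a class of operators orthogonal to conventional translation-invariant convolutions; conditions that ensure stable training are given; the statistical effects of discretization are analyzed; and a new loss landscape visualization technique is proposed. Conclusion: Dynamic sampling mechanisms can be modeled uniformly through the warping operator, which has a distinctive mathematical structure and training dynamics that set it apart from conventional convolution; the paper provides a theoretical basis and practical tools for designing and training such models. Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, statistical analysis of discretization effects is studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.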
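To make the idea of dynamic sampling concrete, here is a minimal 1D sketch in which each output position reads the input at a shifted, fractional location using linear interpolation. It is not the paper's formal definition of warping; the function names, the border clamping, and the hand-written offsets (which a network would normally predict) are assumptions for illustration.

```python
import math


def linear_sample(signal, position):
    """Sample a 1D signal at a fractional position with linear interpolation.

    Positions outside the signal are clamped to its borders (an assumed choice).
    """
    position = min(max(position, 0.0), len(signal) - 1.0)
    left = int(math.floor(position))
    right = min(left + 1, len(signal) - 1)
    frac = position - left
    return (1.0 - frac) * signal[left] + frac * signal[right]


def warp_1d(signal, offsets):
    """Output i reads the input at i + offsets[i] instead of at i."""
    return [linear_sample(signal, i + off) for i, off in enumerate(offsets)]


signal = [0.0, 10.0, 20.0, 30.0]
identity = warp_1d(signal, [0.0, 0.0, 0.0, 0.0])  # zero offsets leave the signal unchanged
shifted = warp_1d(signal, [0.5, 0.5, 0.5, 0.5])   # every output looks half a step to the right
```

Because the sampling locations depend on the offsets rather than on a fixed kernel, the output changes with the offsets even when the input stays the same, which is the property that separates this family from ordinary convolutions.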

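[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh,Devjyoti Chakraborty,Lakshmish Ramaswamy,Suchendra M. Bhandarkar,In Kee Kim,Nancy O'Hare,Deepak Mishra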

Main category: cs.CV

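TL;DR: This paper proposes $Δ$-NeRF, a modular residual framework for incrementally refining NeRFs without access to past data, suited to continuous-observation settings such as satellite remote sensing.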

Details

Motivation: Existing NeRF methods must be retrained when new views are added, cope poorly with data that arrives as a stream (as in satellite terrain analysis), and are prone to catastrophic forgetting. Method: $Δ$-NeRF keeps a frozen base NeRF and adds a residual controller that injects per-layer corrections; an uncertainty-aware gating mechanism adaptively combines the predictions; a view selection strategy reduces the amount of training data; and knowledge distillation compresses the model. Result: Experiments on satellite imagery show that $Δ$-NeRF matches joint training while cutting training time by 30-42%; it improves PSNR by up to 43.5% over naive fine-tuning, beats joint training on some metrics, and can be compressed to 20% of the original model size. Conclusion: $Δ$-NeRF effectively addresses catastrophic forgetting in incremental NeRF learning and delivers efficient, compact continual refinement, with potential for long-term deployment in practical settings such as remote sensing. Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5% in PSNR over naive fine-tuning and surpassing joint training on some metrics.
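The interplay between a frozen base prediction, a learned correction, and an uncertainty-aware gate can be sketched with scalars. The exponential gate, the `sharpness` parameter, and the function names below are assumptions chosen for illustration; the abstract does not specify the gating function $Δ$-NeRF uses.

```python
import math


def gate_from_uncertainty(uncertainty, sharpness=4.0):
    """Map a non-negative uncertainty score to a gate in (0, 1].

    Hypothetical choice: the gate is 1 at zero uncertainty and decays exponentially.
    """
    return math.exp(-sharpness * uncertainty)


def gated_prediction(base, residual, uncertainty, sharpness=4.0):
    """Blend the frozen base output with the refined output (base + residual).

    Equivalent to (1 - gate) * base + gate * (base + residual).
    """
    gate = gate_from_uncertainty(uncertainty, sharpness)
    return base + gate * residual


confident = gated_prediction(base=0.5, residual=0.2, uncertainty=0.0)  # correction fully applied
uncertain = gated_prediction(base=0.5, residual=0.2, uncertainty=5.0)  # correction almost ignored
```

Since the base value is never modified, an unreliable correction can at worst be gated out, which is one way to read the abstract's claim that gating prevents overcorrection.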

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara,Yujia Chen,Ming-Hsuan Yang,James M. Rehg,Wen-Sheng Chu,Du Tran

Main category: cs.CV

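TL;DR: Proposes the Split-then-Merge (StM) framework, which improves control and data efficiency in generative video composition by self-decomposing and recombining unlabeled videos.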

Details

Motivation: To address generative video composition's reliance on annotated data or handcrafted rules, along with its data scarcity problem. Method: A large corpus of unlabeled videos is split into dynamic foreground and background layers that are self-composed for training; a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss are introduced. Result: StM outperforms current state-of-the-art methods on quantitative benchmarks and in human and VLLM-based qualitative evaluations. Conclusion: StM effectively learns how dynamic subjects interact with scenes, enabling more realistic video generation. Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in human/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
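The split and merge steps can be pictured on a single tiny grayscale frame: a mask separates foreground from background, and alpha blending pastes the foreground onto another background. This is only a pixel-level sketch under assumed inputs (a given binary mask, one frame, made-up values); StM itself trains a generative model on video layers and does not reduce to fixed alpha blending.

```python
def split(frame, mask):
    """Split a frame into a masked foreground layer and the remaining background layer."""
    foreground = [[p * m for p, m in zip(p_row, m_row)] for p_row, m_row in zip(frame, mask)]
    background = [[p * (1.0 - m) for p, m in zip(p_row, m_row)] for p_row, m_row in zip(frame, mask)]
    return foreground, background


def composite(foreground, alpha, background):
    """Alpha-blend a foreground layer over a background, pixel by pixel."""
    return [
        [a * f + (1.0 - a) * b for f, a, b in zip(f_row, a_row, b_row)]
        for f_row, a_row, b_row in zip(foreground, alpha, background)
    ]


frame = [[1.0, 2.0], [3.0, 4.0]]
mask = [[1.0, 0.0], [0.0, 1.0]]       # 1.0 marks foreground pixels (assumed to be given)
new_scene = [[9.0, 9.0], [9.0, 9.0]]  # a background taken from some other clip

fg, bg = split(frame, mask)
merged = composite(fg, mask, new_scene)  # foreground pasted onto the new scene
restored = composite(fg, mask, frame)    # re-merging with the source frame recovers it
```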

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam,Saksham Aggarwal,Justin Yang Chae,Nidhi Rastogi

Main category: cs.CV

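TL;DR: Sphinx is a synthetic environment for visual perception and reasoning covering 25 task types. Evaluation shows that even the state-of-the-art GPT-5 reaches only 51.1% accuracy, well below human performance; reinforcement learning with verifiable rewards (RLVR) substantially improves model performance.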

Details

Motivation: To build a controllable environment with verifiable ground-truth solutions for systematically evaluating and improving the reasoning of vision-language models on core cognitive tasks. Method: Puzzles containing a variety of visual elements (motifs, charts, geometric shapes, and so on) are generated procedurally to build a benchmark of 25 task types, and reinforcement learning with verifiable rewards (RLVR) is used to improve model performance. Result: Current state-of-the-art large vision-language models (such as GPT-5) reach only 51.1% accuracy on Sphinx, well below human level, while RLVR effectively improves performance on these tasks and on external visual reasoning benchmarks. Conclusion: Sphinx provides a scalable, measurable testbed for visual reasoning, exposes the shortcomings of existing LVLMs in cognitive reasoning, and indicates that RLVR is an effective route to stronger multimodal reasoning. Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
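A procedurally generated puzzle with a checkable answer, paired with a binary reward, can be sketched as follows. The symmetry task is one of the task types the abstract names, but the grid encoding, the generator, and the reward function here are assumptions for illustration and are not taken from Sphinx.

```python
import random


def make_symmetry_puzzle(rng, size=4):
    """Generate a binary grid and the ground-truth answer to
    'is this grid left-right symmetric?'. `size` is assumed to be even."""
    half = [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)]
    grid = [row + row[::-1] for row in half]  # symmetric by construction
    if rng.random() < 0.5:
        r, c = rng.randrange(size), rng.randrange(size)
        grid[r][c] = 1 - grid[r][c]  # flipping one cell breaks the symmetry
    answer = all(row == row[::-1] for row in grid)  # recomputed from the final grid
    return grid, answer


def verifiable_reward(predicted, answer):
    """Binary reward: 1.0 if the prediction matches the ground truth, else 0.0."""
    return 1.0 if predicted == answer else 0.0


rng = random.Random(0)  # fixed seed so the puzzle set is reproducible
puzzles = [make_symmetry_puzzle(rng) for _ in range(20)]
```

Because the answer is computed from the generated grid rather than annotated by hand, the same generator can serve both as an evaluation set and as a reward source during training.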

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba,Andrew D. Bagdanov

Main category: cs.CV

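TL;DR: This paper proposes OVI, a training-free and data-free optimization method that replaces the expensive diffusion prior network in text-to-image generation, improves image quality with newly proposed constraints, and exposes a flaw in current evaluation benchmarks.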

Details

Motivation: Existing text-to-image diffusion models rely on prior networks that are computationally expensive and need large amounts of training data; this paper challenges that necessity and explores a more efficient, lightweight alternative. Method: Optimization-based Visual Inversion (OVI) initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize its cosine similarity with the text prompt embedding; two new constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, regularize the optimization. Result: Experiments on Kandinsky 2.2 show that OVI can serve as an alternative to traditional priors; the analysis finds a flaw in current benchmarks such as T2I-CompBench++, where using the text embedding alone as the prior already scores highly; the proposed constraints, the Nearest-Neighbor loss in particular, improve visual fidelity over that baseline, with quantitative scores comparable to or higher than the state-of-the-art data-efficient prior. Conclusion: OVI offers a promising training-free alternative to the prior in text-to-image generation, and the paper calls for existing evaluation benchmarks to be re-examined and improved. Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
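The core loop described above, starting from a random vector and pushing it toward higher cosine similarity with a fixed text embedding, can be sketched in plain Python. This is a toy stand-in: real OVI optimizes pseudo-tokens through an image encoder and adds the Mahalanobis or Nearest-Neighbor regularizers, none of which appear here. The 8-dimensional embedding, the step size, and the function names are assumptions.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def optimize_towards(text_embedding, seed=0, steps=100, lr=0.5):
    """Projected gradient ascent on cosine similarity from a random start."""
    rng = random.Random(seed)
    target = normalize(text_embedding)
    latent = normalize([rng.gauss(0.0, 1.0) for _ in target])
    history = [cosine(latent, target)]
    for _ in range(steps):
        cos = cosine(latent, target)
        # Gradient of cos(latent, target) for unit-norm vectors: target - cos * latent.
        grad = [t - cos * x for t, x in zip(target, latent)]
        # Take a step, then project back onto the unit sphere.
        latent = normalize([x + lr * g for x, g in zip(latent, grad)])
        history.append(cosine(latent, target))
    return latent, history


text_embedding = [1.0, 0.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0]  # made-up embedding
latent, history = optimize_towards(text_embedding)
```

With no regularizer, the loop simply converges to the text embedding's direction, which corresponds to the unconstrained case that the abstract reports as scoring well on benchmarks despite lower perceptual quality.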

Yusuf Dalva,Guocheng Gordon Qian,Maya Goldenberg,Tsai-Shien Chen,Kfir Aberman,Sergey Tulyakov,Pinar Yanardag,Kuan-Chieh Jackson Wang

Main category: cs.CV

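TL;DR: This paper proposes the Canvas-to-Image framework, which encodes heterogeneous control signals (such as text, pose, and layout) into a single composite canvas image to enable high-fidelity, multimodal control of image generation.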

Details

Motivation: Existing diffusion models struggle to stay faithful and compositional when handling text, reference images, spatial layouts, and other control inputs at the same time, and they lack a unified control mechanism. Method: Multiple control signals are merged into a single composite canvas image, and a Multi-Task Canvas Training strategy trains the diffusion model to understand these signals jointly within a unified learning paradigm. Result: The method is validated on multi-task datasets; experiments show it significantly outperforms existing state-of-the-art methods in identity preservation and control adherence, particularly in complex scenarios such as multi-person composition, pose control, and layout constraints. Conclusion: Canvas-to-Image provides a unified, general framework for multimodal control that integrates heterogeneous control signals and improves the fidelity and flexibility of diffusion models under complex user intent. Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
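The idea of folding several controls into one canvas can be illustrated with a character grid, where each labelled box is rasterized onto a shared canvas. A real canvas in this setting would be an RGB image carrying subject crops, pose skeletons, and box annotations; the grid representation, the `(label, box)` control format, and the function names below are hypothetical.

```python
def blank_canvas(width, height, fill="."):
    """Create an empty canvas as a list of rows."""
    return [[fill] * width for _ in range(height)]


def draw_box(canvas, box, label):
    """Write `label` into every cell of box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    for y in range(y0, y1):
        for x in range(x0, x1):
            canvas[y][x] = label
    return canvas


def compose_canvas(width, height, controls):
    """Rasterize a list of (label, box) controls onto one canvas, in order."""
    canvas = blank_canvas(width, height)
    for label, box in controls:
        draw_box(canvas, box, label)  # later controls overwrite earlier ones
    return ["".join(row) for row in canvas]


# Two hypothetical subjects, "A" and "B", each with a layout box.
controls = [("A", (0, 0, 2, 2)), ("B", (3, 1, 5, 3))]
canvas = compose_canvas(6, 3, controls)
```

The point of the single canvas is that the generator receives one input regardless of how many controls the user supplied, so adding a control type changes only the rasterization step.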

Table of Contents

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

[3] A centroid based framework for text classification in itsm environments

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

[7] Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

[15] Length-MAX Tokenizer for Language Models

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

[28] Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

[29] Developing an Open Conversational Speech Corpus for the Isan Language

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

[36] A Systematic Study of Model Merging Techniques in Large Language Models

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

cs.CV

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

[51] Foundry: Distilling 3D Foundation Models for the Edge

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

[54] Text-Guided Semantic Image Encoder

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

[58] Intriguing Properties of Dynamic Sampling Networks

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

[60] Layer-Aware Video Composition via Split-then-Merge

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

[66] Estimating Fog Parameters from a Sequence of Stereo Images

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

[69] GaINeR: Geometry-Aware Implicit Network Representation

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

[71] Smooth regularization for efficient video recognition

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI