.. py:currentmodule:: drjit

.. _changelog:

Changelog
#########

DrJit 1.2.0 (October 17, 2025)
------------------------------

**New Features**

- **Event API**: Added an event API for fine-grained timing and
  synchronization of GPU kernels. This enables more detailed performance
  profiling and better control over asynchronous operations.
  (Dr.Jit PR `#441 `__, Dr.Jit-Core PR `#174 `__).

- **OpenGL Interoperability**: Improved CUDA-OpenGL interoperability with
  simplified APIs. This enables efficient sharing of data between CUDA kernels
  and OpenGL rendering.
  (Dr.Jit PR `#429 `__, Dr.Jit-Core PR `#164 `__, contributed by
  `Merlin Nimier-David `__).

- **Enhanced Int8/UInt8 Support**: Improved support for 8-bit integer types
  with better casting and bitcast operations.
  (Dr.Jit PR `#428 `__, Dr.Jit-Core PR `#163 `__, contributed by
  `Merlin Nimier-David `__).

**Performance Improvements**

- **Register Spilling to Shared Memory**: The CUDA backend now supports
  spilling registers to shared memory, improving performance for kernels with
  high register pressure.
  (Dr.Jit-Core commit `fdc7cae7`).

- **Memory View Support**: Arrays can now be converted to Python
  ``memoryview`` objects for efficient zero-copy data access.
  (commit `b7039184`).

- **DLPack GIL Release**: The ``dr.ArrayBase.dlpack()`` method now releases
  the GIL while waiting, improving multi-threaded performance.
  (commit `0adf9b4a`).

- **Thread Synchronization**: ``dr.sync_thread()`` now releases the GIL while
  waiting, preventing unnecessary blocking in multi-threaded applications.
  (commit `956d2f57`).

**API Improvements**

- **Spherical Direction Utilities**: Added Python implementation of spherical
  direction utilities (``dr.sphdir``).
  (PR `#432 `__, contributed by `Sébastien Speierer `__).

- **Matrix Conversions**: Added support for converting between 3D and 4D
  matrices: ``Matrix4f`` can be constructed from a 3D matrix and ``Matrix3f``
  from a 4D matrix.
  (commit `7f8ea890`).

- **Quaternion API**: Improved the quaternion Python API for better usability
  and consistency.
  (commit `282da88a`).

- **Type casts**: Casts between Dr.Jit types now permit AD<->non-AD
  conversions when required.
  (commit `72f1e6b2`).

**Bug Fixes**

- Fixed deadlock issues in ``@dr.freeze`` decorator.
  (commit `e8fc555e`).

- Fixed gradient tracking in ``Texture.tensor()`` to ensure gradients are
  never dropped inadvertently.
  (PR `#444 `__).

- Fixed AD support for C++ ``repeat`` and ``tile`` operations with proper
  gradient propagation.
  (commits `fd693056`, `282da88a`).

- Fixed Python object traversal to check that ``__dict__`` exists before
  accessing it, preventing crashes with certain object types.
  (commit `433adaf0`).

- Fixed symbolic loop size calculation to properly account for side-effects.
  (Dr.Jit-Core commit `31bf911`).

- Fixed read-after-free issue in OptiX SBT data loading.
  (Dr.Jit-Core commit `009adef`, contributed by `Merlin Nimier-David `__).

**Other Improvements**

- Updated to nanobind `v2.9.2 `__.

- Improved error messages by adding function names to vectorized call errors.
  (Dr.Jit-Core PR `#165 `__, contributed by `Sébastien Speierer `__).

- Added missing checks for JIT leak warnings.
  (Dr.Jit-Core PR `#166 `__, contributed by `Sébastien Speierer `__).

- Added warning for LLVM API initialization failures.
  (Dr.Jit-Core PR `#168 `__, contributed by `Sébastien Speierer `__).

- Fixed pytest warnings and improved test infrastructure.
  (PR `#436 `__).

DrJit 1.1.0 (August 7, 2025)
----------------------------

The v1.1.0 release of Dr.Jit includes several major new features:

**Major Features**

- **Cooperative Vectors**: Dr.Jit now provides an API to efficiently evaluate
  matrix-vector products in parallel programs. The API targets small matrices
  (e.g., 128×128, 64×64, or smaller) and inlines all computation into the
  program. Threads work cooperatively to perform these operations efficiently.
  On NVIDIA GPUs (Turing or newer), this leverages the OptiX cooperative vector
  API with tensor core acceleration. On the LLVM backend, operations compile to
  sequences of packet instructions (e.g., AVX512). See the
  :ref:`cooperative vector documentation ` for more details.

  Example:

  .. code-block:: python

      import drjit as dr
      import drjit.nn as nn
      from drjit.cuda.ad import Float16, TensorXf16

      # Create a random number generator
      rng = dr.rng(seed=0)

      # Create a matrix and bias representing an affine transformation
      A = rng.normal(TensorXf16, shape=(3, 16))  # 3×16 matrix
      b = TensorXf16([1, 2, 3])                  # Bias vector

      # Pack into optimized memory layout
      buffer, A_view, b_view = nn.pack(A, b)

      # Create a cooperative vector from 16 inputs
      vec_in = nn.CoopVec(Float16(1), Float16(2), ...)

      # Perform matrix-vector multiplication: A @ vec_in + b
      vec_out = nn.matvec(A_view, vec_in, b_view)

      # Unpack result back to regular arrays
      x, y, z = vec_out

  (Dr.Jit PR `#384 `__, Dr.Jit-Core PR `#141 `__).

- **Neural Network Library**: Building on the cooperative vector
  functionality, the new :py:mod:`drjit.nn` module provides modular
  abstractions for constructing, evaluating, and optimizing neural networks,
  similar to PyTorch's ``nn.Module``. This enables fully fused evaluation of
  small multilayer perceptrons (MLPs) within larger programs. See the
  :ref:`neural network module documentation ` for more details.

  Example:

  .. code-block:: python

      import drjit as dr
      import drjit.nn as nn
      from drjit.cuda.ad import TensorXf16, Float16

      # Define a small MLP for function approximation
      net = nn.Sequential(
          nn.SinEncode(16),               # Sinusoidal encoding
          nn.Linear(-1, -1, bias=False),  # Hidden layer
          nn.ReLU(),
          nn.Linear(-1, -1, bias=False),  # Hidden layer
          nn.ReLU(),
          nn.Linear(-1, 3, bias=False),   # Output layer (3 outputs)
          nn.Tanh()
      )

      # Instantiate and optimize for 16-bit tensor cores
      rng = dr.rng(seed=0)
      net = net.alloc(dtype=TensorXf16, size=2, rng=rng)
      weights, net = nn.pack(net, layout='training')

      # Evaluate the network
      inputs = nn.CoopVec(Float16(0.5), Float16(0.7))
      outputs = net(inputs)
      x, y, z = outputs  # Three output values

  (PR `#384 `__).

- **Hash Grid Encoding**: Added neural network hash grid encoding inspired by
  `Instant NGP `__, providing multi-resolution spatial encodings. This
  includes both traditional hash grids and `permutohedral encodings `__ for
  efficient high-dimensional inputs.
  (PR `#390 `__, contributed by `Christian Döring `__ and
  `Merlin Nimier-David `__).

- **Function Freezing**: Added the :py:func:`@dr.freeze ` decorator to
  eliminate repeated tracing overhead by caching and replaying JIT-compiled
  kernels. Dr.Jit normally traces operations to build computation graphs for
  compilation, which can become a bottleneck when the same complex computation
  is performed repeatedly (e.g., in optimization loops). The decorator records
  kernel launches on the first call and replays them directly on subsequent
  calls, avoiding re-tracing. This can dramatically accelerate programs and
  makes Dr.Jit usable for realtime rendering and other applications with
  strict timing requirements. See the :ref:`function freezing documentation `
  for more details.

  Example:

  .. code-block:: python

      import drjit as dr
      from drjit.cuda import Float, UInt32

      # Without freezing - traces every time
      def func(x):
          y = seriously_complicated_code(x)
          dr.eval(y)  # ..intermediate evaluations..
          return huge_function(y, x)

      # With freezing - traces only once
      @dr.freeze
      def frozen(x):
          ...  # same code as above -- no changes needed

  (Dr.Jit PR `#336 `__, Dr.Jit-Core PR `#107 `__, contributed by
  `Christian Döring `__).

- **Shader Execution Reordering (SER)**: Added the function
  :py:func:`dr.reorder_threads() ` to shuffle threads across the GPU to reduce
  warp-level divergence. When threads in a warp take different branches (e.g.,
  in :py:func:`dr.switch() ` statements or
  :ref:`vectorized virtual function calls `), performance can degrade
  significantly. SER can group threads with similar execution paths into
  coherent warps to avoid this. This feature is a no-op in LLVM mode.

  Example:

  .. code-block:: python

      import drjit as dr
      from drjit.cuda import Array3f, UInt32

      # Prepare data and callable index
      arg = Array3f(...)
      callable_idx = UInt32(...) % 4  # 4 different callables

      # Reorder threads before dr.switch() to reduce divergence
      # The key uses 2 bits (for 4 callables)
      arg = dr.reorder_threads(key=callable_idx, num_bits=2, value=arg)

      # Now, threads with the same callable_idx are grouped together
      callables = [func0, func1, func2, func3]
      out = dr.switch(callable_idx, callables, arg)

  (Dr.Jit PR `#395 `__, Dr.Jit-Core PR `#145 `__).

  Related to this, the OptiX backend now requires the OptiX 8.0 ABI
  (specifically, ABI version 87). This is a requirement for SER.
  (Dr.Jit-Core PR `#117 `__).

- **Random Number Generation API**: Introduced a new random number generation
  API around an abstract :py:class:`Generator ` object analogous to
  `NumPy `__. Under the hood, this API uses the :py:class:`Philox4x32 `
  counter-based PRNG from `Salmon et al. [2011] `__, which provides
  high-quality random variates that are statistically independent within and
  across parallel streams. Users create generators with :py:func:`dr.rng() `
  and call methods like :py:meth:`.random() ` and :py:meth:`.normal() `.

  Example:

  .. code-block:: python

      import drjit as dr
      from drjit.cuda import Float, TensorXf

      # Create a random number generator
      rng = dr.rng(seed=42)

      # Generate various random distributions
      uniform = rng.random(Float, 1000)        # Uniform [0, 1)
      normal = rng.normal(Float, 1000)         # Standard normal
      tensor = rng.random(TensorXf, (32, 32))  # Random tensor

  (PR `#417 `__).

- **Array Resampling and Convolution**: Added :py:func:`dr.resample() ` for
  changing the resolution of arrays/tensors along specified axes, and
  :py:func:`dr.convolve() ` for convolution with continuous kernels. Both
  operations are fully differentiable and support various reconstruction
  filters (box, linear, cubic, lanczos, gaussian).

  Example:

  .. code-block:: python

      import drjit as dr

      # Resample a 2D signal to different resolution
      data = dr.cuda.TensorXf(original_data)  # Shape: (128, 128)
      upsampled = dr.resample(
          data,
          shape=(256, 256),  # Target resolution
          filter='lanczos'   # High-quality filter
      )

      # Apply Gaussian blur via convolution
      blurred = dr.convolve(
          data,
          filter='gaussian',
          radius=2.0
      )

  (PRs `#358 `__, `#378 `__).

- **Gradient-Based Optimizers**: Added an optimization framework that includes
  various standard optimizers inspired by PyTorch. It includes
  :py:class:`dr.opt.SGD ` with optional momentum and Nesterov acceleration,
  :py:class:`dr.opt.Adam ` with adaptive learning rates, and
  :py:class:`dr.opt.RMSProp `. The optimizers own the parameters and
  automatically handle mixed-precision training. An optional helper class
  :py:class:`dr.opt.GradScalar ` implements adaptive gradient scaling for
  low-precision training.

  Example:

  .. code-block:: python

      import drjit as dr
      from drjit.opt import Adam
      # Differentiable array type (needed for dr.backward() below)
      from drjit.cuda.ad import Float

      # Create optimizer and register parameters
      opt = Adam(lr=1e-3)
      rng = dr.rng(seed=0)
      opt['params'] = Float(rng.normal(Float, 100))

      # Optimization loop for unknown function f(x)
      for i in range(1000):
          # Fetch current parameters
          params = opt['params']

          # Compute loss and gradients
          loss = f(params)  # Some function to optimize
          dr.backward(loss)

          # Update parameters
          opt.step()

  (PRs `#345 `__, `#402 `__, commit `e3f576 `__).

- **TensorFlow Interoperability**: Added TensorFlow interop via
  :py:func:`@dr.wrap `, supporting forward and backward automatic
  differentiation with comprehensive support for variables and tensors.
  (PR `#301 `__, contributed by `Jakob Hoydis `__).

**Array and Tensor Operations**

- Added :py:func:`dr.concat() ` to concatenate arrays/tensors along a
  specified axis following the Array API standard.
  (PR `#354 `__).

- Added :py:func:`dr.take() ` and :py:func:`dr.take_interp() ` for efficient
  tensor indexing and interpolated indexing along specified axes.
  (PR `#420 `__, commit `b59436 `__).

- Added :py:func:`dr.moveaxis() ` for rearranging tensor dimensions, providing
  NumPy-compatible axis movement.
  (commit `4d1478 `__).

- Implemented comprehensive slice operations for regular (non-tensor) arrays,
  supporting advanced patterns like nested slices and integer array indexing.
  (PR `#365 `__).

- Conversion between tensors and nested arrays (e.g., ``Array3f``) now accepts
  an option (``flip_axis=True``) that controls whether the axis order is
  flipped (e.g., ``Nx3`` vs. ``3xN``).
  (PR `#348 `__).

**Performance Improvements**

- Packet scatter-add operations now map to specialized GPU operations when
  supported by the hardware and driver. This change also broadens the
  situations where packet operations can be used on the CPU and GPU. Packets of
  size 6 were not supported in the past since their size was not a power of
  two. Now, they are treated as 3 separate size-2 packets. This feature is
  particularly helpful in combination with the new hash grid class, whose
  reverse-mode derivative generates atomic packet scatter-additions.
  (Dr.Jit-Core PR `#151 `__, Dr.Jit PR `#406 `__).

- Enabled packet memory operations for texture access, providing speedups when
  accessing multi-channel textures on the LLVM and CUDA backends.
  (PR `#329 `__).

- Optimized :py:func:`dr.rsqrt() ` to compile to faster instruction sequences
  on the LLVM backend using ``VRSQRTPS`` with Newton-Raphson iteration on Intel
  processors and similar optimizations for ARM Neon.
  (Dr.Jit PR `#343 `__, Dr.Jit-Core PR `#125 `__).

- Made :py:func:`dr.any() `, :py:func:`dr.all() `, and :py:func:`dr.none() `
  asynchronous with respect to the host, improving GPU utilization.
  (Dr.Jit PR `#344 `__, Dr.Jit-Core PR `#126 `__).

**Random Number Generation (contd.)**

- Added PCG32 reverse generation capabilities with ``prev_*`` methods for all
  random number generation functions, enabling bidirectional traversal of
  random sequences.
  (PR `#398 `__).

- Added PCG32 methods for generating normally distributed variates:
  :py:func:`PCG32.next_float_normal() `,
  :py:func:`PCG32.next_float32_normal() `, and
  :py:func:`PCG32.next_float64_normal() `.
  (PR `#353 `__).

- Added :py:func:`dr.mul_wide() ` and :py:func:`dr.mul_hi() ` for wide integer
  multiplication, essential for implementing the Philox PRNG.
  (Dr.Jit PR `#414 `__, Dr.Jit-Core PR `#156 `__).

**API Improvements**

- Refined semantics of :py:func:`dr.forward_from() ` and
  :py:func:`dr.backward_from() ` to preserve existing gradients instead of
  unconditionally overriding them.
  (Dr.Jit PR `#351 `__).

- Added utility functions :py:func:`dr.zeros_like() `,
  :py:func:`dr.ones_like() `, and :py:func:`dr.empty_like() `.
  (PR `#345 `__).

- Added :py:meth:`dr.ArrayBase.item() ` method for extracting scalar values
  from single-element arrays/tensors, similar to NumPy/PyTorch.
  (commit `a142bc `__).

- Added :py:func:`dr.linear_to_srgb() ` and :py:func:`dr.srgb_to_linear() ` for
  color space conversions.
  (commit `a7f138 `__).

- Added :py:attr:`JitFlag.ForbidSynchronization` to catch costly
  synchronization operations during development.
  (Dr.Jit PR `#350 `__, Dr.Jit-Core PR `#128 `__).

- Added C++ bindings for thread-local memory arrays through the ``dr::Local``
  template, complementing the existing Python functionality. This enables
  efficient scratch space and stack-like data structures in GPU kernels from
  C++ code.
  (commit `c30ade `__).

**Notable Bugfixes**

- Fixed ``dr::block_reduce()`` derivative computation for arrays not evenly
  divisible by block size.
  (commit `df79ed `__).

- Fixed potential performance cliffs in :py:func:`dr.gather() ` by memoizing
  expressions and limiting expression growth.
  (Dr.Jit-Core PR `#159 `__).

- Fixed :py:func:`dr.rotate() ` quaternion component ordering to match C++
  implementation.
  (PR `#416 `__).

- Fixed the derivative of :py:func:`dr.unit_angle() ` at signed zero.
  (commit `9d09a9 `__).

- Fixed memory leak in Python bindings using dedicated cleanup thread.
  (PR `#399 `__).

- Preserved tensor shapes in symbolic operations.
  (commit `74c4d0 `__).

- Fixed evaluated loop derivative issues with unchanged differentiable state
  variables.
  (commit `074cfe `__).

- Fixed symbolic loop backward derivative compilation for simple loops.
  (commit `01ef10 `__).

- Fixed broadcasting of tensors and handling of unknown objects in
  :py:func:`dr.select()