.. py:currentmodule:: drjit .. _changelog: Changelog ######### DrJit 1.4.0 (June 25, 2026) --------------------------- **Major new Features** - **Metal Backend**: Dr.Jit can now target Apple Silicon GPUs through a new Metal backend. It supports the full range of Dr.Jit features including symbolic control flow, automatic differentiation, hardware-accelerated ray tracing and textures, :ref:`cooperative vectors `, and reductions. (contributed by `Sébastien Speierer `__ and `Wenzel Jakob `__). - **Matrix Multiplication for Tensors**: The ``@`` operator and :py:func:`dr.matmul() ` now support tensors of any size and shape, fully replicating NumPy / PyTorch semantics including batched matrix products, broadcasting, matrix-vector products, and inner products. The operation is fully differentiable in both forward and reverse modes. Under the hood, this dispatches to efficient :ref:`block-tiled GEMM ` kernels shipped with Dr.Jit-Core. (Dr.Jit commit `183dc4 `__, Dr.Jit-Core PR `#188 `__, Dr.Jit-Core commits `0cca8d `__, `432ed4 `__, `444c8d `__, `4b8864 `__, `9e5335 `__). - **Generalized convolution and resampling**: The function :py:func:`dr.convolve() ` now handles discrete filter kernels besides continuous ones, making it a Dr.Jit substitute for :py:func:`numpy.convolve`. A new ``boundary`` parameter generalizes edge handling (``"zero"``, ``"nearest"``, ``"wrap"``, ``"reflect"``, or ``"mirror"``). A ``normalize`` flag toggles the renormalization of filter weights. The efficiency of both the forward pass and reverse-mode derivative was improved via a fast path for non-boundary outputs, and by switching to a fast transpose convolution instead of atomic scatters whenever possible. The new ``boundary`` argument is also available on :py:func:`dr.resample() `. - **Transpose**: Added :py:attr:`dr.ArrayBase.T ` and :py:attr:`dr.ArrayBase.mT `, matching PyTorch's semantics. (PR `#486 `__). 
- **Muon Optimizer**: Added :py:class:`dr.opt.Muon ` ("MomentUm Orthogonalized
  by Newton-Schulz"), an optimizer for 2D hidden weights of neural networks.
  (commit `d205c1 `__).

- **Redesign of the** :py:mod:`drjit.nn` **API**. Besides
  :ref:`cooperative vectors `, the :py:mod:`drjit.nn` API now also accepts
  regular tensors as inputs. Cooperative vectors fuse with surrounding
  computation, while tensor evaluation enables batched evaluation of large
  networks. See the :ref:`neural network documentation ` for details on both
  modes.

  Previously, it was necessary to extract the packed buffer copy and manually
  cast it between the working and optimizer precision.

  .. code-block:: python

     weights, net = nn.pack(net, layout='training')
     opt = Adam(lr=1e-3, params={'weights': Float32(weights)})

     for i in range(n):
         weights[:] = Float16(opt['weights'])
         ...

  The new API exposes a cleaner interface that automates all of these steps:

  .. code-block:: python

     net = nn.pack(net, layout='training')
     opt = Adam(lr=1e-3)
     opt.update(net)

     for i in range(n):
         net.update(opt)
         ...

  :py:class:`nn.Module ` subclasses implement a :py:class:`MutableMapping `
  keyed by the path to each parameter (e.g. ``'layers.0.weights'``).
  ``opt.update(net)`` pulls the parameters into the optimizer, while
  ``net.update(opt)`` pushes the updated state back.

  The :py:func:`nn.pack() ` function is now differentiable. This enables the
  use of cooperative vectors with matrix-level optimizers like
  :py:class:`Muon `.

- **Reverse-mode differentiation of symbolic loops**:
  :py:func:`@dr.syntax ` ``while`` loops and symbolic
  :py:func:`dr.while_loop() ` calls are now differentiable in reverse mode via
  trajectory replay. See the :ref:`documentation ` for details.

- **NumPy-style advanced tensor indexing**: Tensor indexing with multiple
  integer arrays now follows NumPy/PyTorch semantics. Previously,
  ``t[arr1, arr2]`` selected all combinations (a grid); it now selects
  element-wise pairs, matching ``torch`` and ``numpy`` behavior.
  Non-consecutive array indices (e.g., ``t[arr, :, arr]``) correctly broadcast
  and move the result dimension to the front of the output shape.

- **NumPy-style array/tensor manipulation and sorting**: This release brings a
  large set of NumPy-compatible functions for sorting, reshaping, and
  reindexing arrays and tensors. This includes :py:func:`dr.sort() `,
  :py:func:`dr.argsort() `, :py:func:`dr.argmin() ` and
  :py:func:`dr.argmax() `, which are backed by an efficient GPU-accelerated
  multi-bit radix sort. Other new shape manipulation functions include
  :py:func:`dr.expand_dims() `, :py:func:`dr.squeeze() `,
  :py:func:`dr.transpose() `, and :py:func:`dr.swapaxes() `. (PR `#496 `__).

- **NumPy-consistent reductions**: The horizontal reductions
  (:py:func:`dr.sum() `, :py:func:`dr.prod() `, :py:func:`dr.min() `,
  :py:func:`dr.max() `, :py:func:`dr.mean() `, :py:func:`dr.all() `,
  :py:func:`dr.any() `, :py:func:`dr.none() `, :py:func:`dr.count() `,
  :py:func:`dr.reduce() `, :py:func:`dr.norm() `,
  :py:func:`dr.squared_norm() `) now mirror NumPy more closely by accepting a
  ``keepdims`` flag, with full tensor support. :py:func:`dr.norm() ` and
  :py:func:`dr.squared_norm() ` additionally gain the ``axis`` and ``mode``
  parameters shared by the rest of the family. Finally, this release adds
  NumPy-compatible :py:func:`dr.var() ` and :py:func:`dr.std() ` functions.
  (PR `#493 `__).

- **Test assertions**: Added :py:func:`dr.assert_allclose() `, an assertion
  utility for correctness checks in test cases that complements
  :py:func:`dr.allclose() `. (PR `#489 `__).

**Performance Improvements**

- **Tracing and evaluation**: A comprehensive optimization pass targeted
  Dr.Jit's tracing/code generation phases and Python bindings, making them
  roughly **twice as fast**. This will help workloads bottlenecked on
  tracing/Python-related overheads. (Dr.Jit commits `534829 `__, `3fba39 `__,
  `6b212c `__, `50986a `__, Dr.Jit-Core PR `#194 `__).
- **Frozen function replay**: The :py:func:`@dr.freeze ` replay path was thoroughly optimized, accelerating it by up to ~2.5x. (Dr.Jit commits `ff09ee `__, `c1282c `__, `13fe80 `__). - **Faster function calls**: Dr.Jit now generates much better code for indirect function calls in kernels (e.g., method calls on arrays of object instances, :py:func:`dr.switch() `, and :py:func:`dr.dispatch() `). The per-instance data of all callables is now merged into a single per-kernel buffer and fetched using vectorized packet loads, rather than being scattered across many small buffers and read element by element. On the LLVM backend, call inputs and outputs are additionally passed in registers rather than stack scratch space, which reduces memory traffic and improves performance. Dr.Jit also uses more efficient data structures to collect call data, which speeds up the compilation of kernels that dispatch to a large number of instances. (Dr.Jit-Core commits `1ed505 `__, `bc6d9c `__, `69120f `__, `83207d `__). - **LLVM code generation**: Load/store aliasing metadata was improved so that non-conflicting memory operations within a kernel can be freely reordered or hoisted out of loops, which improves performance of kernels on the LLVM backend. (Dr.Jit-Core commit `84c85b `__). - **Warp-reduction for packet scatter-reduce**: On the CUDA and Metal backends, :py:func:`dr.scatter_reduce() ` now provides a *packet-aware* reduction path that jointly reduces values within the warp/simdgroup before issuing scalar or packet atomics depending on hardware/driver support. (Dr.Jit-Core PR `#190 `__). - **nanobind optimizations**: Dr.Jit benefits from optimizations introduced in `nanobind v2.13 `__. This release adds *instance pooling*, which provides a cache to recycle short-lived objects. Dr.Jit opts into this feature to accelerate tracing, which generates large amounts of temporaries. Other optimizations target object creation/destruction and nd-array exchange. 
  (Dr.Jit commit `6b212c `__, nanobind PRs `#1366 `__, `#1374 `__, `#1375 `__).

- **nanothread optimizations**: The thread pool driving parallel evaluation was
  improved:

  - **Faster worker wake-up**: idle worker threads busy-poll for a short while
    and then go to sleep to avoid wasting power. The new version of nanothread
    is more careful to wake only the required number of threads, and it does so
    using efficient OS primitives, such as `futex `__ on Linux. (commits
    `73efa1 `__, `366774 `__).

  - **Worker count**: the main thread now "counts" as a member of the thread
    pool, since it pitches in when waiting for work. On Apple silicon, workers
    now only run on "performance cores", as parallelization over "efficiency
    cores" tends to add tail latency that slows down parallel workloads.
    (commits `03cacd `__, `348404 `__, `e68a4d `__, `098925 `__, `beca8c `__).

  - **Fixed timing glitches**: timing information reported by
    :py:func:`dr.kernel_history() ` would occasionally contain nonsensical
    values close to ``2^64`` due to a race condition that is now fixed.
    (commit `f11692 `__).

**Minor features**

- **CUDA Green Context API**: Added :py:class:`drjit.cuda.green_context`, a
  context manager that isolates kernels to a subset of the GPU's streaming
  multiprocessors. See the :ref:`green context documentation ` for details.
  (Dr.Jit commit `6c69ec `__, Dr.Jit-Core commit `d4f1a6 `__).

- **Command queue flushing**: The new :py:func:`dr.flush_thread() ` function
  flushes queued work to the GPU, which is needed for multi-threaded use of
  Dr.Jit on the Metal backend. (Dr.Jit commit `c68e00 `__, Dr.Jit-Core commit
  `467dd3 `__).

**Bug Fixes**

- Fixed a bug in :py:meth:`dr.rng().integers() ` where a symbolic loop was
  misused, producing invalid LLVM IR. (commit `f7054e `__).

- Fixed a variable shadowing bug in ``_flatten``/``_unflatten`` that caused
  crashes when flattening PyTrees containing custom ``DRJIT_STRUCT`` types.
  (PR `#482 `__).
- Fixed a bug in :py:class:`nn.SinEncode ` where the per-octave phase offset did not match the documented formula. Code using ``shift=0`` is unaffected. - Fixed incorrect type names in :py:func:`dr.graphviz_ad() `. (commit `0c685e `__). - Fixed minor memory leaks due to recorded/frozen kernels. (Dr.Jit-Core commit `f0bf64 `__). - Fixed memory leaks related to kernel histories. (Dr.Jit-Core commit `318e55 `__). - Renamed the conflicting ``KernelRecordingMode.None`` enumerator to ``Inactive`` to avoid the collision with Python's ``None``. (Dr.Jit-Core PR `#186 `__, Dr.Jit PR `#481 `__). - Fixed several issues involving symbolic loops with aliased state variables. (Dr.Jit PRs `#505 `__, `#510 `__, Dr.Jit-Core PR `#198 `__, contributed by `Lovro Nuic `__). - Fixed half-precision ``Min``/``Max`` reductions and the half-precision infinity constant. (Dr.Jit-Core PR `#199 `__). - Various smaller backend fixes: a missing mask predicate in the CUDA packet ``scatter_reduce`` path, a crash in Metal cooperative-vector matrix-vector products with unsupported output dimensions, a race condition under multi-threaded Metal use, incorrect fast-math flag handling on ``Sqrt`` and ``Div`` nodes, and more robust handling of failed ``jit_eval()`` calls. (Dr.Jit-Core PRs `#191 `__, `#200 `__, `#196 `__, `#192 `__, commits `368c53 `__, `37bbce `__, Dr.Jit PR `#503 `__). **Other Improvements** - Improved documentation and error messages when the Dr.Jit binary fails to load. (PR `#485 `__). - Various improvements to Dr.Jit's static type annotations: added missing stubs for :py:func:`dr.mean() `, added type hints for ``PrefixRedOp``, and minor stub pattern replacement rule fixes. (PRs `#478 `__, `#480 `__, `#483 `__). - **Release the GIL while waiting for kernel history**: Retrieving timing data via :py:func:`dr.kernel_history() ` now releases the GIL while waiting for the asynchronous results to arrive, allowing other Python threads to make progress in the meantime. 
  (commits `766e1e `__, `f90bfd `__).

- **ndarray Cleanup**: ndarray reclamation previously always went through an
  asynchronous cleanup thread. This detour is now skipped for CUDA and Metal
  arrays when the calling thread already holds the GIL. (commit `c01a23 `__).

**API Breaks and Device Compatibility**

- ⚠️ :py:func:`nn.pack() ` and :py:func:`nn.unpack() ` **no longer return the
  shared buffer as the first element of the result tuple**. They now return
  only the packed/unpacked PyTree with matrix views in place of the input
  tensors. The underlying buffer remains available via the
  :py:attr:`MatrixView.buffer ` attribute, or, for a packed
  :py:class:`nn.Module `, via the ``'weights'`` entry of the module's mapping
  interface (i.e. ``net['weights']``).

  Migration:

  .. code-block:: python

     # Before
     buffer, A_view, b_view = nn.pack(A, b, layout='training')
     dr.enable_grad(buffer)

     # After
     A_view, b_view = nn.pack(A, b, layout='training')
     dr.enable_grad(A_view.buffer)

  For a packed :py:class:`nn.Module `:

  .. code-block:: python

     # Before
     buffer, net = nn.pack(net, layout='training')
     dr.enable_grad(buffer)

     # After
     net = nn.pack(net, layout='training')
     dr.enable_grad(net['weights'])

- ⚠️ **Removed TensorFlow support**. TensorFlow appears largely unmaintained.
  Over a year after the launch of NVIDIA's Blackwell GPU generation, the
  official TensorFlow packages still do not support it. This is a maintenance
  burden as our CI infrastructure uses this GPU. Consequently, we decided to
  drop TensorFlow support (``.tf()`` conversion, support in
  :py:func:`@dr.wrap `).

- ⚠️ **Removed Kahan-compensated atomic scatter**. The
  ``drjit.scatter_add_kahan()`` operation was removed. See commit `f6b4be `__
  for the rationale.

- ⚠️ **Compute capability**. Dr.Jit-Core's CUDA backend now requires compute
  capability **7.5 or higher** (Turing and later) and NVIDIA driver **R535 or
  newer**. (Dr.Jit-Core PR `#188 `__).
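
The advanced tensor indexing change listed among the major features of this release alters results without raising an error, so existing code that relied on the old grid behavior deserves a second look. The following is a minimal sketch of the two behaviors in plain Python on nested lists. It does not call Dr.Jit, it omits broadcasting, and the helper names ``index_pairs`` and ``index_grid`` are hypothetical and exist only for illustration.

```python
# Illustrative only: plain Python lists, not Dr.Jit tensors.
# `index_pairs` and `index_grid` are hypothetical helper names.

def index_pairs(t, rows, cols):
    """New behavior: t[rows, cols] selects element-wise (row, col) pairs."""
    if len(rows) != len(cols):
        raise ValueError("index arrays must have matching lengths")
    return [t[r][c] for r, c in zip(rows, cols)]


def index_grid(t, rows, cols):
    """Old behavior: every combination of rows and cols (a grid)."""
    return [[t[r][c] for c in cols] for r in rows]


t = [[0, 1, 2],
     [10, 11, 12],
     [20, 21, 22]]
rows, cols = [0, 2], [1, 2]

pairs = index_pairs(t, rows, cols)  # picks entries (0, 1) and (2, 2)
grid = index_grid(t, rows, cols)    # picks the 2x2 block of all combinations
```

Two index arrays of length 2 thus yield a result of length 2 under the new semantics, whereas the old semantics produced a 2x2 result.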
DrJit 1.3.1 (February 23, 2026) ------------------------------- **Bug Fixes** - Fixed LLVM library search paths to include ``aarch64`` and WSL-specific directories. This resolves failures to locate LLVM on ARM Linux systems and Windows Subsystem for Linux. (Dr.Jit-Core PR `#185 `__). - Fixed ordering of CUDA forward declarations of callables, resolving cases where a forward declaration could appear after the actual function definition. (Dr.Jit-Core commit `213983 `__). DrJit 1.3.0 (February 16, 2026) ------------------------------- **New Features** - **Atomic Scatter Operations**: Added :py:func:`dr.scatter_cas() ` (atomic compare-and-swap) and :py:func:`dr.scatter_exch() ` (atomic exchange) operations. On the CUDA backend, these map to native PTX instructions; the LLVM implementation uses a loop over the vectorization width. (Dr.Jit PR `#450 `__, Dr.Jit-Core PR `#177 `__). - **AdamW Optimizer**: Added the :py:class:`dr.opt.AdamW ` optimizer with built-in weight decay, equivalent to PyTorch's implementation. (PR `#449 `__). - **AMSGrad for Adam/AdamW**: The :py:class:`dr.opt.Adam ` and :py:class:`dr.opt.AdamW ` optimizers now support an optional ``amsgrad`` parameter. AMSGrad keeps a running maximum of the second moments, which can help improve stability near local minima. (PR `#467 `__). - **Functions in IR** :py:func:`dr.func`: A new function decorator that forces a Python function to also become a callable in the generated IR. This can improve compilation times: without it, Dr.Jit emits the function body's IR every time it is called within a single kernel. With ``@dr.func``, each call resolves to a function call in the IR, emitting the body only once. (Dr.Jit PR `#473 `__, Dr.Jit-Core PR `#183 `__). - **Oklab Color Space Conversion**: Added :py:func:`dr.linear_srgb_to_oklab() ` and :py:func:`dr.oklab_to_linear_srgb() ` for perceptually uniform color space conversion. (PR `#453 `__). 
- **Pickling Support**: Dr.Jit arrays can now be natively pickled and unpickled
  via Python's ``pickle`` module. (PR `#448 `__).

- **Bounded Integer RNG**: Added :py:meth:`dr.rng().integers() ` to generate
  uniformly distributed integers on a given interval. (commit `cb09ca `__).

- **Symbolic RNG mode**: :py:func:`dr.rng() ` now accepts a ``symbolic``
  argument for a purely symbolic sampler. (commit `51bacb `__).

- **ArrayX Initialization from Tensors**: Nested array types with multiple
  dynamic dimensions (like ``ArrayXf``) can now be initialized from Dr.Jit
  tensors or NumPy arrays. (commit `e7e133 `__).

- **Type Trait**: Added :py:func:`dr.replace_shape_t() ` convenience type trait
  for writing generic functions that need to reshape array types. (commit
  `46b245 `__).

**Hardware/platform-specific features**

- **NVIDIA Blackwell (SM120+)**: Added support for wide packet loads, gathers,
  and atomics on NVIDIA Blackwell GPUs (SM120+). (commit `879c10 `__).

- **Python 3.14 Compatibility**: Fixed compatibility with PEP 649 deferred
  annotation evaluation, ensuring Dr.Jit works correctly on Python 3.14.
  (commit `7fa6eb `__).

- **Linux ARM Wheels**: Added ``ubuntu-24.04-arm`` to the wheels pipeline.
  (PR `#461 `__, contributed by `Merlin Nimier-David `__).

**Performance Improvements**

- **Simplified Single-Target Virtual Calls**: When a virtual function call has
  only a single target (as is the case for ``@dr.func``), the JIT backend now
  eliminates the indirection/dispatch loop and calls the function directly,
  producing simpler IR. (Dr.Jit-Core PR `#183 `__).

- **AD Early Exit for Zero Derivatives**: The AD graph traversal now skips
  edges with zero-valued derivatives, avoiding unnecessary computation.
  (commit `06b0a9 `__).

- **GIL Release in __getitem__**: ``dr.ArrayBase.__getitem__()`` now releases
  the GIL while waiting, improving multi-threaded performance. (commit
  `c24be7 `__).
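
To give a rough idea of why skipping zero-valued derivatives saves work, here is a toy reverse-mode traversal in plain Python. It is a sketch under stated assumptions and not Dr.Jit's AD implementation: the graph layout, the ``backward`` helper, and the choice to skip both zero incoming gradients and zero edge weights are all hypothetical.

```python
# Toy reverse-mode traversal; NOT Dr.Jit's actual AD implementation.
# Each node lists its inputs as (input_name, local_derivative) edges.

def backward(edges, order, seed):
    """Propagate gradients from outputs to inputs.

    edges: dict mapping node -> list of (input_node, local_derivative)
    order: nodes sorted so that each node precedes its inputs
    seed:  dict mapping output node -> initial gradient
    Returns (gradients, number_of_edges_processed).
    """
    grad = dict(seed)
    processed = 0
    for node in order:
        g = grad.get(node, 0.0)
        if g == 0.0:
            continue  # nothing to propagate from this node
        for src, weight in edges.get(node, []):
            if weight == 0.0:
                continue  # skip edges whose derivative is exactly zero
            grad[src] = grad.get(src, 0.0) + g * weight
            processed += 1
    return grad, processed


# y = a * b + 0 * c, evaluated at a = 2, b = 3
edges = {
    "y": [("p", 1.0), ("q", 1.0)],  # y = p + q
    "p": [("a", 3.0), ("b", 2.0)],  # p = a * b
    "q": [("c", 0.0)],              # q = 0 * c, derivative is zero
}
grad, processed = backward(edges, ["y", "p", "q"], {"y": 1.0})
```

In this sketch the edge into ``c`` is never visited, so ``c`` receives no gradient entry and only four of the five edges are processed.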
**Bug Fixes** - Fixed a bug where constructing a cooperative vector inside a ``dr.suspend_grad()`` scope could raise an exception. (PR `#475 `__, contributed by `Christian Döring `__). - Fixed a crash when calling a frozen function with a re-seeded random number generator whose seed was a Python integer type. (PR `#471 `__, contributed by `Christian Döring `__). - Fixed a bug in the C++ ``transform_compose()`` function where the translation was placed in the last row of the matrix rather than the last column. (PR `#451 `__, contributed by `Delio Vicini `__). - Fixed multiple issues in the Dr.Jit-Core ``gather`` re-indexing logic: the mask stack is now correctly applied during re-indexing, and nested gather masks are combined rather than overwritten. (Dr.Jit-Core PR `#178 `__). - Fixed a bug in virtual call analysis when a target contained a symbolic loop — the analysis now accounts for eliminated/optimized-out loop state variables. (Dr.Jit-Core PR `#184 `__). - Fixed LLVM backend compilation of wavefront loops with scalar masks. (commit `16a81d `__). - Fixed lost tensor shapes when a loop or conditional is replayed for AD passes, with more robust inference of tensor output shapes. (commit `9d201f `__). - Fixed a regression in ``ArrayX`` initialization from tensors and NumPy ndarrays (wrong shape hint order for flipped axes and broken shift loop). (commit `df4cf4 `__). - Fixed ``Texture::eval_fetch_cuda`` to handle double-precision queries gracefully by casting to single-precision when a HW-accelerated texture is requested. (commits `83083d `__, `054d11 `__). - Fixed symbolic loop size computation to also account for side-effect sizes. (Dr.Jit-Core commit `c6dfc8 `__). - Fixed spurious warning when freezing functions with very wide literals. (PR `#455 `__). **Other Improvements** - Updated to nanobind `v2.10.2 `__. 
- Improved documentation and log messages for textures, including clarifications regarding numerical precision and extra diagnostics for migrated textures. (commit `4edae0 `__). DrJit 1.2.0 (September 17, 2025) -------------------------------- **New Features** - **Event API**: Added an event API for fine-grained timing and synchronization of GPU kernels. This enables more detailed performance profiling and better control over asynchronous operations. (Dr.Jit PR `#441 `__, Dr.Jit-Core PR `#174 `__). - **OpenGL Interoperability**: Improved CUDA-OpenGL interoperability with simplified APIs. This enables efficient sharing of data between CUDA kernels and OpenGL rendering. (Dr.Jit PR `#429 `__, Dr.Jit-Core PR `#164 `__, contributed by `Merlin Nimier-David `__). - **Enhanced Int8/UInt8 Support**: Improved support for 8-bit integer types with better casting and bitcast operations. (Dr.Jit PR `#428 `__, Dr.Jit-Core PR `#163 `__, contributed by `Merlin Nimier-David `__). **Performance Improvements** - **Register Spilling to Shared Memory**: CUDA backend now supports spilling registers to shared memory, improving performance for kernels with high register pressure. (Dr.Jit-Core commit `5cf6d3 `__). - **Memory View Support**: Arrays can now be converted to Python ``memoryview`` objects for efficient zero-copy data access. (commit `b70391 `__). - **DLPack GIL Release**: The ``dr.ArrayBase.dlpack()`` method now releases the GIL while waiting, improving multi-threaded performance. (commit `0adf9b `__). - **Thread Synchronization**: ``dr.sync_thread()`` now releases the GIL while waiting, preventing unnecessary blocking in multi-threaded applications. (commit `956d2f `__). **API Improvements** - **Spherical Direction Utilities**: Added Python implementation of spherical direction utilities (``dr.sphdir``). (PR `#432 `__, contributed by `Sébastien Speierer `__). 
- **Matrix Conversions**: Added support for converting between 3D and 4D matrices: ``Matrix4f`` can be constructed from a 3D matrix and ``Matrix3f`` from a 4D matrix. (commit `7f8ea8 `__). - **Quaternion API**: Improved the quaternion Python API for better usability and consistency. (commit `282da8 `__). - **Type casts**: Allow casting between Dr.Jit types to properly allow AD<->non-AD conversions when required. (commit `72f1e6 `__). **Bug Fixes** - Fixed deadlock issues in ``@dr.freeze`` decorator. (commit `e8fc55 `__). - Fixed gradient tracking in ``Texture.tensor()`` to ensure gradients are never dropped inadvertently. (PR `#444 `__). - Fixed AD support for C++ ``repeat`` and ``tile`` operations with proper gradient propagation. (commits `fd6930 `__, `282da8 `__). - Fixed Python object traversal to check that ``__dict__`` exists before accessing it, preventing crashes with certain object types. (commit `433ada `__). - Fixed symbolic loop size calculation to properly account for side-effects. (Dr.Jit-Core commit `31bf91 `__). - Fixed read-after-free issue in OptiX SBT data loading. (Dr.Jit-Core commit `009ade `__, contributed by `Merlin Nimier-David `__). **Other Improvements** - Updated to nanobind `v2.9.2 `__ - Improved error messages by adding function names to vectorized call errors. (Dr.Jit-Core PR `#165 `__, contributed by `Sébastien Speierer `__). - Added missing checks for JIT leak warnings. (Dr.Jit-Core PR `#166 `__, contributed by `Sébastien Speierer `__). - Added warning for LLVM API initialization failures. (Dr.Jit-Core PR `#168 `__, contributed by `Sébastien Speierer `__). - Fixed pytest warnings and improved test infrastructure. (PR `#436 `__). DrJit 1.1.0 (August 7, 2025) ---------------------------- The v1.1.0 release of Dr.Jit includes several major new features: **Major Features** - **Cooperative Vectors**: Dr.Jit now provides an API to efficiently evaluate matrix-vector products in parallel programs. 
  The API targets small matrices (e.g., 128×128, 64×64, or smaller) and inlines
  all computation into the program. Threads work cooperatively to perform these
  operations efficiently. On NVIDIA GPUs (Turing or newer), this leverages the
  OptiX cooperative vector API with tensor core acceleration. On the LLVM
  backend, operations compile to sequences of packet instructions (e.g.,
  AVX512). See the :ref:`cooperative vector documentation ` for more details.

  Example:

  .. code-block:: python

     import drjit as dr
     import drjit.nn as nn
     from drjit.cuda.ad import Float16, TensorXf16

     # Create a random number generator
     rng = dr.rng(seed=0)

     # Create a matrix and bias representing an affine transformation
     A = rng.normal(TensorXf16, shape=(3, 16))  # 3×16 matrix
     b = TensorXf16([1, 2, 3])                  # Bias vector

     # Pack into optimized memory layout
     buffer, A_view, b_view = nn.pack(A, b)

     # Create a cooperative vector from 16 inputs
     vec_in = nn.CoopVec(Float16(1), Float16(2), ...)

     # Perform matrix-vector multiplication: A @ vec_in + b
     vec_out = nn.matvec(A_view, vec_in, b_view)

     # Unpack result back to regular arrays
     x, y, z = vec_out

  (Dr.Jit PR `#384 `__, Dr.Jit-Core PR `#141 `__).

- **Neural Network Library**: Building on the cooperative vector functionality,
  the new :py:mod:`drjit.nn` module provides modular abstractions for
  constructing, evaluating, and optimizing neural networks, similar to
  PyTorch's ``nn.Module``. This enables fully fused evaluation of small
  multilayer perceptrons (MLPs) within larger programs. See the
  :ref:`neural network module documentation ` for more details.

  Example:

  .. code-block:: python

     import drjit as dr
     import drjit.nn as nn
     from drjit.cuda.ad import TensorXf16, Float16

     # Define a small MLP for function approximation
     net = nn.Sequential(
         nn.SinEncode(16),               # Sinusoidal encoding
         nn.Linear(-1, -1, bias=False),  # Hidden layer
         nn.ReLU(),
         nn.Linear(-1, -1, bias=False),  # Hidden layer
         nn.ReLU(),
         nn.Linear(-1, 3, bias=False),   # Output layer (3 outputs)
         nn.Tanh()
     )

     # Instantiate and optimize for 16-bit tensor cores
     rng = dr.rng(seed=0)
     net = net.alloc(dtype=TensorXf16, size=2, rng=rng)
     weights, net = nn.pack(net, layout='training')

     # Evaluate the network
     inputs = nn.CoopVec(Float16(0.5), Float16(0.7))
     outputs = net(inputs)
     x, y, z = outputs  # Three output values

  (PR `#384 `__).

- **Hash Grid Encoding**: Added neural network hash grid encoding inspired by
  `Instant NGP `__, providing multi-resolution spatial encodings. This includes
  both traditional hash grids and `permutohedral encodings `__ for efficient
  high-dimensional inputs. (PR `#390 `__, contributed by `Christian Döring `__
  and `Merlin Nimier-David `__).

- **Function Freezing**: Added the :py:func:`@dr.freeze ` decorator to
  eliminate repeated tracing overhead by caching and replaying JIT-compiled
  kernels. Dr.Jit normally traces operations to build computation graphs for
  compilation, which can become a bottleneck when the same complex computation
  is performed repeatedly (e.g., in optimization loops). The decorator records
  kernel launches on the first call and replays them directly on subsequent
  calls, avoiding re-tracing. This can dramatically accelerate programs and
  makes Dr.Jit usable for realtime rendering and other applications with strict
  timing requirements. See the :ref:`function freezing documentation ` for more
  details.

  Example:

  .. code-block:: python

     import drjit as dr
     from drjit.cuda import Float, UInt32

     # Without freezing - traces every time
     def func(x):
         y = seriously_complicated_code(x)
         dr.eval(y)  # ..intermediate evaluations..
         return huge_function(y, x)

     # With freezing - traces only once
     @dr.freeze
     def frozen(x):
         ...  # same code as above -- no changes needed

  (Dr.Jit PR `#336 `__, Dr.Jit-Core PR `#107 `__, contributed by
  `Christian Döring `__).

- **Shader Execution Reordering (SER)**: Added the function
  :py:func:`dr.reorder_threads() ` to shuffle threads across the GPU to reduce
  warp-level divergence. When threads in a warp take different branches (e.g.,
  in :py:func:`dr.switch() ` statements or
  :ref:`vectorized virtual function calls `), performance can degrade
  significantly. SER can group threads with similar execution paths into
  coherent warps to avoid this. This feature is a no-op in LLVM mode.

  Example:

  .. code-block:: python

     import drjit as dr
     from drjit.cuda import Array3f, UInt32

     # Prepare data and callable index
     arg = Array3f(...)
     callable_idx = UInt32(...) % 4  # 4 different callables

     # Reorder threads before dr.switch() to reduce divergence
     # The key uses 2 bits (for 4 callables)
     arg = dr.reorder_threads(key=callable_idx, num_bits=2, value=arg)

     # Now, threads with the same callable_idx are grouped together
     callables = [func0, func1, func2, func3]
     out = dr.switch(callable_idx, callables, arg)

  (Dr.Jit PR `#395 `__, Dr.Jit-Core PR `#145 `__).

  Related to this, the OptiX backend now requires the OptiX 8.0 ABI
  (specifically, ABI version 87). This is a requirement for SER.
  (Dr.Jit-Core PR `#117 `__).

- **Random Number Generation API**: Introduced a new random number generation
  API around an abstract :py:class:`Generator ` object analogous to `NumPy `__.
  Under the hood, this API uses the :py:class:`Philox4x32 ` counter-based PRNG
  from `Salmon et al. [2011] `__, which provides high-quality random variates
  that are statistically independent within and across parallel streams. Users
  create generators with :py:func:`dr.rng() ` and call methods like
  :py:meth:`.random() ` and :py:meth:`.normal() `.

  Example:

  .. code-block:: python

     import drjit as dr
     from drjit.cuda import Float, TensorXf

     # Create a random number generator
     rng = dr.rng(seed=42)

     # Generate various random distributions
     uniform = rng.random(Float, 1000)       # Uniform [0, 1)
     normal = rng.normal(Float, 1000)        # Standard normal
     tensor = rng.random(TensorXf, (32, 32)) # Random tensor

  (PR `#417 `__).

- **Array Resampling and Convolution**: Added :py:func:`dr.resample() ` for
  changing the resolution of arrays/tensors along specified axes, and
  :py:func:`dr.convolve() ` for convolution with continuous kernels. Both
  operations are fully differentiable and support various reconstruction
  filters (box, linear, cubic, lanczos, gaussian).

  Example:

  .. code-block:: python

     import drjit as dr

     # Resample a 2D signal to different resolution
     data = dr.cuda.TensorXf(original_data)  # Shape: (128, 128)

     upsampled = dr.resample(
         data,
         shape=(256, 256),  # Target resolution
         filter='lanczos'   # High-quality filter
     )

     # Apply Gaussian blur via convolution
     blurred = dr.convolve(
         data,
         filter='gaussian',
         radius=2.0
     )

  (PRs `#358 `__, `#378 `__).

- **Gradient-Based Optimizers**: Added an optimization framework that includes
  various standard optimizers inspired by PyTorch. It includes
  :py:class:`dr.opt.SGD ` with optional momentum and Nesterov acceleration,
  :py:class:`dr.opt.Adam ` with adaptive learning rates, and
  :py:class:`dr.opt.RMSProp `. The optimizers own the parameters and
  automatically handle mixed-precision training. An optional helper class
  :py:class:`dr.opt.GradScalar ` implements adaptive gradient scaling for
  low-precision training.

  .. code-block:: python

     import drjit as dr
     from drjit.opt import Adam
     from drjit.cuda.ad import Float  # AD-enabled type, needed by dr.backward()

     # Create optimizer and register parameters
     opt = Adam(lr=1e-3)
     rng = dr.rng(seed=0)
     opt['params'] = Float(rng.normal(Float, 100))

     # Optimization loop for unknown function f(x)
     for i in range(1000):
         # Fetch current parameters
         params = opt['params']

         # Compute loss and gradients
         loss = f(params)  # Some function to optimize
         dr.backward(loss)

         # Update parameters
         opt.step()

  (PRs `#345 `__, `#402 `__, commit `e3f576 `__).

- **TensorFlow Interoperability**: Added TensorFlow interop via
  :py:func:`@dr.wrap `, supporting forward and backward automatic
  differentiation with comprehensive support for variables and tensors.
  (PR `#301 `__, contributed by `Jakob Hoydis `__).

**Array and Tensor Operations**

- Added :py:func:`dr.concat() ` to concatenate arrays/tensors along a specified
  axis following the Array API standard. (PR `#354 `__).

- Added :py:func:`dr.take() ` and :py:func:`dr.take_interp() ` for efficient
  tensor indexing and interpolated indexing along specified axes.
  (PR `#420 `__, commit `b59436 `__).

- Added :py:func:`dr.moveaxis() ` for rearranging tensor dimensions, providing
  NumPy-compatible axis movement. (commit `4d1478 `__).

- Implemented comprehensive slice operations for regular (non-tensor) arrays,
  supporting advanced patterns like nested slices and integer array indexing.
  (PR `#365 `__).

- Conversion between tensors and nested arrays (e.g. ``Array3f``) now offers an
  option (``flip_axis=True``) to flip the axis order (e.g., ``Nx3`` vs
  ``3xN``). (PR `#348 `__).

**Performance Improvements**

- Packet scatter-add operations now map to specialized GPU operations when
  supported by the hardware and driver. This change also broadens the
  situations where packet operations can be used on the CPU and GPU. Packets of
  size 6 were not supported in the past since their size was not a power of
  two. Now, they are treated as 3 separate size-2 packets.
This feature is particularly helpful in combination with the new hash grid class, whose reverse-mode derivative generates atomic packet scatter-additions. (Dr.Jit-Core PR `#151 `__, Dr.Jit PR `#406 `__). - Enabled packet memory operations for texture access, providing speedups when accessing multi-channel textures on the LLVM and CUDA backends. (PR `#329 `__). - Optimized :py:func:`dr.rsqrt() ` to compile to faster instruction sequences on the LLVM backend using ``VRSQRTPS`` with Newton-Raphson iteration on Intel processors and similar optimizations for ARM Neon. (Dr.Jit PR `#343 `__, Dr.Jit-Core PR `#125 `__). - Made :py:func:`dr.any() `, :py:func:`dr.all() `, and :py:func:`dr.none() ` asynchronous with respect to the host, improving GPU utilization. (Dr.Jit PR `#344 `__, Dr.Jit-Core PR `#126 `__). **Random Number Generation (contd.)** - Added PCG32 reverse generation capabilities with ``prev_*`` methods for all random number generation functions for bidirectional traversal of random sequences. (PR `#398 `__). - Added PCG32 methods for generating normally distributed variates: :py:func:`PCG32.next_float_normal() `, :py:func:`PCG32.next_float32_normal() `, and :py:func:`PCG32.next_float64_normal() `. (PR `#353 `__). - Added :py:func:`dr.mul_wide() ` and :py:func:`dr.mul_hi() ` for wide integer multiplication, essential for implementing the Philox PRNG. (Dr.Jit PR `#414 `__, Dr.Jit-Core PR `#156 `__). **API Improvements** - Refined semantics of :py:func:`dr.forward_from() ` and :py:func:`dr.backward_from() ` to preserve existing gradients instead of unconditionally overriding them. (Dr.Jit PR `#351 `__). - Added utility functions :py:func:`dr.zeros_like() `, :py:func:`dr.ones_like() `, and :py:func:`dr.empty_like() `. (PR `#345 `__). - Added :py:meth:`dr.ArrayBase.item() ` method for extracting scalar values from single-element arrays/tensors, similar to NumPy/PyTorch. (commit `a142bc `__). 
- Added :py:func:`dr.linear_to_srgb() ` and :py:func:`dr.srgb_to_linear() ` for color space conversions. (commit `a7f138 `__). - Added :py:attr:`JitFlag.ForbidSynchronization` to catch costly synchronization operations during development. ( Dr.Jit PR `#350 `__, Dr.Jit-Core PR `#128 `__). - Added C++ bindings for thread-local memory arrays through the ``dr::Local`` template, complementing the existing Python functionality. This enables efficient scratch space and stack-like data structures in GPU kernels from C++ code. (commit `c30ade `__). **Notable Bugfixes** - Fixed ``dr::block_reduce()`` derivative computation for arrays not evenly divisible by block size. (commit `df79ed `__). - Fixed potential performance cliffs in :py:func:`dr.gather() ` by memoizing expressions and limiting expression growth (Dr.Jit-Core PR `#159 `__). - Fixed :py:func:`dr.rotate() ` quaternion component ordering to match C++ implementation. (PR `#416 `__). - Fixed the derivative of :py:func:`dr.unit_angle() ` at signed zero. (commit `9d09a9 `__). - Fixed memory leak in Python bindings using dedicated cleanup thread. (PR `#399 `__). - Preserve tensor shapes in symbolic operations. (commit `74c4d0 `__). - Fixed evaluated loop derivative issues with unchanged differentiable state variables. (commit `074cfe `__). - Fixed symbolic loop backward derivative compilation for simple loops. (commit `01ef10 `__). - Fixed broadcasting of tensors and handling of unknown objects in :py:func:`dr.select()