Changelog

DrJit 1.4.0 (June 25, 2026)

Major New Features

  • Metal Backend: Dr.Jit can now target Apple Silicon GPUs through a new Metal backend. It supports the full range of Dr.Jit features including symbolic control flow, automatic differentiation, hardware-accelerated ray tracing and textures, cooperative vectors, and reductions. (contributed by Sébastien Speierer and Wenzel Jakob).

  • Matrix Multiplication for Tensors: The @ operator and dr.matmul() now support tensors of any size and shape, fully replicating NumPy / PyTorch semantics including batched matrix products, broadcasting, matrix-vector products, and inner products. The operation is fully differentiable in both forward and reverse modes. Under the hood, this dispatches to efficient block-tiled GEMM kernels shipped with Dr.Jit-Core. (Dr.Jit commit 183dc4, Dr.Jit-Core PR #188, Dr.Jit-Core commits 0cca8d, 432ed4, 444c8d, 4b8864, 9e5335).
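
    The shape rules behind "NumPy / PyTorch semantics" can be summarized in a few lines. The following is a plain-Python sketch of the NumPy matmul shape rule only; it does not call Dr.Jit, and the helper names broadcast_shapes and matmul_shape are made up for this illustration.

```python
from itertools import zip_longest


def broadcast_shapes(a, b):
    """Broadcast two batch shapes using NumPy's right-aligned rule."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {a} with {b}")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def matmul_shape(a, b):
    """Result shape of `a @ b` for operand shapes `a` and `b` (NumPy rules)."""
    if not a or not b:
        raise ValueError("matmul does not accept 0-dimensional operands")
    a_vec, b_vec = len(a) == 1, len(b) == 1
    a2 = (1,) + tuple(a) if a_vec else tuple(a)   # vector -> row matrix
    b2 = tuple(b) + (1,) if b_vec else tuple(b)   # vector -> column matrix
    if a2[-1] != b2[-2]:
        raise ValueError(f"inner dimensions differ: {a2[-1]} vs {b2[-2]}")
    batch = broadcast_shapes(a2[:-2], b2[:-2])    # leading dims broadcast
    # Dimensions added for vector operands are dropped from the result
    core = (() if a_vec else (a2[-2],)) + (() if b_vec else (b2[-1],))
    return batch + core


print(matmul_shape((3, 4), (4, 5)))            # plain matrix product
print(matmul_shape((2, 1, 3, 4), (5, 4, 6)))   # batched with broadcasting
print(matmul_shape((3, 4), (4,)))              # matrix-vector product
print(matmul_shape((4,), (4,)))                # inner product
```

    The four calls correspond to the cases named above: a plain product, a batched product with broadcasting, a matrix-vector product, and an inner product (empty result shape).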

  • Generalized convolution and resampling: The function dr.convolve() now handles discrete filter kernels besides continuous ones, making it a Dr.Jit substitute for numpy.convolve(). A new boundary parameter generalizes edge handling ("zero", "nearest", "wrap", "reflect", or "mirror"). A normalize flag toggles the renormalization of filter weights. The efficiency of both the forward pass and reverse-mode derivative was improved via a fast path for non-boundary outputs, and by switching to a fast transpose convolution instead of atomic scatters whenever possible. The new boundary argument is also available on dr.resample().
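
    To make the boundary modes concrete, here is a plain-Python sketch of a discrete "same"-size convolution. It does not call dr.convolve(). The meaning given to each mode (in particular "reflect" repeating the edge sample and "mirror" not repeating it) follows the scipy.ndimage convention and is an assumption; the Dr.Jit documentation is the authority on the actual definitions.

```python
def boundary_index(i, n, mode):
    """Map index i onto [0, n); returns None when the sample counts as zero."""
    if 0 <= i < n:
        return i
    if mode == "zero":
        return None
    if mode == "nearest":                 # a a | a b c d | d d
        return min(max(i, 0), n - 1)
    if mode == "wrap":                    # c d | a b c d | a b
        return i % n
    if mode == "reflect":                 # b a | a b c d | d c (edge repeated)
        j = i % (2 * n)
        return j if j < n else 2 * n - 1 - j
    if mode == "mirror":                  # c b | a b c d | c b (edge not repeated)
        if n == 1:
            return 0
        j = i % (2 * n - 2)
        return j if j < n else 2 * n - 2 - j
    raise ValueError(f"unknown boundary mode: {mode}")


def convolve_same(signal, kernel, mode="zero"):
    """Discrete convolution whose output has the length of `signal`.

    Assumes an odd-length kernel centered on the output sample.
    """
    n, radius = len(signal), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = boundary_index(i + radius - k, n, mode)
            if j is not None:
                acc += weight * signal[j]
        out.append(acc)
    return out


for mode in ("zero", "nearest", "wrap", "reflect", "mirror"):
    print(mode, convolve_same([1, 2, 3, 4], [1, 1, 1], mode))
```

    Only the first and last outputs differ between modes, since interior samples never touch the boundary. This is the case the fast path for non-boundary outputs targets.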

  • Transpose: Added dr.ArrayBase.T and dr.ArrayBase.mT, matching PyTorch’s semantics. (PR #486).

  • Muon Optimizer: Added dr.opt.Muon (“MomentUm Orthogonalized by Newton-Schulz”), an optimizer for 2D hidden weights of neural networks. (commit d205c1).
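
    The "orthogonalized by Newton-Schulz" part of the name can be illustrated with a small plain-Python sketch. It uses the classical cubic Newton-Schulz iteration X ← 1.5·X − 0.5·X·Xᵀ·X, which is not necessarily the polynomial or iteration count used by dr.opt.Muon. The helper names are made up for this example.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=20):
    """Approximate the orthogonal polar factor of a full-rank matrix g."""
    # Scale by the Frobenius norm so that all singular values are <= 1,
    # which places them inside the iteration's convergence region.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Each step pushes every singular value of x closer to 1
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x


q = newton_schulz([[3.0, 1.0], [0.0, 2.0]])
qqt = matmul(q, transpose(q))   # close to the identity matrix
print(qqt)
```

    The iteration keeps the singular vectors of the input while driving all singular values towards 1, so the result is approximately orthogonal.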

  • Redesign of the drjit.nn API: Besides cooperative vectors, the drjit.nn API now also accepts regular tensors as inputs. Cooperative vectors fuse with surrounding computation, while tensor evaluation enables batched evaluation of large networks. See the neural network documentation for details on both modes.

    Previously, it was necessary to extract the packed buffer copy and manually cast it between the working and optimizer precision:

    weights, net = nn.pack(net, layout='training')
    opt = Adam(lr=1e-3, params={'weights': Float32(weights)})
    
    for i in range(n):
        weights[:] = Float16(opt['weights'])
        ...
    

    The new API exposes a cleaner interface that automates all of these steps:

    net = nn.pack(net, layout='training')
    opt = Adam(lr=1e-3)
    opt.update(net)
    
    for i in range(n):
        net.update(opt)
        ...
    

    nn.Module subclasses implement a MutableMapping keyed by the path to each parameter (e.g. 'layers.0.weights'). opt.update(net) pulls the parameters into the optimizer, while net.update(opt) pushes the updated state back. The nn.pack() function is now differentiable. This enables the use of Cooperative Vectors with matrix-level optimizers like Muon.
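
    The pull/push pattern relies only on the standard MutableMapping protocol, whose update() method copies entries key by key. The sketch below shows the mechanism with toy classes in plain Python. ToyModule and ToySGD are hypothetical stand-ins, not Dr.Jit classes, and real modules and optimizers additionally handle precision casts and gradients.

```python
from collections.abc import MutableMapping


class ToyModule(MutableMapping):
    """Hypothetical stand-in: exposes parameters keyed by their path."""

    def __init__(self, params):
        self._params = dict(params)

    def __getitem__(self, key):
        return self._params[key]

    def __setitem__(self, key, value):
        self._params[key] = value

    def __delitem__(self, key):
        del self._params[key]

    def __iter__(self):
        return iter(self._params)

    def __len__(self):
        return len(self._params)


class ToySGD(ToyModule):
    """Hypothetical optimizer that owns its own copy of the parameters."""

    def __init__(self, lr):
        super().__init__({})
        self.lr = lr

    def step(self, grads):
        for key, grad in grads.items():
            self[key] = self[key] - self.lr * grad


net = ToyModule({"layers.0.weights": 1.0, "layers.0.bias": 0.5})
opt = ToySGD(lr=0.1)

opt.update(net)   # pull the parameters into the optimizer
opt.step({"layers.0.weights": 2.0, "layers.0.bias": -1.0})
net.update(opt)   # push the updated state back into the module

print(dict(net))
```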

  • Reverse-mode differentiation of symbolic loops: @dr.syntax while loops and symbolic dr.while_loop() calls are now differentiable in reverse mode via trajectory replay. See the documentation for details.

  • NumPy-style advanced tensor indexing: Tensor indexing with multiple integer arrays now follows NumPy/PyTorch semantics. Previously, t[arr1, arr2] selected all combinations (a grid); it now selects element-wise pairs. Non-consecutive array indices (e.g., t[arr, :, arr]) correctly broadcast and move the result dimension to the front of the output shape.
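
    The difference between the old grid behavior and the new pairwise behavior is easiest to see on a small example. The sketch below emulates both with nested Python lists instead of Dr.Jit tensors, so it only illustrates the indexing rule.

```python
# A 3x4 "tensor" whose entries encode their own position: t[r][c] == 10*r + c
t = [[10 * r + c for c in range(4)] for r in range(3)]
rows, cols = [0, 2], [1, 3]

# New behavior of t[rows, cols]: element-wise pairs, result shape (2,)
pairs = [t[r][c] for r, c in zip(rows, cols)]

# Old behavior: all combinations of rows and cols, result shape (2, 2)
grid = [[t[r][c] for c in cols] for r in rows]

# Non-consecutive index arrays, as in t3[a, :, b] for t3 of shape (2, 3, 2):
# the paired dimension comes first, followed by the sliced one -> shape (2, 3)
t3 = [[[100 * i + 10 * j + k for k in range(2)] for j in range(3)]
      for i in range(2)]
a, b = [0, 1], [1, 0]
front = [[t3[i][j][k] for j in range(3)] for i, k in zip(a, b)]

print(pairs)
print(grid)
print(front)
```

    Code that relied on the old grid semantics needs to be updated, since the same expression now yields a result of a different shape.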

  • NumPy-style array/tensor manipulation and sorting: This release brings a large set of NumPy-compatible functions for sorting, reshaping, and reindexing arrays and tensors. This includes dr.sort(), dr.argsort(), dr.argmin() and dr.argmax() which are backed by an efficient GPU-accelerated multi-bit radix sort. Other new shape manipulation functions include dr.expand_dims(), dr.squeeze(), dr.transpose(), and dr.swapaxes(). (PR #496).
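
    For readers unfamiliar with the term, a multi-bit radix sort processes several key bits per pass using a histogram, a prefix sum, and a stable scatter. The following sequential plain-Python sketch shows the idea for an argsort. It says nothing about how the GPU implementation in Dr.Jit-Core is organized, and the parameter names are made up.

```python
def radix_argsort(keys, bits_per_pass=4, key_bits=32):
    """Stable argsort of non-negative integers below 2**key_bits (LSD radix sort)."""
    order = list(range(len(keys)))
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, key_bits, bits_per_pass):
        # 1. Histogram of the current digit
        counts = [0] * (mask + 1)
        for i in order:
            counts[(keys[i] >> shift) & mask] += 1
        # 2. Exclusive prefix sum: first output slot of each bucket
        offsets, total = [], 0
        for c in counts:
            offsets.append(total)
            total += c
        # 3. Stable scatter into the buckets
        out = [0] * len(order)
        for i in order:
            digit = (keys[i] >> shift) & mask
            out[offsets[digit]] = i
            offsets[digit] += 1
        order = out
    return order


keys = [170, 45, 75, 90, 802, 24, 2, 66, 45]
order = radix_argsort(keys)
print(order)
print([keys[i] for i in order])
```

    Processing 4 bits per pass needs 8 passes for 32-bit keys instead of the 32 passes of a one-bit-at-a-time variant, which is the motivation for using multi-bit digits.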

  • NumPy-consistent reductions: The horizontal reductions (dr.sum(), dr.prod(), dr.min(), dr.max(), dr.mean(), dr.all(), dr.any(), dr.none(), dr.count(), dr.reduce(), dr.norm(), dr.squared_norm()) now mirror NumPy more closely by accepting a keepdims flag, with full tensor support. dr.norm() and dr.squared_norm() additionally gain the axis and mode parameters shared by the rest of the family. Finally, this release adds NumPy-compatible dr.var() and dr.std() functions. (PR #493).
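
    The practical benefit of keepdims is that the reduced result still broadcasts against the input. The sketch below demonstrates the convention on nested Python lists. It mirrors NumPy's behavior and does not call Dr.Jit; sum_axis is a made-up helper.

```python
def sum_axis(m, axis, keepdims=False):
    """Sum a 2D nested list along `axis`, optionally keeping a size-1 axis."""
    if axis == 0:
        out = [sum(col) for col in zip(*m)]
        return [out] if keepdims else out                 # (1, cols) or (cols,)
    if axis == 1:
        out = [sum(row) for row in m]
        return [[v] for v in out] if keepdims else out    # (rows, 1) or (rows,)
    raise ValueError("axis must be 0 or 1")


m = [[1, 2, 3], [4, 5, 6]]                                # shape (2, 3)

row_sums = sum_axis(m, axis=1, keepdims=True)             # shape (2, 1)
# The (2, 1) result lines up with the rows of m, so normalizing each
# row is a direct element-wise division.
normalized = [[v / s[0] for v in row] for row, s in zip(m, row_sums)]

print(row_sums)
print(sum_axis(m, axis=0))
print(normalized)
```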

  • Test assertions: Added dr.assert_allclose(), an assertion utility for correctness checks in test cases that complements dr.allclose(). (PR #489).

Performance Improvements

  • Tracing and evaluation: A comprehensive optimization pass targeted Dr.Jit’s tracing/code generation phases and Python bindings, making them roughly twice as fast. This will help workloads bottlenecked on tracing/Python-related overheads. (Dr.Jit commits 534829, 3fba39, 6b212c, 50986a, Dr.Jit-Core PR #194).

  • Frozen function replay: The @dr.freeze replay path was thoroughly optimized, accelerating it by up to ~2.5x. (Dr.Jit commits ff09ee, c1282c, 13fe80).

  • Faster function calls: Dr.Jit now generates much better code for indirect function calls in kernels (e.g., method calls on arrays of object instances, dr.switch(), and dr.dispatch()). The per-instance data of all callables is now merged into a single per-kernel buffer and fetched using vectorized packet loads, rather than being scattered across many small buffers and read element by element. On the LLVM backend, call inputs and outputs are additionally passed in registers rather than stack scratch space, which reduces memory traffic and improves performance. Dr.Jit also uses more efficient data structures to collect call data, which speeds up the compilation of kernels that dispatch to a large number of instances. (Dr.Jit-Core commits 1ed505, bc6d9c, 69120f, 83207d).

  • LLVM code generation: Load/store aliasing metadata was improved so that non-conflicting memory operations within a kernel can be freely reordered or hoisted out of loops, which improves performance of kernels on the LLVM backend. (Dr.Jit-Core commit 84c85b).

  • Warp-reduction for packet scatter-reduce: On the CUDA and Metal backends, dr.scatter_reduce() now provides a packet-aware reduction path that jointly reduces values within the warp/simdgroup before issuing scalar or packet atomics depending on hardware/driver support. (Dr.Jit-Core PR #190).
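
    The effect of reducing within a warp before issuing atomics can be simulated sequentially. The sketch below is plain Python, not backend code: it counts how many atomic updates reach memory with and without pre-combining values that target the same index. Grouping by index within fixed blocks of 32 lanes is an assumption made for illustration.

```python
def scatter_add_naive(target, indices, values):
    """One atomic update per lane. Returns the number of atomics issued."""
    atomics = 0
    for i, v in zip(indices, values):
        target[i] += v
        atomics += 1
    return atomics


def scatter_add_warp_reduced(target, indices, values, warp_size=32):
    """Combine lanes of a warp that hit the same index, then update once."""
    atomics = 0
    for start in range(0, len(indices), warp_size):
        partial = {}
        lanes = zip(indices[start:start + warp_size],
                    values[start:start + warp_size])
        for i, v in lanes:
            partial[i] = partial.get(i, 0) + v    # in-warp reduction
        for i, v in partial.items():
            target[i] += v                        # one atomic per distinct index
            atomics += 1
    return atomics


indices = [k % 4 for k in range(64)]   # 64 lanes contending for 4 slots
values = [1] * 64

a, b = [0] * 4, [0] * 4
n_naive = scatter_add_naive(a, indices, values)
n_reduced = scatter_add_warp_reduced(b, indices, values)
print(a, n_naive)
print(b, n_reduced)
```

    Both variants produce the same result, but the pre-reduced version issues far fewer atomics when many lanes target the same locations.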

  • nanobind optimizations: Dr.Jit benefits from optimizations introduced in nanobind v2.13. This release adds instance pooling, which provides a cache to recycle short-lived objects. Dr.Jit opts into this feature to accelerate tracing, which generates large amounts of temporaries. Other optimizations target object creation/destruction and nd-array exchange. (Dr.Jit commit 6b212c, nanobind PRs #1366, #1374, #1375).

  • nanothread optimizations: The thread pool driving parallel evaluation was improved:

    • Faster worker wake-up: idle worker threads busy-poll for a short while and then go to sleep to avoid wasting power. The new version of nanothread is more careful to wake only the required number of threads, and it does so using efficient OS primitives, such as futex on Linux (commits 73efa1, 366774).

    • Worker count: the main thread now “counts” as a member of the thread pool, since it pitches in when waiting for work. On Apple silicon, workers now only run on “performance cores”, as parallelization over “efficiency cores” tends to add tail latency that slows down parallel workloads. (commits 03cacd, 348404, e68a4d, 098925, beca8c).

    • Fixed timing glitches: timing information reported by dr.kernel_history() would occasionally report nonsensical values close to 2^64 due to a race condition that is now fixed. (commit f11692).

Minor Features

  • CUDA Green Context API: Added drjit.cuda.green_context, a context manager that isolates kernels to a subset of the GPU’s streaming multiprocessors. See the green context documentation for details. (Dr.Jit commit 6c69ec, Dr.Jit-Core commit d4f1a6).

  • Command queue flushing: The new dr.flush_thread() function flushes queued work to the GPU, which is needed for multi-threaded use of Dr.Jit on the Metal backend. (Dr.Jit commit c68e00, Dr.Jit-Core commit 467dd3).

Bug Fixes

  • Fixed a bug in dr.rng().integers() where a symbolic loop was misused, producing invalid LLVM IR. (commit f7054e).

  • Fixed a variable shadowing bug in _flatten/_unflatten that caused crashes when flattening PyTrees containing custom DRJIT_STRUCT types. (PR #482).

  • Fixed a bug in nn.SinEncode where the per-octave phase offset did not match the documented formula. Code using shift=0 is unaffected.

  • Fixed incorrect type names in dr.graphviz_ad(). (commit 0c685e).

  • Fixed minor memory leaks due to recorded/frozen kernels. (Dr.Jit-Core commit f0bf64).

  • Fixed memory leaks related to kernel histories. (Dr.Jit-Core commit 318e55).

  • Renamed the conflicting KernelRecordingMode.None enumerator to Inactive to avoid the collision with Python’s None. (Dr.Jit-Core PR #186, Dr.Jit PR #481).

  • Fixed several issues involving symbolic loops with aliased state variables. (Dr.Jit PRs #505, #510, Dr.Jit-Core PR #198, contributed by Lovro Nuic).

  • Fixed half-precision Min/Max reductions and the half-precision infinity constant. (Dr.Jit-Core PR #199).

  • Various smaller backend fixes: a missing mask predicate in the CUDA packet scatter_reduce path, a crash in Metal cooperative-vector matrix-vector products with unsupported output dimensions, a race condition under multi-threaded Metal use, incorrect fast-math flag handling on Sqrt and Div nodes, and more robust handling of failed jit_eval() calls. (Dr.Jit-Core PRs #191, #200, #196, #192, commits 368c53, 37bbce, Dr.Jit PR #503).

Other Improvements

  • Improved documentation and error messages when the Dr.Jit binary fails to load. (PR #485).

  • Various improvements to Dr.Jit’s static type annotations: added missing stubs for dr.mean(), added type hints for PrefixRedOp, and minor stub pattern replacement rule fixes. (PRs #478, #480, #483).

  • Release the GIL while waiting for kernel history: Retrieving timing data via dr.kernel_history() now releases the GIL while waiting for the asynchronous results to arrive, allowing other Python threads to make progress in the meantime. (commits 766e1e, f90bfd).

  • ndarray Cleanup: ndarray reclamation previously always went through an asynchronous cleanup thread. This detour is now skipped for CUDA and Metal arrays when the calling thread already holds the GIL. (commit c01a23).

API Breaks and Device Compatibility

  • ⚠️ nn.pack() and nn.unpack() no longer return the shared buffer as the first element of the result tuple. They now return only the packed/unpacked PyTree with matrix views in place of the input tensors. The underlying buffer remains available via the MatrixView.buffer attribute, or, for a packed nn.Module, via the 'weights' entry of the module’s mapping interface (i.e. net['weights']).

    Migration:

    # Before
    buffer, A_view, b_view = nn.pack(A, b, layout='training')
    dr.enable_grad(buffer)
    
    # After
    A_view, b_view = nn.pack(A, b, layout='training')
    dr.enable_grad(A_view.buffer)
    

    For a packed nn.Module:

    # Before
    buffer, net = nn.pack(net, layout='training')
    dr.enable_grad(buffer)
    
    # After
    net = nn.pack(net, layout='training')
    dr.enable_grad(net['weights'])
    
  • ⚠️ Removed TensorFlow support. TensorFlow appears largely unmaintained: more than a year after the launch of NVIDIA’s Blackwell GPU generation, the official TensorFlow packages still do not support it. This is a maintenance burden because our CI infrastructure uses this GPU. Consequently, we decided to drop TensorFlow support (.tf() conversion, support in @dr.wrap).

  • ⚠️ Removed Kahan-compensated atomic scatter. The drjit.scatter_add_kahan() operation was removed. See commit f6b4be for the rationale.

  • ⚠️ Compute capability. Dr.Jit-Core’s CUDA backend now requires compute capability 7.5 or higher (Turing and later) and NVIDIA driver R535 or newer. (Dr.Jit-Core PR #188).

DrJit 1.3.1 (February 23, 2026)

Bug Fixes

  • Fixed LLVM library search paths to include aarch64 and WSL-specific directories. This resolves failures to locate LLVM on ARM Linux systems and Windows Subsystem for Linux. (Dr.Jit-Core PR #185).

  • Fixed ordering of CUDA forward declarations of callables, resolving cases where a forward declaration could appear after the actual function definition. (Dr.Jit-Core commit 213983).

DrJit 1.3.0 (February 16, 2026)

New Features

  • Atomic Scatter Operations: Added dr.scatter_cas() (atomic compare-and-swap) and dr.scatter_exch() (atomic exchange) operations. On the CUDA backend, these map to native PTX instructions; the LLVM implementation uses a loop over the vectorization width. (Dr.Jit PR #450, Dr.Jit-Core PR #177).

  • AdamW Optimizer: Added the dr.opt.AdamW optimizer with built-in weight decay, equivalent to PyTorch’s implementation. (PR #449).

  • AMSGrad for Adam/AdamW: The dr.opt.Adam and dr.opt.AdamW optimizers now support an optional amsgrad parameter. AMSGrad keeps a running maximum of the second moments, which can help improve stability near local minima. (PR #467).
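
    The two ideas in the AdamW and AMSGrad entries (decoupled weight decay and a running maximum of the second moment) fit into a short scalar update rule. The sketch below is plain Python following the commonly published formulation. It is not the dr.opt source, and the function and state names are made up.

```python
import math


def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0, "v_max": 0.0}


def adamw_step(p, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
               weight_decay=0.0, amsgrad=False):
    """One AdamW update of the scalar parameter p with gradient g."""
    state["t"] += 1
    t = state["t"]
    p *= 1.0 - lr * weight_decay                  # decoupled weight decay
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    v = state["v"]
    if amsgrad:
        # AMSGrad: never let the second-moment estimate shrink
        state["v_max"] = max(state["v_max"], v)
        v = state["v_max"]
    m_hat = state["m"] / (1 - beta1 ** t)         # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)


# A gradient spike followed by small gradients
grads = [10.0, 0.1, 0.1, 0.1]
state = new_state()
p = 1.0
history = []
for g in grads:
    p = adamw_step(p, g, state, amsgrad=True)
    history.append(p)

print(history)
print(state["v"], state["v_max"])
```

    After the spike, the plain second moment v decays while v_max retains the peak, which keeps subsequent steps conservative.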

  • Functions in IR: Added dr.func(), a function decorator that forces a Python function to also become a callable in the generated IR. This can improve compilation times: without it, Dr.Jit emits the function body’s IR every time it is called within a single kernel. With @dr.func, each call resolves to a function call in the IR, emitting the body only once. (Dr.Jit PR #473, Dr.Jit-Core PR #183).

  • Oklab Color Space Conversion: Added dr.linear_srgb_to_oklab() and dr.oklab_to_linear_srgb() for perceptually uniform color space conversion. (PR #453).
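
    For reference, the Oklab transform itself is compact. The scalar plain-Python sketch below uses the coefficients published by Björn Ottosson for linear sRGB input. It is an independent illustration rather than the Dr.Jit implementation, and the function names with a _ref suffix are made up to avoid confusion with the Dr.Jit functions.

```python
import math


def _cbrt(v):
    """Real cube root that also accepts negative inputs."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def srgb_to_oklab_ref(r, g, b):
    """Linear sRGB -> Oklab (L, a, b)."""
    # Linear map to an LMS-like cone response
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    # Cube-root non-linearity, then a second linear map
    l_, m_, s_ = _cbrt(l), _cbrt(m), _cbrt(s)
    return (
        0.2104542553 * l_ + 0.7936177850 * m_ - 0.0040720468 * s_,
        1.9779984951 * l_ - 2.4285922050 * m_ + 0.4505937099 * s_,
        0.0259040371 * l_ + 0.7827717662 * m_ - 0.8086757660 * s_,
    )


def oklab_to_srgb_ref(L, a, b):
    """Oklab (L, a, b) -> linear sRGB."""
    l_ = L + 0.3963377774 * a + 0.2158037573 * b
    m_ = L - 0.1055613458 * a - 0.0638541728 * b
    s_ = L - 0.0894841775 * a - 1.2914855480 * b
    l, m, s = l_ ** 3, m_ ** 3, s_ ** 3
    return (
        +4.0767416621 * l - 3.3077115913 * m + 0.2309699292 * s,
        -1.2684380046 * l + 2.6097574011 * m - 0.3413193965 * s,
        -0.0041960863 * l - 0.7034186147 * m + 1.7076147010 * s,
    )


white = srgb_to_oklab_ref(1.0, 1.0, 1.0)
roundtrip = oklab_to_srgb_ref(*srgb_to_oklab_ref(0.2, 0.5, 0.8))
print(white)
print(roundtrip)
```

    White maps to a lightness close to 1 with near-zero chroma components, and converting back and forth recovers the input up to the rounding of the published coefficients.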

  • Pickling Support: Dr.Jit arrays can now be natively pickled and unpickled via Python’s pickle module. (PR #448).

  • Bounded Integer RNG: Added dr.rng().integers() to generate uniformly distributed integers on a given interval. (commit cb09ca).

  • Symbolic RNG mode: dr.rng() now accepts a symbolic argument for a purely symbolic sampler. (commit 51bacb).

  • ArrayX Initialization from Tensors: Nested array types with multiple dynamic dimensions (like ArrayXf) can now be initialized from Dr.Jit tensors or NumPy arrays. (commit e7e133).

  • Type Trait: Added dr.replace_shape_t() convenience type trait for writing generic functions that need to reshape array types. (commit 46b245).

Hardware/platform-specific features

  • NVIDIA Blackwell (SM120+): Added support for wide packet loads, gathers, and atomics on NVIDIA Blackwell GPUs (SM120+). (commit 879c10).

  • Python 3.14 Compatibility: Fixed compatibility with PEP 649 deferred annotation evaluation, ensuring Dr.Jit works correctly on Python 3.14. (commit 7fa6eb).

  • Linux ARM Wheels: Added ubuntu-24.04-arm to the wheels pipeline. (PR #461, contributed by Merlin Nimier-David).

Performance Improvements

  • Simplified Single-Target Virtual Calls: When a virtual function call has only a single target (as is the case for @dr.func), the JIT backend now eliminates the indirection/dispatch loop and calls the function directly, producing simpler IR. (Dr.Jit-Core PR #183).

  • AD Early Exit for Zero Derivatives: The AD graph traversal now skips edges with zero-valued derivatives, avoiding unnecessary computation. (commit 06b0a9).

  • GIL Release in __getitem__: dr.ArrayBase.__getitem__() now releases the GIL while waiting, improving multi-threaded performance. (commit c24be7).

Bug Fixes

  • Fixed a bug where constructing a cooperative vector inside a dr.suspend_grad() scope could raise an exception. (PR #475, contributed by Christian Döring).

  • Fixed a crash when calling a frozen function with a re-seeded random number generator whose seed was a Python integer type. (PR #471, contributed by Christian Döring).

  • Fixed a bug in the C++ transform_compose() function where the translation was placed in the last row of the matrix rather than the last column. (PR #451, contributed by Delio Vicini).
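
    The row-versus-column distinction matters because, with column vectors, only the last column of a 4×4 homogeneous matrix acts as a translation. The plain-Python sketch below demonstrates this. It is a generic illustration under the column-vector convention and does not reproduce transform_compose(); the helper names are made up.

```python
def translation(tx, ty, tz):
    """4x4 homogeneous matrix with the translation in the last column."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def apply(m, p):
    """Transform the 3D point p as a column vector (x, y, z, 1)."""
    v = (*p, 1.0)
    out = [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
    return tuple(x / out[3] for x in out[:3])   # homogeneous division


correct = translation(1.0, 2.0, 3.0)
# The buggy placement corresponds to the transposed matrix
misplaced = [list(col) for col in zip(*correct)]

point = (10.0, 20.0, 30.0)
print(apply(correct, point))     # the point is shifted by (1, 2, 3)
print(apply(misplaced, point))   # the point is rescaled instead of shifted
```

    With the translation in the last row, the offsets leak into the homogeneous coordinate w and the point is scaled rather than translated.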

  • Fixed multiple issues in the Dr.Jit-Core gather re-indexing logic: the mask stack is now correctly applied during re-indexing, and nested gather masks are combined rather than overwritten. (Dr.Jit-Core PR #178).

  • Fixed a bug in virtual call analysis when a target contained a symbolic loop — the analysis now accounts for eliminated/optimized-out loop state variables. (Dr.Jit-Core PR #184).

  • Fixed LLVM backend compilation of wavefront loops with scalar masks. (commit 16a81d).

  • Fixed lost tensor shapes when a loop or conditional is replayed for AD passes, with more robust inference of tensor output shapes. (commit 9d201f).

  • Fixed a regression in ArrayX initialization from tensors and NumPy ndarrays (wrong shape hint order for flipped axes and broken shift loop). (commit df4cf4).

  • Fixed Texture::eval_fetch_cuda to handle double-precision queries gracefully by casting to single-precision when a HW-accelerated texture is requested. (commits 83083d, 054d11).

  • Fixed symbolic loop size computation to also account for side-effect sizes. (Dr.Jit-Core commit c6dfc8).

  • Fixed spurious warning when freezing functions with very wide literals. (PR #455).

Other Improvements

  • Updated to nanobind v2.10.2.

  • Improved documentation and log messages for textures, including clarifications regarding numerical precision and extra diagnostics for migrated textures. (commit 4edae0).

DrJit 1.2.0 (September 17, 2025)

New Features

  • Event API: Added an event API for fine-grained timing and synchronization of GPU kernels. This enables more detailed performance profiling and better control over asynchronous operations. (Dr.Jit PR #441, Dr.Jit-Core PR #174).

  • OpenGL Interoperability: Improved CUDA-OpenGL interoperability with simplified APIs. This enables efficient sharing of data between CUDA kernels and OpenGL rendering. (Dr.Jit PR #429, Dr.Jit-Core PR #164, contributed by Merlin Nimier-David).

  • Enhanced Int8/UInt8 Support: Improved support for 8-bit integer types with better casting and bitcast operations. (Dr.Jit PR #428, Dr.Jit-Core PR #163, contributed by Merlin Nimier-David).

Performance Improvements

  • Register Spilling to Shared Memory: The CUDA backend now supports spilling registers to shared memory, improving performance for kernels with high register pressure. (Dr.Jit-Core commit 5cf6d3).

  • Memory View Support: Arrays can now be converted to Python memoryview objects for efficient zero-copy data access. (commit b70391).

  • DLPack GIL Release: The dr.ArrayBase.dlpack() method now releases the GIL while waiting, improving multi-threaded performance. (commit 0adf9b).

  • Thread Synchronization: dr.sync_thread() now releases the GIL while waiting, preventing unnecessary blocking in multi-threaded applications. (commit 956d2f).

API Improvements

  • Spherical Direction Utilities: Added Python implementation of spherical direction utilities (dr.sphdir). (PR #432, contributed by Sébastien Speierer).

  • Matrix Conversions: Added support for converting between 3D and 4D matrices: Matrix4f can be constructed from a 3D matrix and Matrix3f from a 4D matrix. (commit 7f8ea8).

  • Quaternion API: Improved the quaternion Python API for better usability and consistency. (commit 282da8).

  • Type casts: Allow casting between Dr.Jit types to properly allow AD<->non-AD conversions when required. (commit 72f1e6).

Bug Fixes

  • Fixed deadlock issues in @dr.freeze decorator. (commit e8fc55).

  • Fixed gradient tracking in Texture.tensor() to ensure gradients are never dropped inadvertently. (PR #444).

  • Fixed AD support for C++ repeat and tile operations with proper gradient propagation. (commits fd6930, 282da8).

  • Fixed Python object traversal to check that __dict__ exists before accessing it, preventing crashes with certain object types. (commit 433ada).

  • Fixed symbolic loop size calculation to properly account for side-effects. (Dr.Jit-Core commit 31bf91).

  • Fixed read-after-free issue in OptiX SBT data loading. (Dr.Jit-Core commit 009ade, contributed by Merlin Nimier-David).

Other Improvements

  • Updated to nanobind v2.9.2.

  • Improved error messages by adding function names to vectorized call errors. (Dr.Jit-Core PR #165, contributed by Sébastien Speierer).

  • Added missing checks for JIT leak warnings. (Dr.Jit-Core PR #166, contributed by Sébastien Speierer).

  • Added warning for LLVM API initialization failures. (Dr.Jit-Core PR #168, contributed by Sébastien Speierer).

  • Fixed pytest warnings and improved test infrastructure. (PR #436).

DrJit 1.1.0 (August 7, 2025)

The v1.1.0 release of Dr.Jit includes several major new features:

Major Features

  • Cooperative Vectors: Dr.Jit now provides an API to efficiently evaluate matrix-vector products in parallel programs. The API targets small matrices (e.g., 128×128, 64×64, or smaller) and inlines all computation into the program. Threads work cooperatively to perform these operations efficiently. On NVIDIA GPUs (Turing or newer), this leverages the OptiX cooperative vector API with tensor core acceleration. On the LLVM backend, operations compile to sequences of packet instructions (e.g., AVX512). See the cooperative vector documentation for more details. Example:

    import drjit as dr
    import drjit.nn as nn
    from drjit.cuda.ad import Float16, TensorXf16
    
    # Create a random number generator
    rng = dr.rng(seed=0)
    
    # Create a matrix and bias representing an affine transformation
    A = rng.normal(TensorXf16, shape=(3, 16))  # 3×16 matrix
    b = TensorXf16([1, 2, 3])                  # Bias vector
    
    # Pack into optimized memory layout
    buffer, A_view, b_view = nn.pack(A, b)
    
    # Create a cooperative vector from 16 inputs
    vec_in = nn.CoopVec(Float16(1), Float16(2), ...)
    
    # Perform matrix-vector multiplication: A @ vec_in + b
    vec_out = nn.matvec(A_view, vec_in, b_view)
    
    # Unpack result back to regular arrays
    x, y, z = vec_out
    

    (Dr.Jit PR #384, Dr.Jit-Core PR #141).

  • Neural Network Library: Building on the cooperative vector functionality, the new drjit.nn module provides modular abstractions for constructing, evaluating, and optimizing neural networks, similar to PyTorch’s nn.Module. This enables fully fused evaluation of small multilayer perceptrons (MLPs) within larger programs. See the neural network module documentation for more details. Example:

    import drjit as dr
    import drjit.nn as nn
    from drjit.cuda.ad import TensorXf16, Float16
    
    # Define a small MLP for function approximation
    net = nn.Sequential(
        nn.SinEncode(16),                 # Sinusoidal encoding
        nn.Linear(-1, -1, bias=False),    # Hidden layer
        nn.ReLU(),
        nn.Linear(-1, -1, bias=False),    # Hidden layer
        nn.ReLU(),
        nn.Linear(-1, 3, bias=False),     # Output layer (3 outputs)
        nn.Tanh()
    )
    
    # Instantiate and optimize for 16-bit tensor cores
    rng = dr.rng(seed=0)
    net = net.alloc(dtype=TensorXf16, size=2, rng=rng)
    weights, net = nn.pack(net, layout='training')
    
    # Evaluate the network
    inputs = nn.CoopVec(Float16(0.5), Float16(0.7))
    outputs = net(inputs)
    x, y, z = outputs  # Three output values
    

    (PR #384).

  • Hash Grid Encoding: Added neural network hash grid encoding inspired by Instant NGP, providing multi-resolution spatial encodings. This includes both traditional hash grids and permutohedral encodings for efficient high-dimensional inputs. (PR #390, contributed by Christian Döring and Merlin Nimier-David).

  • Function Freezing: Added the @dr.freeze decorator to eliminate repeated tracing overhead by caching and replaying JIT-compiled kernels. Dr.Jit normally traces operations to build computation graphs for compilation, which can become a bottleneck when the same complex computation is performed repeatedly (e.g., in optimization loops). The decorator records kernel launches on the first call and replays them directly on subsequent calls, avoiding re-tracing.

    This can dramatically accelerate programs and makes Dr.Jit usable for realtime rendering and other applications with strict timing requirements. See the function freezing documentation for more details. Example:

    import drjit as dr
    from drjit.cuda import Float, UInt32
    
    # Without freezing - traces every time
    def func(x):
        y = seriously_complicated_code(x)
        dr.eval(y) # ..intermediate evaluations..
        return huge_function(y, x)
    
    # With freezing - traces only once
    @dr.freeze
    def frozen(x):
        ... # same code as above -- no changes needed
    

    (Dr.Jit PR #336, Dr.Jit-Core PR #107, contributed by Christian Döring).

  • Shader Execution Reordering (SER): Added the function dr.reorder_threads() to shuffle threads across the GPU to reduce warp-level divergence. When threads in a warp take different branches (e.g., in dr.switch() statements or vectorized virtual function calls) performance can degrade significantly. SER can group threads with similar execution paths into coherent warps to avoid this. This feature is a no-op in LLVM mode. Example:

    import drjit as dr
    from drjit.cuda import Array3f, UInt32
    
    arg = Array3f(...) # Prepare data and callable index
    callable_idx = UInt32(...) % 4  # 4 different callables
    
    # Reorder threads before dr.switch() to reduce divergence
    # The key uses 2 bits (for 4 callables)
    arg = dr.reorder_threads(key=callable_idx, num_bits=2, value=arg)
    
    # Now, threads with the same callable_idx are grouped together
    callables = [func0, func1, func2, func3]
    out = dr.switch(callable_idx, callables, arg)
    

    (Dr.Jit PR #395, Dr.Jit-Core PR #145).

    Related to this, the OptiX backend now requires the OptiX 8.0 ABI (specifically, ABI version 87). This is a requirement for SER. (Dr.Jit-Core PR #117).

  • Random Number Generation API: Introduced a new random number generation API around an abstract Generator object analogous to NumPy. Under the hood, this API uses the Philox4x32 counter-based PRNG from Salmon et al. [2011], which provides high-quality random variates that are statistically independent within and across parallel streams. Users create generators with dr.rng() and call methods like .random() and .normal(). Example:

    import drjit as dr
    from drjit.cuda import Float, TensorXf
    
    # Create a random number generator
    rng = dr.rng(seed=42)
    
    # Generate various random distributions
    uniform = rng.random(Float, 1000)        # Uniform [0, 1)
    normal = rng.normal(Float, 1000)         # Standard normal
    tensor = rng.random(TensorXf, (32, 32))  # Random tensor
    

    (PR #417).

  • Array Resampling and Convolution: Added dr.resample() for changing the resolution of arrays/tensors along specified axes, and dr.convolve() for convolution with continuous kernels. Both operations are fully differentiable and support various reconstruction filters (box, linear, cubic, lanczos, gaussian). Example:

    import drjit as dr
    
    # Resample a 2D signal to a different resolution
    # (original_data is a placeholder for any 128×128 array-like input)
    data = dr.cuda.TensorXf(original_data)  # Shape: (128, 128)
    upsampled = dr.resample(
        data,
        shape=(256, 256),    # Target resolution
        filter='lanczos'     # High-quality filter
    )
    
    # Apply Gaussian blur via convolution
    blurred = dr.convolve(
        data,
        filter='gaussian',
        radius=2.0
    )
    

    (PRs #358, #378).

  • Gradient-Based Optimizers: Added an optimization framework that includes various standard optimizers inspired by PyTorch. It includes dr.opt.SGD with optional momentum and Nesterov acceleration, dr.opt.Adam with adaptive learning rates, and dr.opt.RMSProp. The optimizers own the parameters and automatically handle mixed-precision training. An optional helper class dr.opt.GradScaler implements adaptive gradient scaling for low-precision training. Example:

    import drjit as dr
    from drjit.opt import Adam
    from drjit.cuda.ad import Float  # differentiable array type
    
    # Create optimizer and register parameters
    opt = Adam(lr=1e-3)
    rng = dr.rng(seed=0)
    opt['params'] = Float(rng.normal(Float, 100))
    
    # Optimization loop for unknown function f(x)
    for i in range(1000):
        # Fetch current parameters
        params = opt['params']
    
        # Compute loss and gradients
        loss = f(params)  # Some function to optimize
        dr.backward(loss)
    
        # Update parameters
        opt.step()
    

    (PRs #345, #402, commit e3f576).

  • TensorFlow Interoperability: Added TensorFlow interop via @dr.wrap, supporting forward and backward automatic differentiation with comprehensive support for variables and tensors. (PR #301, contributed by Jakob Hoydis).

Array and Tensor Operations

  • Added dr.concat() to concatenate arrays/tensors along a specified axis following the Array API standard. (PR #354).

  • Added dr.take() and dr.take_interp() for efficient tensor indexing and interpolated indexing along specified axes. (PR #420, commit b59436).

  • Added dr.moveaxis() for rearranging tensor dimensions, providing NumPy-compatible axis movement. (commit 4d1478).

  • Implemented comprehensive slice operations for regular (non-tensor) arrays, supporting advanced patterns like nested slices and integer array indexing. (PR #365).

  • Conversion between tensors and nested arrays (e.g. Array3f) now offers a flip_axis option (flip_axis=True) that controls whether the axis order is flipped (e.g., Nx3 vs 3xN). (PR #348).

Performance Improvements

  • Packet scatter-add operations now map to specialized GPU operations when supported by the hardware and driver. This change also broadens the situations where packet operations can be used on the CPU and GPU. Packets of size 6 were not supported in the past since their size was not a power of two. Now, they are treated as 3 separate size-2 packets. This feature is particularly helpful in combination with the new hash grid class, whose reverse-mode derivative generates atomic packet scatter-additions. (Dr.Jit-Core PR #151, Dr.Jit PR #406).
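
    The size-6 case suggests a simple decomposition rule: split the packet into equal pieces whose width is the largest power of two that divides the packet size. The plain-Python sketch below implements this rule. Only the 6 → 3×2 case is stated above; the generalization to other sizes is an assumption, and actual widths may also be limited by the hardware.

```python
def split_packet(size):
    """Split `size` elements into equal power-of-two sub-packets."""
    if size <= 0:
        raise ValueError("packet size must be positive")
    width = size & -size              # largest power of two dividing size
    return [width] * (size // width)


for size in (4, 6, 8, 12):
    print(size, split_packet(size))
```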

  • Enabled packet memory operations for texture access, providing speedups when accessing multi-channel textures on the LLVM and CUDA backends. (PR #329).

  • Optimized dr.rsqrt() to compile to faster instruction sequences on the LLVM backend using VRSQRTPS with Newton-Raphson iteration on Intel processors and similar optimizations for ARM Neon. (Dr.Jit PR #343, Dr.Jit-Core PR #125).
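
    The refinement step referred to above is the standard Newton-Raphson update for the reciprocal square root, y ← y·(1.5 − 0.5·x·y²). The plain-Python sketch below applies one step to a deliberately imprecise starting value. The 2^-12 relative error used to mimic a hardware estimate is an assumption of this example, and the code is not what the LLVM backend emits.

```python
def rsqrt_refined(x, rel_error=2.0 ** -12):
    """Return (estimate, refined estimate) of 1/sqrt(x)."""
    # Stand-in for a low-precision hardware estimate such as VRSQRTPS:
    # the exact value perturbed by a small relative error (assumption).
    y0 = (x ** -0.5) * (1.0 + rel_error)
    # One Newton-Raphson step for f(y) = 1/y^2 - x
    y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
    return y0, y1


x = 7.0
exact = x ** -0.5
y0, y1 = rsqrt_refined(x)
err0 = abs(y0 - exact) / exact   # error of the raw estimate
err1 = abs(y1 - exact) / exact   # error after one refinement step
print(err0, err1)
```

    Each step roughly squares the relative error, so a single iteration is enough to bring a 12-bit estimate close to single precision.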

  • Made dr.any(), dr.all(), and dr.none() asynchronous with respect to the host, improving GPU utilization. (Dr.Jit PR #344, Dr.Jit-Core PR #126).

Notable Bugfixes

  • Fixed dr::block_reduce() derivative computation for arrays not evenly divisible by block size. (commit df79ed).

  • Fixed potential performance cliffs in dr.gather() by memoizing expressions and limiting expression growth. (Dr.Jit-Core PR #159).

  • Fixed dr.rotate() quaternion component ordering to match C++ implementation. (PR #416).

  • Fixed the derivative of dr.unit_angle() at signed zero. (commit 9d09a9).

  • Fixed memory leak in Python bindings using dedicated cleanup thread. (PR #399).

  • Preserve tensor shapes in symbolic operations. (commit 74c4d0).

  • Fixed evaluated loop derivative issues with unchanged differentiable state variables. (commit 074cfe).

  • Fixed symbolic loop backward derivative compilation for simple loops. (commit 01ef10).

  • Fixed broadcasting of tensors and handling of unknown objects in dr.select(). (PRs #339, #349).

  • Fixed dr.abs() derivative at x=0 to match PyTorch behavior. (commit c597de).

  • Fixes for NVIDIA 50-series GPUs and recent driver versions. (Dr.Jit-Core PR #152).

Other Improvements

  • Fixed several corner cases in dr.dda.dda() (PR #311).

  • Added support for casting to and from boolean array types in Python. (commit 343d16).

  • Enhanced dr.expr_t() to preserve custom array types when compatible. (commit 85d66c).

  • Improved dr.replace_grad() to handle non-differentiable and unknown types gracefully. (PR #364).

  • Improved error handling throughout the codebase by replacing abort() calls with exceptions for better recovery in interactive environments. (commit 27e34c).

  • Added dr.profile_enable() context manager for selective CUDA profiling using the NSight tools. (commit e4dda9).

  • When compiling Dr.Jit with Clang/Linux, libstdc++ can now also be used. Previously, the libc++ standard library was required in this case. (PR #346).

DrJit 1.0.5 (February 3, 2025)

  • Workaround for OptiX linking issue in driver version R570+. (commit 0c9c54).

  • Tensors can now be used as condition and state variables of dr.if_stmt/while_loop. (commit 4691fe).

DrJit 1.0.4 (January 28, 2025)

  • Release was retracted

DrJit 1.0.3 (January 16, 2025)

DrJit 1.0.2 (January 14, 2025)

  • Warning about NVIDIA drivers v565+. (commit b5fd88).

  • Support for boolean Python arguments in drjit.select(). (commit d0c881).

  • Backend refactoring: vectorized calls are now also isolated per variant. (commit 17bc70).

  • Fixes to dr::safe_cbrt(). (commit 2f8a3a).

DrJit 1.0.1 (November 23, 2024)

DrJit 1.0.0 (November 21, 2024)

The 1.0 release of Dr.Jit marks a major new phase of this project. We addressed long-standing limitations and thoroughly documented every part of Dr.Jit. Due to the magnitude of the changes, some incompatibilities are unavoidable: bullet points with an exclamation mark highlight changes with an impact on source-level compatibility.

Here is what’s new:

  • Python bindings: Dr.Jit comes with an all-new set of Python bindings created using the nanobind library. This has several consequences:

    • Tracing Dr.Jit code written in Python is now significantly faster (we’ve observed speedups by a factor of ~10-20×). This should help in situations where performance is limited by tracing rather than kernel evaluation.

    • Thorough type annotations improve static type checking and code completion in editors like VS Code.

    • Dr.Jit can now target Python 3.12’s stable ABI. This means that binary wheels will work on future versions of Python without recompilation.

  • Natural syntax: vectorized loops and conditionals can now be expressed using natural Python syntax. To see what this means, consider the following function that computes an integer power of a floating point array:

    import drjit as dr
    from drjit.cuda import Int, Float
    
    @dr.syntax # <-- new!
    def ipow(x: Float, n: Int):
        result = Float(1)
    
        while n != 0:       # <-- vectorized loop ('n' is an array)
            if n & 1 != 0:  # <-- vectorized conditional
                result *= x
            x *= x
            n >>= 1
    
        return result
    

    Given that this function processes arrays, we expect that the condition of the if statement may disagree among elements. Also, each element may need a different number of loop iterations. However, such component-wise conditionals and loops aren’t supported by normal Python. Previously, Dr.Jit provided ways of expressing such code using masking and a special dr.cuda.Loop object, but this was rather tedious.

    The new @drjit.syntax decorator greatly simplifies the development of programs with complex control flow. It performs an automatic source code transformation that replaces conditionals and loops with array-compatible variants (drjit.while_loop(), drjit.if_stmt()). The transformation leaves everything else as-is, including line number information that is relevant for debugging.
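
To make the per-element semantics concrete, here is a plain-Python model of what the vectorized ipow function computes. It does not use Dr.Jit and does not show the actual code transformation; it only emulates, with an explicit mask, the fact that every lane runs its own number of iterations. The name ipow_masked is hypothetical.

```python
def ipow_masked(x: list[float], n: list[int]) -> list[float]:
    """Element-wise integer power with an explicit per-lane mask.

    Assumes non-negative exponents.
    """
    x, n = list(x), list(n)           # work on copies of the inputs
    result = [1.0] * len(x)
    active = [k != 0 for k in n]      # lanes that still need iterations

    while any(active):                # run until every lane has finished
        for i in range(len(x)):
            if not active[i]:
                continue              # masked-out lane: state stays untouched
            if n[i] & 1:              # per-lane conditional
                result[i] *= x[i]
            x[i] *= x[i]
            n[i] >>= 1
            active[i] = n[i] != 0
    return result


print(ipow_masked([2.0, 3.0, 5.0], [3, 2, 0]))  # [8.0, 9.0, 1.0]
```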

  • Differentiable control flow: symbolic control flow constructs (loops) previously failed with an error message when they detected differentiable variables. In the new version of Dr.Jit, symbolic operations (loops, function calls, and conditionals) are now differentiable in both forward and reverse modes, with one exception: the reverse-mode derivative of loops is still incomplete and will be added in the next version of Dr.Jit.

  • Documentation: every Dr.Jit function now comes with extensive reference documentation that clearly specifies its behavior and accepted inputs. The behavior with respect to tensors and arbitrary object graphs (referred to as “PyTrees”) was made consistent.

  • Half-precision arithmetic: Dr.Jit now provides float16-valued arrays and tensors on both the LLVM and CUDA backends (e.g., drjit.cuda.ad.TensorXf16 or drjit.llvm.Float16).

  • Mixed-precision optimization: Dr.Jit now maintains one global AD graph for all variables, enabling differentiation of computation combining single-, double-, and half-precision variables. Previously, there was a separate graph per type, and gradients did not propagate through casts between them.

  • Multi-framework computations: The @drjit.wrap decorator provides a differentiable bridge to other AD frameworks. In this new release of Dr.Jit, its capabilities were significantly revamped. Besides PyTorch, it now also supports JAX, and it consistently handles both forward and backward derivatives. The new interface admits functions with arbitrary fixed/variable-length positional and keyword arguments containing arbitrary PyTrees of differentiable and non-differentiable arrays, tensors, etc.

  • Debug mode: A new debug validation mode (drjit.JitFlag.Debug) inserts a number of additional checks to identify sources of undefined behavior. Enable it to catch out-of-bounds reads, writes, and calls to undefined callables. Such operations will trigger a warning that includes the responsible source code location.

    Dr.Jit's built-in assertion checks are also active in debug mode. They support both regular and symbolic inputs in a consistent fashion.

  • Symbolic print statement: A new high-level symbolic print operation drjit.print() enables deferred printing from any symbolic context (i.e., within symbolic loops, conditionals, and function calls). It is compatible with Jupyter notebooks and displays arbitrary PyTrees in a structured manner. This operation replaces the function drjit.print_async() provided in previous releases.

  • Swizzling: swizzle access and assignment operators are now provided. You can use them to arbitrarily reorder, grow, or shrink the input array.

    a = Array4f(...)
    b = Array2f(...)
    a.xyw = a.xzy + b.xyx
    
  • Scatter-reductions: the performance of atomic scatter-reductions (drjit.scatter_reduce(), drjit.scatter_add()) has been significantly improved. Both functions now provide a mode= parameter to select between different implementation strategies. The new strategy drjit.ReduceMode.Expand offers a speedup of over 10× on the LLVM backend compared to the previously used local reduction strategy. Furthermore, improved code generation for drjit.ReduceMode.Local brings a roughly 20-40% speedup on the CUDA backend. See the documentation section on atomic reductions for details and benchmarks with plots.
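
The following plain-Python sketch models the semantics of a scatter-add and one way a conflict-free strategy can be organized. It assumes that the expanded strategy amounts to giving every worker a private copy of the target that is merged at the end; this is an illustration of the general idea, not Dr.Jit's implementation, and the function names and the workers parameter are hypothetical.

```python
def scatter_add(target, values, indices):
    """Reference semantics: out[indices[i]] += values[i]."""
    out = list(target)
    for v, i in zip(values, indices):
        out[i] += v
    return out


def scatter_add_expanded(target, values, indices, workers=4):
    """Accumulate into per-worker private copies, then merge them."""
    private = [[0] * len(target) for _ in range(workers)]
    for k, (v, i) in enumerate(zip(values, indices)):
        private[k % workers][i] += v   # workers never write to the same copy
    out = list(target)
    for copy in private:               # final merge step
        for j, v in enumerate(copy):
            out[j] += v
    return out


values = [1, 2, 3, 4, 5]
indices = [0, 2, 0, 1, 2]
print(scatter_add([0, 0, 0], values, indices))           # [4, 4, 7]
print(scatter_add_expanded([0, 0, 0], values, indices))  # [4, 4, 7]
```

The private copies trade extra memory for the absence of write conflicts, which is the kind of trade-off a mode parameter lets the caller choose.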

  • Packet memory operations: programs often gather or scatter several memory locations that are directly next to each other in memory. In principle, it should be possible to do such reads or writes more efficiently.

    Dr.Jit now features improved code generation to realize this optimization for calls to dr.gather() and dr.scatter() that access a power-of-two-sized chunk of contiguous array elements. On the CUDA backend, this operation leverages native packet memory instructions, which can produce small speedups on the order of ~5-30%. On the LLVM backend, packet loads/stores now compile to aligned packet loads/stores with a transpose operation that brings data into the right shape. Speedups here are dramatic (up to >20× for scatters, 1.5 to 2× for gathers). See the drjit.JitFlag.PacketOps flag for details. On the LLVM backend, packet scatter-additions furthermore compose with the drjit.ReduceMode.Expand optimization explained in the last point, which combines the benefits of both steps. This is particularly useful when computing the reverse-mode derivative of packet reads.

  • Reductions: reduction operations previously existed as regular (e.g., drjit.all()) and nested (e.g., drjit.all_nested()) variants. Both are now subsumed by an optional axis argument similar to how this works in other array programming frameworks like NumPy. Reductions can now also process any number of axes on both regular Dr.Jit arrays and tensors.

    The reduction functions (drjit.all(), drjit.any(), drjit.sum(), drjit.prod(), drjit.min(), drjit.max()) have different default axis values depending on the input type. For tensors, axis=None by default and the reduction is performed along the entire underlying array recursively, analogous to the previous nested reduction. For all other types, the reduction is performed over the outermost axis (axis=0) by default.

    Aliases for the _nested function variants still exist to help porting but are deprecated and will be removed in a future release.
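
The difference between the two defaults can be modeled on nested Python lists. The sketch below is not Dr.Jit code; reduce_sum is a hypothetical helper that only covers axis=0 on lists of at most two levels and the recursive axis=None case.

```python
def reduce_sum(value, axis=0):
    """Sum a nested list over the outermost axis (0) or over everything (None)."""
    if axis is None:
        if isinstance(value, list):
            return sum(reduce_sum(v, axis=None) for v in value)
        return value
    if axis != 0:
        raise NotImplementedError("this sketch only models axis=0 and axis=None")
    if value and isinstance(value[0], list):
        # Add the outermost entries component-wise
        return [sum(column) for column in zip(*value)]
    return sum(value)


data = [[1, 2, 3], [4, 5, 6]]
print(reduce_sum(data, axis=0))     # [5, 7, 9]
print(reduce_sum(data, axis=None))  # 21
```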

  • Prefix reductions: the functions drjit.cumsum() and drjit.prefix_sum() compute inclusive or exclusive prefix sums along arbitrary axes of a tensor or array. They wrap the more general drjit.prefix_reduce() that also supports other arithmetic operations (e.g., minimum/maximum/product/and/or reductions), reverse reductions, etc.
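
For readers unfamiliar with the terminology, here is a plain-Python model of inclusive, exclusive, and reverse prefix reductions on a flat list. The helper and its parameter names are chosen for illustration and should not be read as the Dr.Jit signature.

```python
from itertools import accumulate
import operator


def prefix_reduce(values, op=operator.add, exclusive=False, identity=0,
                  reverse=False):
    """Inclusive or exclusive prefix reduction of a flat list."""
    seq = list(reversed(values)) if reverse else list(values)
    if not seq:
        return []
    if exclusive:
        # Shift by one: entry i combines all elements strictly before i
        out = list(accumulate([identity] + seq[:-1], op))
    else:
        out = list(accumulate(seq, op))
    return list(reversed(out)) if reverse else out


print(prefix_reduce([1, 2, 3, 4]))                  # [1, 3, 6, 10]
print(prefix_reduce([1, 2, 3, 4], exclusive=True))  # [0, 1, 3, 6]
print(prefix_reduce([3, 1, 2], op=min))             # [3, 1, 1]
```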

  • Block reductions: the new functions drjit.block_reduce() and drjit.block_prefix_reduce() compute reductions within contiguous blocks of an array.
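
A block reduction can be modeled in a few lines of plain Python. The helpers below are hypothetical and only handle sums on flat lists; letting a trailing partial block reduce over the remaining elements is an assumption of this sketch.

```python
from itertools import accumulate


def block_reduce_sum(values, block_size):
    """Sum each contiguous block of 'block_size' elements."""
    return [sum(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def block_prefix_sum(values, block_size):
    """Inclusive prefix sum that restarts at every block boundary."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(accumulate(values[i:i + block_size]))
    return out


print(block_reduce_sum([1, 2, 3, 4, 5, 6], 2))  # [3, 7, 11]
print(block_prefix_sum([1, 2, 3, 4, 5, 6], 3))  # [1, 3, 6, 4, 9, 15]
```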

  • Local memory: kernels can now allocate temporary thread-local memory and perform arbitrary indexed reads and writes. This is useful to implement a stack or other types of scratch space that might be needed by a calculation. See the separate documentation section about local memory for details.

  • DDA: a newly added digital differential analyzer (drjit.dda.dda()) can be used to traverse the intersection of a ray segment and an n-dimensional grid. The function drjit.dda.integrate() builds on this functionality to compute analytic differentiable line integrals of bi- and trilinear interpolants.

  • Loop compression: the implementation of evaluated loops (previously referred to as wavefront mode) visits all entries of the loop state variables at every iteration, even when most of them have already finished executing the loop. Dr.Jit now provides an optional compress=True parameter in drjit.while_loop() to prune away inactive entries and accelerate later loop iterations.
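
The benefit of pruning inactive entries can be seen in a toy model of an evaluated loop. The sketch below is plain Python rather than Dr.Jit code; it counts how many entries are visited when one long-running element keeps the loop alive, with and without compression. The function run_loop is hypothetical.

```python
def run_loop(counters, compress):
    """Decrement each counter to zero.

    Returns (iterations per entry, total number of entries visited).
    """
    state = list(counters)
    steps = [0] * len(state)
    active = list(range(len(state)))   # indices processed in each iteration
    visited = 0
    while any(state[i] > 0 for i in active):
        visited += len(active)
        for i in active:
            if state[i] > 0:           # mask: finished entries do no work
                state[i] -= 1
                steps[i] += 1
        if compress:
            # Prune entries that have finished executing the loop
            active = [i for i in active if state[i] > 0]
    return steps, visited


counters = [1, 1, 1, 8]
print(run_loop(counters, compress=False))  # ([1, 1, 1, 8], 32)
print(run_loop(counters, compress=True))   # ([1, 1, 1, 8], 11)
```

Both variants compute the same result; compression only reduces the amount of work spent on entries that are already done.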

  • The new release has a strong focus on error resilience and leak avoidance. Exceptions raised in custom operations, function dispatch, symbolic loops, etc., should not cause failures or leaks. Both Dr.Jit and nanobind are very noisy if they detect that objects are still alive when the Python interpreter shuts down.

  • Terminology cleanup: Dr.Jit has two main ways of capturing control flow (conditionals, loops, function calls). The first is to evaluate each possible outcome eagerly, which launches many small kernels (this is now called evaluated mode). The second is to capture control flow and merge it into the same kernel (this is now called symbolic mode). Previously, inconsistent and rendering-specific terminology was used to refer to these two concepts.

    Several entries of the drjit.JitFlag enumeration were renamed to reflect this fact (for example, drjit.JitFlag.VCallRecord is now called drjit.JitFlag.SymbolicCalls). The former entries still exist as (deprecated) aliases.

  • Index reuse: variable indices (drjit.ArrayBase.index, drjit.ArrayBase.index_ad) used to monotonically increase as variables were being created. Internally, multiple hash tables were needed to associate these ever-growing indices with locations in an internal variable array, which had a surprisingly large impact on tracing performance. Dr.Jit removes this mapping both at the AD and JIT levels and eagerly reuses variable indices.

    This change can be inconvenient for low-level debugging, where it was often helpful to inspect the history of operations involving a particular variable by searching a trace dump for mentions of its variable index. Such trace dumps were generated by setting drjit.set_log_level() to a level of drjit.LogLevel.Debug or even drjit.LogLevel.Trace. A new flag was introduced to completely disable variable reuse and help such debugging workflows:

    dr.set_flag(dr.JitFlag.ReuseIndices, False)
    

    Note that this causes the internal variable array to steadily grow, hence this feature should only be used for brief debugging sessions.

  • The drjit.empty() function used to immediately allocate an array of the desired shape (compared to, say, drjit.zeros(), which creates a literal constant array that consumes no device memory). Users found this surprising, so the behavior was changed so that drjit.empty() similarly delays allocation.

  • Fast math: Dr.Jit now has an optimization flag named drjit.JitFlag.FastMath that is reminiscent of -ffast-math in C/C++ compilers. It enables program simplifications such as a*0 == 0 that are not always valid. For example, equality in this example breaks when a is infinite or equal to NaN. The flag is on by default since it can considerably improve performance especially when targeting GPUs.
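
The caveat about a*0 == 0 follows directly from IEEE 754 arithmetic and can be checked in plain Python. The two helper functions below are hypothetical and merely contrast the exact product with the simplified result.

```python
import math


def times_zero_exact(a: float) -> float:
    """Evaluate the product as written."""
    return a * 0.0


def times_zero_simplified(a: float) -> float:
    """Result after the simplification a*0 -> 0."""
    return 0.0


for a in (3.0, math.inf, math.nan):
    exact = times_zero_exact(a)
    simplified = times_zero_simplified(a)
    agree = exact == simplified  # False whenever 'exact' is NaN
    print(a, exact, simplified, agree)
```

The two versions agree for finite inputs and differ for infinities and NaNs, where the exact product is NaN.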

⚠️ Compatibility ⚠️

  • Symbolic loop syntax: the old “recorded loop” syntax is no longer supported. Existing code will need adjustments to use drjit.while_loop().

  • Comparison operators: The == and != comparisons previously reduced the result to a single Python bool. They now return an array of component-wise comparisons to be more consistent with other array programming frameworks. Use dr.all(a == b) or dr.all(a == b, axis=None) to get the previous behavior.

    The functions drjit.eq() and drjit.neq() for element-wise equality and inequality tests were removed, as their behavior is now subsumed by the builtin == and != operators.
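
The change in comparison semantics can be modeled with plain Python lists. This sketch does not use Dr.Jit; the helper names are hypothetical, with one standing in for the new component-wise result and the other for the previous fully reduced result.

```python
def equal_componentwise(a, b):
    """New behavior: one boolean per component."""
    return [x == y for x, y in zip(a, b)]


def equal_reduced(a, b):
    """Previous behavior: a single boolean for the whole array."""
    return all(equal_componentwise(a, b))


a = [1, 2, 3]
b = [1, 5, 3]
print(equal_componentwise(a, b))  # [True, False, True]
print(equal_reduced(a, b))        # False
```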

  • Matrix layout: The Dr.Jit matrix type switched from column-major to row-major storage. Your code will need to be updated if it indexes into matrices first by column and then row (matrix[col][row]) instead of specifying the complete location matrix[row, col]. The latter convention is consistent between both versions.
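
The effect of the layout change on nested indexing can be illustrated with nested Python lists. This is a model of the two storage conventions, not Dr.Jit code: the same logical entry is reached with swapped indices depending on whether the outer list holds rows or columns.

```python
ROWS, COLS = 2, 3

# Logical matrix entries, keyed by (row, col)
entries = {(r, c): 10 * r + c for r in range(ROWS) for c in range(COLS)}

# Row-major: the outer list holds rows, so nested access is [row][col]
row_major = [[entries[r, c] for c in range(COLS)] for r in range(ROWS)]

# Column-major: the outer list holds columns, so nested access is [col][row]
col_major = [[entries[r, c] for r in range(ROWS)] for c in range(COLS)]

r, c = 1, 2
print(row_major[r][c])  # 12
print(col_major[c][r])  # 12
```

Code that spells out the complete location as (row, col) is unaffected, whereas nested indexing has to swap its two indices.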

Internals

This section documents lower level changes that don’t directly impact the Python API.

  • Compilation of Dr.Jit is faster and produces smaller binaries. Downstream projects built on top of Dr.Jit will also see improvements on both metrics.

  • Dr.Jit now builds a support library (libdrjit-extra.so) containing large amounts of functionality that used to be implemented using templates. The disadvantage of the previous template-heavy approach was that this code ended up getting compiled over and over again especially when Dr.Jit was used within larger projects such as Mitsuba 3, where this caused very long compilation times.

    The following features were moved into this library:

    • Transcendental functions (drjit.log(), drjit.atan2(), etc.) now have pre-compiled implementations for Jit arrays. Automatic differentiation of such operations was also moved into libdrjit-extra.so.

    • The AD layer was rewritten to reduce the previous backend (drjit/autodiff.h) into a thin wrapper around functionality in libdrjit-extra.so. The previous AD-related shared library libdrjit-autodiff.so no longer exists.

    • The template-based C++ interface to perform vectorized method calls on instance arrays (drjit/vcall.h, drjit/vcall_autodiff.h, drjit/vcall_jit_reduce.h, drjit/vcall_jit_record.h) was removed and turned into a generic implementation within the libdrjit-extra.so library. All functionality (symbolic/evaluated mode, automatic differentiation) is now exposed through a single statically precompiled function (ad_call). The same function is also used to realize the Python interface (drjit.switch(), drjit.dispatch()).

      To de-emphasize C++ virtual method calls (the interface is more broadly about calling things in parallel), the header file was renamed to drjit/call.h. All macro uses of DRJIT_VCALL_* should be renamed to DRJIT_CALL_*.

    • Analogous to function calls, the Python and C++ interfaces to symbolic/evaluated loops and conditionals are each implemented through a single top-level function (ad_loop and ad_cond) in libdrjit-extra.so. This removes large amounts of template code and accelerates compilation.

  • Improved the kernel launch configurations of the CUDA and LLVM backends so that they use the available parallelism more effectively.

  • The packet mode backend (include/drjit/packet.h) now includes support for aarch64 processors via NEON intrinsics. This is actually an old feature from a predecessor project (Enoki) that was finally revived.

  • The nb::set_attr() function that was previously used to update modified fields queried by a getter no longer exists. Dr.Jit now uses a simpler way to deal with getters. The technical reason that formerly required the presence of this function doesn’t exist anymore.

Removals

  • Packet-mode virtual function call dispatch (drjit/vcall_packet.h) was removed.

  • The legacy string-based IR in Dr.Jit-Core has been removed.

  • The ability to instantiate a differentiable array on top of a non-JIT-compiled type (e.g., dr::DiffArray<float>) was removed. This was in any case too inefficient to be useful besides debugging.

Other minor technical improvements

  • drjit.switch() and drjit.dispatch() now support all standard Python calling conventions (positional, keyword, variable length).

  • There is a new C++ interface named drjit::dispatch() that works analogously to the Python version.

  • The drjit.reinterpret_array_v function was renamed to drjit.reinterpret_array().

  • The drjit.llvm.PCG32.seed() function (and other backend variants) was modified to add the lane counter to both initseq and initstate. Previously, the counter was only added to the former, which led to noticeable correlation artifacts.
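
The seeding change can be sketched with a scalar PCG32 generator written in plain Python. The generator follows the standard PCG32 (XSH-RR) reference algorithm; the default constants are the conventional PCG32 defaults, and the helper seed_lanes is a hypothetical stand-in for the vectorized seed() function, so treat this as an illustration rather than Dr.Jit's implementation.

```python
MASK64 = (1 << 64) - 1
PCG32_MULT = 6364136223846793005

# Conventional PCG32 default constants (assumed here for illustration)
DEFAULT_STATE = 0x853C49E6748FEA9B
DEFAULT_STREAM = 0xDA3E39CB94B95BDB


class PCG32:
    """Scalar PCG32 generator (XSH-RR output function)."""

    def __init__(self, initstate: int, initseq: int):
        self.state = 0
        self.inc = ((initseq << 1) | 1) & MASK64   # stream selector, always odd
        self.next_uint32()
        self.state = (self.state + initstate) & MASK64
        self.next_uint32()

    def next_uint32(self) -> int:
        old = self.state
        self.state = (old * PCG32_MULT + self.inc) & MASK64
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF


def seed_lanes(count, initstate=DEFAULT_STATE, initseq=DEFAULT_STREAM):
    """Seed one generator per lane, offsetting BOTH parameters by the lane index."""
    return [PCG32(initstate + lane, initseq + lane) for lane in range(count)]


lanes = seed_lanes(4)
first = [rng.next_uint32() for rng in lanes]
print([hex(v) for v in first])
```

Offsetting only initseq gives every lane the same initial state on a different stream; offsetting both parameters additionally decorrelates the starting points.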