Optimizations

This section lists optimizations performed by Dr.Jit while tracing code. The examples all use the following import:

>>> from drjit.auto import Int

Vectorization and parallelization

Dr.Jit automatically vectorizes and parallelizes traced code. The implications of these transformations are backend-specific.

Consider the following simple calculation, which squares an integer sequence with 10000 elements.

>>> dr.arange(Int, 10000)**2
[0, 1, 4, .. 9994 skipped .., 99940009, 99960004, 99980001]

On the LLVM backend, vectorization means that generated code uses instruction set extensions such as Intel AVX/AVX2/AVX512, or ARM NEON when they are available. For example, when the machine supports the AVX512 extensions, each machine instruction processes a packet of 16 values, which means that a total of 625 packets need to be evaluated.

The system uses the built-in nanothread thread pool to distribute packets to be processed among the available processor cores. In this simple example, there is not enough work to truly benefit from multi-core parallelism, but this approach pays off in more complex examples.

You can use the functions drjit.thread_count(), drjit.set_thread_count() to specify the number of threads used for parallel processing.

On the GPU backends (CUDA and Metal), the system automatically determines a number of threads that maximize occupancy along with a suitable number of blocks and then launches a parallel program that spreads out over the entire GPU (assuming that there is enough work to do so).

Copy-on-Write

Arrays are reference-counted and use a Copy-on-Write (CoW) strategy. This means that copying an array is cheap since the copy can reference the original array without requiring a device memory copy. The matching variable indices in the example below demonstrate the lack of an actual copy.

>>> a = Int(1, 2, 3)
>>> b = Int(a)        # <- create a copy of 'a'
>>> a.index, b.index
(1, 1)

However, subsequent modification causes this copy to be made.

>>> b[0] = 0
>>> (a.index, b.index)
(1, 2)

This optimization is always active and cannot be disabled.

Constant propagation

Dr.Jit immediately performs arithmetic involving literal constant arrays:

>>> a = Int(4) + Int(5)
>>> a.state
dr.VarState.Literal

In other words, the addition does not become part of the generated device code. This optimization reduces the size of the generated LLVM/PTX IR and can be controlled via drjit.JitFlag.ConstantPropagation.

Dead code elimination

When generating code, Dr.Jit excludes unnecessary operations that do not influence arrays evaluated by the kernel. It also removes dead branches in loops and conditional statements.

This optimization is always active and cannot be disabled.

Value numbering

Dr.Jit collapses identical expressions into the same variable (this is safe given the CoW strategy explained above).

>>> a, b = Int(1, 2, 3), Int(4, 5, 6)
>>> c = a + b
>>> d = a + b
>>> c.index == d.index
True

This optimization reduces the size of the generated LLVM/PTX IR and can be controlled via drjit.JitFlag.ValueNumbering.

Local atomic reduction

Atomic memory operations can be a bottleneck when they encounter write contention, which refers to a situation where many threads attempt to write to the same array element at once.

For example, the following operation causes 1’000’000 threads to write to a[0].

>>> a = dr.zeros(Int, 10)
>>> dr.scatter_add(target=a, index=dr.zeros(Int, 1000000), value=...)

Since Dr.Jit vectorizes the program during execution, the computation is grouped into packets that typically contain 16 to 32 elements. By locally pre-accumulating the values within each packet and then only performing 31-62K atomic memory operations (instead of 1’000’000), performance can be considerably improved.

This issue is particularly important when automatically differentiating computation in reverse mode (e.g. drjit.backward()), since this transformation turns differentiable global memory reads into atomic scatter-additions. A differentiable scalar read is all it takes to create such an atomic memory bottleneck.

The following plots illustrate the expected level performance in a microbenchmark that scatters-adds \(10^8\) random integers into a buffer at uniformly distributed positions. The size of the target buffer varies along the horizontal axis. Generally, we expect to see significant contention on the left, since this involves a large number of writes to only a few elements. The behavior of GPU and CPU atomics are somewhat different, hence we look at them in turn starting with the CUDA backend.

The drjit.ReduceMode.Direct strategy generates a plain atomic operation without additional handling. This generally performs badly except for two special cases: when writing to a scalar array, the NVIDIA compiler detects this and performs a specialized optimization (that is, however, quite specific to this microbenchmark and unlikely to work in general). Towards the right, there is essentially no contention and multiple writes to the same destination are unlikely to appear within the same warp, hence drjit.ReduceMode.Direct outperforms the other methods.

https://d38rqfq1h7iukm.cloudfront.net/media/uploads/wjakob/2024/01/scatter_add_cuda.svg https://d38rqfq1h7iukm.cloudfront.net/media/uploads/wjakob/2024/01/scatter_add_cuda_dark.svg

The drjit.ReduceMode.Local strategy in the above plot performs a butterfly reduction to locally pre-reduce writes targeting the same region of memory, which significantly reduces the dangers of atomic memory contention.

On the CPU (LLVM) backend, Direct mode can become so slow that this essentially breaks the program. The Local strategy is analogous to the CUDA backend and improves performance by an order of magnitude when many writes target the same element. In this benchmark, that becomes less likely as the target array grows, and the optimization becomes ineffective.

https://d38rqfq1h7iukm.cloudfront.net/media/uploads/wjakob/2024/01/scatter_add_llvm.svg https://d38rqfq1h7iukm.cloudfront.net/media/uploads/wjakob/2024/01/scatter_add_llvm_dark.svg

The drjit.ReduceMode.Expand strategy produces a near-flat profile. It replicates the target array to avoid write conflicts altogether, which enables the use of non-atomic memory operations. This is significantly faster but also very memory-intensive, as the storage cost of an 1 MiB array targeted by a drjit.scatter_reduce() operation now grows to N MiB, where N is the number of cores. The functions expand_threshold() and set_expand_threshold() can be used to set thresholds that determine when Dr.Jit is willing to automatically use this strategy.

Packet memory operations

The functions drjit.gather(), drjit.scatter(), and drjit.scatter_reduce() can be used to access vectors in a flat array.

For example,

>>> buffer = Float(...)
>>> vec4_out = dr.gather(dtype=Array4f, source=buffer, index=..)

is equivalent to (but more efficient than) four subsequent gathers that access elements index4*0 to index*4+3. Dr.Jit compiles such operations into packet memory operations whenever the size of the output array is a power of two. Other sizes are decomposed into sequences of smaller packet operations (for example, size 24 is realized as 3 packets with width 8). This yields a small performance improvement on the GPU (on the order of 5-30%) and a massive speedup on the LLVM CPU backend especially for scatters. See the flag drjit.JitFlag.PacketOps for details.

Other

Some other optimizations are specific to symbolic operations, such as

Please refer the documentation of these flags for details.

Matrix multiplication

Dr.Jit ships with efficient matrix multiplication kernels powering the @ operator and dr.matmul().

Their design follows the same tradeoff as the other precompiled CUDA kernels shipped with Dr.Jit: rather than chase the last few percent of peak throughput, we aim for kernels that are small enough to ship and simple enough to maintain, while staying reasonably competitive with specialized libraries. On the CPU this matches or slightly exceeds OpenBLAS across the sizes we benchmarked; on the GPU we reach 55%–75% of PyTorch on large single-precision GEMMs without pulling in cuBLAS or cuDNN. The CUDA implementation does not use the tensor cores available on recent NVIDIA GPUs, so for half-precision matrix multiplies PyTorch will be more than 4x faster on GeForce-type cards.

Benchmarks below are for square single-precision GEMMs. CPU measurements used an AMD Ryzen 9 7950X with NumPy linked against OpenBLAS 0.3.26; GPU measurements used an NVIDIA RTX 5090 with PyTorch 2.10.0.

Size

NumPy

Dr.Jit CPU

Rel.

128

263 GFLOP/s

247 GFLOP/s

94%

256

781 GFLOP/s

730 GFLOP/s

93%

512

1047 GFLOP/s

1046 GFLOP/s

100%

1024

1744 GFLOP/s

1830 GFLOP/s

105%

2048

2022 GFLOP/s

2108 GFLOP/s

104%

4096

1890 GFLOP/s

2134 GFLOP/s

113%

Size

PyTorch

Dr.Jit CUDA

Rel.

128

362 GFLOP/s

767 GFLOP/s

212%

256

2979 GFLOP/s

2811 GFLOP/s

94%

512

18725 GFLOP/s

10331 GFLOP/s

55%

1024

47663 GFLOP/s

31941 GFLOP/s

67%

2048

68759 GFLOP/s

50629 GFLOP/s

74%

4096

64250 GFLOP/s

48248 GFLOP/s

75%