.. py:module:: drjit

.. _firststeps-py:

First steps in Python
=====================

Dr.Jit offers both Python and C++ interfaces. The majority of this documentation covers the Python interface. For differences that are specific to C++, see the :ref:`separate section ` on this.

Installing Dr.Jit
-----------------

The easiest way to obtain Dr.Jit is using `binary wheels `_, which we provide for officially supported Python versions and the most common platforms (Linux x86_64, Windows x86_64, macOS arm64/x86_64). To install Dr.Jit in this way, run

.. code-block:: bash

   $ pip install drjit

The remainder of this section walks through a simple example that makes use of various system features. In particular, we will render an image of a bumpy sphere expressed as a `signed distance function `_ using just-in-time compilation, random number generation, GPU texturing, loop recording, and automatic differentiation. You can follow the example by copy-pasting code to a Python file or a `JupyterLab `_ instance (recommended).

Importing Dr.Jit
----------------

Most Dr.Jit functionality resides in the ``drjit`` namespace that we typically associate with the ``dr`` alias for convenience.

.. code-block:: python

   import drjit as dr

Besides this, we must also choose a set of array types from a specific computation backend. Several choices are available:

- ``drjit.cuda`` provides data types for GPU-accelerated parallel computing using `CUDA `_.

- ``drjit.llvm`` provides data types for CPU-accelerated parallel computing using the `LLVM compiler infrastructure `_.

- ``drjit.scalar`` provides simple data types for serial/scalar computation.

Further backends (e.g., Apple Metal, Intel Xe) are planned.

The CUDA and LLVM backends also provide specialized aliases for derivative tracking using `automatic differentiation `_ (``drjit.cuda.ad`` and ``drjit.llvm.ad``). We will discuss them later as part of this tutorial.
We begin by importing various components that will be used in the tutorial:

.. code-block:: python

   from drjit.cuda import Float, UInt32, Array3f, Array2f, TensorXf, Texture3f, PCG32, Loop

LLVM backend
^^^^^^^^^^^^

If you don't have a CUDA-compatible GPU, change ``drjit.cuda`` to ``drjit.llvm`` in the above import statement. In that case, note that LLVM 8 or newer must be installed on the system, which may require additional steps depending on your platform:

- **Linux**: ``apt-get install llvm`` (or an equivalent command for your distribution).

- **macOS**: ``brew install llvm`` using `Homebrew `_.

- **Windows**: run one of the `official installers `_ (many files can be downloaded from this page, look for ones with the pattern ``LLVM--win64.exe``). It is important that you let the installer adjust the ``%PATH%`` variable so that the file ``LLVM-C.dll`` can be found by Dr.Jit.

With that out of the way, let's get back to the example.

Signed distance functions and sphere tracing
--------------------------------------------

A `signed distance function `_ (SDF) is a function that specifies the distance to the nearest surface. It provides a convenient and general way of encoding 3D shape information. We start with the simple SDF of a sphere with radius 1 that is centered at the origin:

.. code-block:: python

   def sdf(p: Array3f) -> Float:
       return dr.norm(p) - 1

The function takes 3D points and returns the associated distance value. The type annotations are provided for clarity and can be omitted in practice. We can use an interactive Python prompt to pass an ``Array3f`` instance representing a 3D point into the function and observe the calculated distance.

.. code-block:: pycon

   >>> sdf(Array3f(1, 2, 3))
   [2.7416574954986572]

The CUDA and LLVM backends of Dr.Jit *vectorize* and *parallelize* computation. This means that types like ``Float`` and ``Array3f`` typically hold many values at once that are used to perform simultaneous evaluations of a function.
For example, we can compute the SDF at positions :math:`(0, 0, 0)` and :math:`(1, 2, 3)` in one combined step.

.. code-block:: pycon

   >>> sdf(Array3f([0, 1], [0, 2], [0, 3]))
   [-1.0, 2.7416574954986572]

To visualize the surface encoded by the SDF, we will use an algorithm called `sphere tracing `_. Given a ray with an origin :math:`\mathbf{o}` and direction :math:`\mathbf{d}`, sphere tracing evaluates :math:`\mathrm{sdf}(\mathbf{o})` to find the distance to the nearest surface. The line segment connecting :math:`\mathbf{o}` and :math:`\mathbf{o} + \mathbf{d}\cdot\mathrm{sdf}(\mathbf{o})` is free of surfaces by construction, and the algorithm thus skips to the end of this interval. Further repetition of this recipe causes the method to either approach the nearest surface intersection :math:`\mathbf{p}` (visualized below) or escape to infinity.

.. image:: images/sdf.svg
   :width: 500
   :align: center
   :alt: Sphere tracing
   :class: only-light

.. image:: images/sdf-dark.svg
   :width: 500
   :align: center
   :alt: Sphere tracing
   :class: only-dark

The following sphere tracer runs for a fixed 10 iterations and, for simplicity, lacks various common optimizations. The function :py:func:`fma` performs a *fused multiply-add*, i.e., it evaluates ``fma(a, b, c) = a*b + c`` with reduced rounding error and better performance.

.. code-block:: python

   def trace(o: Array3f, d: Array3f) -> Array3f:
       for i in range(10):
           o = dr.fma(d, sdf(o), o)
       return o

So far, so good. Now suppose ``p = trace(o, d)`` finds an intersection ``p``. To use this information to create an image, we must *shade* it (i.e., assign an intensity value). Many different shading models exist; a simple approach is to compute the inner product of the *surface normal* and the direction :math:`\mathbf{l}` towards a light source. Intuitively, the surface becomes brighter as it more directly faces the light source.
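The sphere tracing recipe and the inner-product shading model can be checked by hand for a single ray. The following is a minimal scalar sketch in plain Python that does not use Dr.Jit; the tuple-based helpers are written only for illustration, and the shortcut of using the hit position as the normal is an assumption that holds solely for a unit sphere centered at the origin.

```python
import math

def sdf(p):
    """Signed distance from point p to a unit sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - 1.0

def trace(o, d, steps=10):
    """Sphere tracing: repeatedly advance along d by the current distance value."""
    for _ in range(steps):
        t = sdf(o)
        o = tuple(oc + dc * t for oc, dc in zip(o, d))
    return o

def shade(n, l):
    """Clamped inner product of the surface normal n and the light direction l."""
    return max(0.0, sum(nc * lc for nc, lc in zip(n, l)))

# A ray that starts at (0, 0, -2) and points straight at the sphere
p = trace(o=(0.0, 0.0, -2.0), d=(0.0, 0.0, 1.0))

# Assumption: on a unit sphere centered at the origin, the normal equals the position
n = p

head_on = shade(n, (0.0, 0.0, -1.0))  # normal points directly at the light
grazing = shade(n, (0.0, 1.0, 0.0))   # light direction is perpendicular to the normal
away = shade(n, (0.0, 0.0, 1.0))      # surface faces away from the light
```

The ray reaches the surface point :math:`(0, 0, -1)` after a single step and then stops moving because the distance value there is zero. The three shading values illustrate the clamping: the surface is brightest when it faces the light and black when it is perpendicular to it or faces away.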
In the case of a signed distance function, the surface normal at :math:`\mathbf{p}` is given by the gradient vector :math:`\nabla \mathrm{sdf}(\mathbf{p})` so that this shading model entails computing

.. math::

   \mathrm{max}\{0, \nabla \mathrm{sdf}(\mathbf{p}) \cdot \mathbf{l}\}

The gradient can be estimated using central `finite differences `_ with step size ``eps=1e-3``, which yields the following rudimentary shading routine (we will improve upon it shortly).

.. code-block:: python

   def shade(p: Array3f, l: Array3f, eps: float = 1e-3) -> Float:
       n = Array3f(
           sdf(p + [eps, 0, 0]) - sdf(p - [eps, 0, 0]),
           sdf(p + [0, eps, 0]) - sdf(p - [0, eps, 0]),
           sdf(p + [0, 0, eps]) - sdf(p - [0, 0, eps])
       ) / (2 * eps)
       return dr.maximum(0, dr.dot(n, l))

To create an image, we must generate a set of rays that will be processed by these functions. We begin by creating a ``Float`` array with 1000 linearly spaced elements covering the interval :math:`[-1, 1]` and then expand this into a set of :math:`1000\times 1000` :math:`x` and :math:`y` grid coordinates. The :py:func:`linspace` and :py:func:`meshgrid` functions resemble their namesakes in array programming libraries like NumPy.

.. code-block:: python

   x = dr.linspace(Float, -1, 1, 1000)
   x, y = dr.meshgrid(x, x)

This is a good point for a small digression to explain a major difference from tools like NumPy.

Tracing and delayed evaluation
------------------------------

In most array programming frameworks, the previous two commands would have created arrays representing actual data (grid coordinates in this example). Dr.Jit uses a different approach termed *tracing* to delay the evaluation of computation. In particular, no arithmetic took place during the two preceding steps: instead, Dr.Jit recorded a graph representing the sequence of steps that are needed to *eventually* compute ``x`` and ``y`` (which are represented by the bottom two nodes in the visualization below).

.. 
image:: images/graph.png
   :width: 400
   :align: center
   :alt: Computation graph of previous steps
   :class: only-light

.. image:: images/graph-dark.png
   :width: 400
   :align: center
   :alt: Computation graph of previous steps
   :class: only-dark

.. note::

   To view a computation graph like this on your own machine, you must install `GraphViz `_ on your system along with the `graphviz `_ Python package. Following this, you can run ``dr.graphviz().view()``.

Clearly, evaluation cannot be postponed indefinitely: we will eventually want to look at the generated image. At this point, Dr.Jit will take all recorded steps, compile them into an optimized *kernel*, and run it on the GPU or CPU. This all happens transparently behind the scenes.

What are the benefits of doing things in this way? Merging multiple steps of a computation into a kernel (often called *fusion*) means that these steps can exchange information using fast register memory. This allows them to spend more time on the actual computation as opposed to reading and writing main memory (which is slow). Tracing also opens up other optimization opportunities described in the `paper and video `_ on the system's design. Dr.Jit can trace very large programs without interruption and use the graph representation to simplify them.

Example, continued
------------------

We will now use the previously computed grid points to define a virtual camera plane with pixel positions :math:`(x, y, 1)` relative to a pinhole at :math:`(0, 0, -2)` and simultaneously perform sphere tracing along every associated ray.

.. code-block:: python

   p = trace(o=Array3f(0, 0, -2), d=dr.normalize(Array3f(x, y, 1)))

Next, we can shade the intersected points for light arriving from direction :math:`(0, -1, -1)`. Note the *masked assignment* at the bottom, which disables shading for rays that did not intersect anything.

.. 
code-block:: python

   sh = shade(p, l=Array3f(0, -1, -1))
   sh[sdf(p) > .1] = 0

We multiply and offset the shaded value with an ambient and highlight color. The resulting variable ``img`` associates an RGB color value with every pixel.

.. code-block:: python

   img = Array3f(.1, .1, .2) + Array3f(.4, .4, .2) * sh

If you are used to array programming frameworks like NumPy/PyTorch, it may be tempting to think of ``img`` as a tensor that points to a ``3xN`` or ``Nx3``-shaped block of memory (where ``N`` is the pixel count).

Dr.Jit instead traces computation for delayed evaluation, which means that no actual computation has occurred so far. The 3D array ``img`` (type :py:class:`drjit.cuda.Array3f`) consists of 3 components (``img.x``, ``img.y``, and ``img.z``) of type :py:class:`drjit.cuda.Float`, of which each represents an intermediate variable within a steadily growing program of the following high-level structure.

.. code-block:: python

   # For illustration only, not part of the running example
   for i in range(1000000): # (in parallel)
       # .. earlier steps ..
       img_x = .1 + .4 * sh
       img_y = .1 + .4 * sh
       img_z = .2 + .2 * sh

This program performs a parallel loop over :math:`1000\times1000` pixels. Subsequent Dr.Jit operations will simply add further steps to this program. For example, we can invoke :py:func:`ravel` to flatten the 3D array into a :py:class:`drjit.cuda.Float` array.

.. code-block:: python

   img_flat = dr.ravel(img)

Conceptually, this adds three more lines to the program:

.. code-block:: python

   # For illustration only, not part of the running example
   for i in range(1000000): # (in parallel)
       # .. earlier steps ..
       img_flat[i*3 + 0] = img_x
       img_flat[i*3 + 1] = img_y
       img_flat[i*3 + 2] = img_z

This is essentially *metaprogramming*: running the program generates *another* program that will run at some later point and perform the actual computation. This all happens automatically and is key to the efficiency of Dr.Jit.
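The idea of a program that generates another program can be illustrated without Dr.Jit. The following pure-Python sketch is a toy model and not how Dr.Jit is implemented: the ``Var`` class and the ``to_source()`` and ``evaluate()`` helpers are hypothetical names invented for this illustration. Arithmetic on ``Var`` objects only records graph nodes, and values are computed later on request.

```python
class Var:
    """Node of a toy computation graph (hypothetical, for illustration only)."""

    def __init__(self, op, args=(), value=None):
        self.op = op        # "input", "const", "add", or "mul"
        self.args = args    # operand nodes of an arithmetic step
        self.value = value  # input name or constant value

    @staticmethod
    def wrap(x):
        """Turn plain Python numbers into constant nodes."""
        return x if isinstance(x, Var) else Var("const", value=x)

    def __add__(self, other):
        # Record the addition instead of performing it
        return Var("add", (self, Var.wrap(other)))

    def __mul__(self, other):
        # Record the multiplication instead of performing it
        return Var("mul", (self, Var.wrap(other)))

    # Both operations commute, so reflected operands reuse the same code
    __radd__ = __add__
    __rmul__ = __mul__


def to_source(v):
    """Render the recorded steps as an expression string."""
    if v.op == "const":
        return repr(v.value)
    if v.op == "input":
        return v.value
    a, b = (to_source(arg) for arg in v.args)
    return f"({a} {'+' if v.op == 'add' else '*'} {b})"


def evaluate(v, inputs):
    """Run the recorded program for one set of input values."""
    if v.op == "const":
        return v.value
    if v.op == "input":
        return inputs[v.value]
    a, b = (evaluate(arg, inputs) for arg in v.args)
    return a + b if v.op == "add" else a * b


sh = Var("input", value="sh")
img_x = 0.1 + 0.4 * sh  # no arithmetic happens here, only recording

print(to_source(img_x))
print(evaluate(img_x, {"sh": 0.5}))
```

Here, ``to_source(img_x)`` returns ``((sh * 0.4) + 0.1)``; the operands appear swapped because the reflected operators place the ``Var`` first. The numeric result only exists once ``evaluate()`` is called, which loosely mirrors how a traced program runs at a later point.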
Dr.Jit also supports arbitrarily sized tensors of various types (for example, :py:class:`drjit.cuda.TensorXf` for a CUDA ``float32`` tensor). Tensors are useful for *data exchange* with other array programming frameworks. For example, we can reshape the flat image buffer into a :math:`1000\times 1000\times 3` image tensor and then visualize it using `matplotlib `_.

.. code-block:: python

   img_t = TensorXf(img_flat, shape=(1000, 1000, 3))

   import matplotlib.pyplot as plt
   plt.imshow(img_t)
   plt.show()

.. warning::

   Despite the presence of a tensor type, Dr.Jit is *not* a tensor/array programming library. Heavy use of tensor operations like slice-based indexing may lead to poor performance, since they impede Dr.Jit's ability to *fuse* many operations into large kernels.

   Programs should be mainly written in terms of 1D arrays (:py:class:`drjit.cuda.Float`, :py:class:`drjit.cuda.UInt32`, :py:class:`drjit.cuda.Int64`, etc.) and fixed-size combinations. For example, :py:class:`drjit.cuda.Matrix4f` wraps :math:`4\times 4=16` :py:class:`drjit.cuda.Float` instances, each of which represents a variable in the program.

The line ``plt.imshow(img_t)`` will access the image contents, and it is at this point that the traced program runs on the GPU, producing the following output:

.. image:: images/sphere.png
   :width: 400
   :align: center
   :alt: Computed image of a sphere
   :class: only-light

.. image:: images/sphere-dark.png
   :width: 400
   :align: center
   :alt: Computed image of a sphere
   :class: only-dark

.. admonition:: Complete example code up to this point.
   :class: dropdown

   .. 
code-block:: python

      import drjit as dr
      from drjit.cuda import Float, UInt32, Array3f, Array2f, TensorXf, Texture3f, PCG32, Loop

      def sdf(p: Array3f) -> Float:
          return dr.norm(p) - 1

      def trace(o: Array3f, d: Array3f) -> Array3f:
          for i in range(10):
              o = dr.fma(d, sdf(o), o)
          return o

      def shade(p: Array3f, l: Array3f, eps: float = 1e-3) -> Float:
          n = Array3f(
              sdf(p + [eps, 0, 0]) - sdf(p - [eps, 0, 0]),
              sdf(p + [0, eps, 0]) - sdf(p - [0, eps, 0]),
              sdf(p + [0, 0, eps]) - sdf(p - [0, 0, eps])
          ) / (2 * eps)
          return dr.maximum(0, dr.dot(n, l))

      x = dr.linspace(Float, -1, 1, 1000)
      x, y = dr.meshgrid(x, x)

      p = trace(o=Array3f(0, 0, -2),
                d=dr.normalize(Array3f(x, y, 1)))

      sh = shade(p, l=Array3f(0, -1, -1))
      sh[sdf(p) > .1] = 0

      img = Array3f(.1, .1, .2) + Array3f(.4, .4, .2) * sh
      img_flat = dr.ravel(img)

      img_t = TensorXf(img_flat, shape=(1000, 1000, 3))

      import matplotlib.pyplot as plt
      plt.imshow(img_t)
      plt.show()

Textures, random number generation
----------------------------------

The previous example was a little bland, so let's make it more interesting! We will deform the sphere by perturbing the implicitly defined surface with a noise function.

Dr.Jit was originally designed for `Monte Carlo methods `_ that heavily rely on random sampling, and it ships with Melissa O'Neill's `PCG32 `_ pseudorandom number generator to help with such applications. Here, we use PCG32 to generate a relatively small set of uniformly distributed variates covering the interval :math:`[0, 1]`.

.. code-block:: python

   noise = PCG32(size=16*16*16).next_float32()

We can then create a noise texture from these uniform variates. The command below allocates a 3D texture with a resolution of :math:`16\times16\times 16` and :math:`1` color channel.

.. code-block:: python

   noise_tex = Texture3f(TensorXf(noise, shape=(16, 16, 16, 1)))

We finally replace the ``sdf()`` function with a modified version that evaluates the texture with an offset and scaled value of ``p`` to slightly perturb the level set.
This uses the GPU texture units on the CUDA backend and a software-interpolated lookup in the LLVM backend.

.. code-block:: python

   def sdf(p: Array3f) -> Float:
       sdf_value = dr.norm(p) - 1
       sdf_value += noise_tex.eval(dr.fma(p, 0.5, 0.5))[0] * 0.1
       return sdf_value

Let us also add the following line at the beginning of the program, which causes Dr.Jit to emit a brief message whenever it compiles and runs a kernel.

.. code-block:: python

   dr.set_log_level(dr.LogLevel.Info)

Re-running the program produces the following output:

.. image:: images/sphere2.png
   :width: 400
   :align: center
   :alt: Computed image of a sphere with trilinear noise
   :class: only-light

.. image:: images/sphere2-dark.png
   :width: 400
   :align: center
   :alt: Computed image of a sphere with trilinear noise
   :class: only-dark

Why does it look so *faceted*? The texture uses trilinear interpolation, and the surface normal is given by the *derivative* of the interpolant (meaning that it will be *piecewise constant*). Dr.Jit also provides higher-order tricubic interpolation that internally reduces to eight hardware-accelerated texture lookups. We can use it to redefine ``sdf()`` once more:

.. code-block:: python

   def sdf(p: Array3f) -> Float:
       sdf_value = dr.norm(p) - 1
       sdf_value += noise_tex.eval_cubic(dr.fma(p, 0.5, 0.5))[0] * 0.1
       return sdf_value

With this implementation, we obtain a smooth bumpy sphere.

.. image:: images/sphere3.png
   :width: 400
   :align: center
   :alt: Computed image of a sphere with tricubic noise
   :class: only-light

.. image:: images/sphere3-dark.png
   :width: 400
   :align: center
   :alt: Computed image of a sphere with tricubic noise
   :class: only-dark

.. admonition:: Complete example code up to this point.
   :class: dropdown

   .. 
code-block:: python

      import drjit as dr
      from drjit.cuda import Float, UInt32, Array3f, Array2f, TensorXf, Texture3f, PCG32, Loop

      dr.set_log_level(dr.LogLevel.Info)

      noise = PCG32(size=16*16*16).next_float32()
      noise_tex = Texture3f(TensorXf(noise, shape=(16, 16, 16, 1)))

      def sdf(p: Array3f) -> Float:
          sdf_value = dr.norm(p) - 1
          sdf_value += noise_tex.eval_cubic(dr.fma(p, 0.5, 0.5))[0] * 0.1
          return sdf_value

      def trace(o: Array3f, d: Array3f) -> Array3f:
          for i in range(10):
              o = dr.fma(d, sdf(o), o)
          return o

      def shade(p: Array3f, l: Array3f, eps: float = 1e-3) -> Float:
          n = Array3f(
              sdf(p + [eps, 0, 0]) - sdf(p - [eps, 0, 0]),
              sdf(p + [0, eps, 0]) - sdf(p - [0, eps, 0]),
              sdf(p + [0, 0, eps]) - sdf(p - [0, 0, eps])
          ) / (2 * eps)
          return dr.maximum(0, dr.dot(n, l))

      x = dr.linspace(Float, -1, 1, 1000)
      x, y = dr.meshgrid(x, x)

      p = trace(o=Array3f(0, 0, -2),
                d=dr.normalize(Array3f(x, y, 1)))

      sh = shade(p, l=Array3f(0, -1, -1))
      sh[sdf(p) > .1] = 0

      img = Array3f(.1, .1, .2) + Array3f(.4, .4, .2) * sh
      img_flat = dr.ravel(img)

      img_t = TensorXf(img_flat, shape=(1000, 1000, 3))

      import matplotlib.pyplot as plt
      plt.imshow(img_t)
      plt.show()

Kernel launches, caching
------------------------

Besides generating an image, the last experiment also produced several log messages enabled by the call to :py:func:`dr.set_log_level()`.

.. code-block:: pycon
   :emphasize-lines: 2, 3, 6, 7

   jit_eval(): launching 1 kernel.
     -> launching 17509add1324abde (n=4096, in=0, out=1, ops=41, jit=15.073 us):
        cache miss, build: 576.932 us, 3.375 KiB.
   jit_eval(): done.
   jit_eval(): launching 1 kernel.
     -> launching 87908afce75f85b5 (n=1000000, in=5, out=0, se=3, ops=2114, jit=330.965 us):
        cache miss, build: 1.17021 ms, 30.38 KiB.
   jit_eval(): done.

Several things are noteworthy here:

- Dr.Jit launched *two* kernels: the first one to compute the noise texture with ``n=4096`` texels, followed by the main rendering step that computed ``n=1000000`` image pixels.
- The second kernel is *big* and contains over two thousand operations (``ops=2114``).

- It generated those kernels for the first time (``cache miss``) and so had to perform a somewhat expensive compilation step to generate machine code. If you re-run the example a second time, this part of the message will change to ``cache hit``, and the compilation is skipped. Dr.Jit stores cached kernels on disk in the ``~/.drjit`` directory on Linux/macOS, and in ``~/AppData/Local/Temp/drjit`` on Windows. Dr.Jit was originally designed to accelerate gradient-based optimization; caching is particularly useful in this context, since the expensive compilation step will only run once during the first gradient step.

- If you are using the LLVM backend, the kernel will be even larger:

  .. code-block:: pycon
     :emphasize-lines: 2, 3

     jit_eval(): launching 1 kernel.
       -> launching 6e8cadb52477dd91 (n=1000000, in=5, out=0, se=3, ops=7560, jit=2.92385 ms):
          cache miss, build: 2.411 s, 78.25 KiB.
     jit_eval(): done.

  The CPU does not have hardware texturing instructions and must emulate them, which causes this size increase to over 7K instructions. While tracing is fast (2.9 milliseconds), the one-time compilation step now takes almost 2.5 seconds!

What leads to these large kernels? Not only does the bumpy sphere SDF generate more code, but Dr.Jit's computation graph also contains it a whopping 17 times: 10 times for sphere tracing steps, 6 times for finite differences-based normal computation, and one final time for the masked assignment that disables pixels without valid intersections.

This doesn't seem like a good way of using the system. Let's improve the example!

Recorded loops
--------------

A first inefficiency is that a normal Python ``for`` loop is unrolled during tracing, producing an unnecessarily large trace that is expensive to compile. It is also inflexible: there is no easy way to stop the sphere tracing iteration early when it is sufficiently close to the surface.
Dr.Jit provides a *recorded loop* primitive to address these and related limitations. To use it, replace the earlier sphere tracing implementation

.. code-block:: python

   # Old version
   def trace(o: Array3f, d: Array3f) -> Array3f:
       for i in range(10):
           o = dr.fma(d, sdf(o), o)
       return o

by the following improved version:

.. code-block:: python

   # Improved version
   def trace(o: Array3f, d: Array3f) -> Array3f:
       i = UInt32(0)
       loop = Loop("Sphere tracing", lambda: (o, i))
       while loop(i < 10):
           o = dr.fma(d, sdf(o), o)
           i += 1
       return o

Expressed in this form, Dr.Jit will only trace the body *once* and make note of the fact that it must loop on the device while the condition ``i < 10`` holds. The condition is itself a Dr.Jit array, and elements can therefore run the loop for different numbers of iterations.

For this all to work correctly, Dr.Jit needs to know what variables are modified by the loop body. The ``lambda: (o, i)`` parameter serves this role and allows the system to detect when variables are changed or entirely overwritten. The label ``"Sphere tracing"`` will be added to generated PTX/LLVM code and can be helpful when looking at kernels of programs containing many loops.

This simple change halves the operation count.

Automatic differentiation
-------------------------

Next, we can examine the ``shade()`` function that evaluated the SDF 6 times to compute an approximate derivative, which was a source of inefficiency:

.. code-block:: python

   # Old version
   def shade(p: Array3f, l: Array3f, eps: float = 1e-3) -> Float:
       n = Array3f(
           sdf(p + [eps, 0, 0]) - sdf(p - [eps, 0, 0]),
           sdf(p + [0, eps, 0]) - sdf(p - [0, eps, 0]),
           sdf(p + [0, 0, eps]) - sdf(p - [0, 0, eps])
       ) / (2 * eps)
       return dr.maximum(0, dr.dot(n, l))

Dr.Jit includes an `automatic differentiation `_ layer to analytically differentiate expressions, producing code that is more efficient *and* more accurate.

To use the AD layer, simply append ``.ad`` to the module name in the import statement at the top of the program.
For example, for the CUDA backend, you would write:

.. code-block:: python

   from drjit.cuda.ad import Float, UInt32, Array3f, Array2f, TensorXf, Texture3f, PCG32, Loop

There is essentially no extra cost for using types from the ``.ad`` namespace when gradient tracking isn't explicitly enabled for a variable, so you can simply use them everywhere by default.

The AD version of ``shade()`` invokes :py:func:`drjit.enable_grad()` to track the differential dependence of subsequent variables on the position ``p``. It then evaluates the SDF just once, which records the structure of the computation into a graph representation. The next two lines set an input gradient at ``p`` and propagate the derivative to the output ``value``, which results in the desired directional derivative :math:`\nabla \mathrm{sdf}(\mathbf{p}) \cdot \mathbf{l}`.

.. code-block:: python

   # Improved version
   def shade(p: Array3f, l: Array3f) -> Float:
       dr.enable_grad(p)
       value = sdf(p)
       dr.set_grad(p, l)
       dr.forward_to(value)
       return dr.maximum(0, dr.grad(value))

The :py:func:`dr.forward_to()` call materializes the AD-based derivatives into ordinary computation that is traced along with the rest of the program. This reduces the operation count by another factor of 2, and compilation time is now consistently between 30 and 90 milliseconds across backends.

.. admonition:: Complete example code including optimizations
   :class: dropdown

   .. 
code-block:: python

      import drjit as dr
      from drjit.cuda.ad import Float, UInt32, Array3f, Array2f, TensorXf, Texture3f, PCG32, Loop

      dr.set_log_level(dr.LogLevel.Info)

      noise = PCG32(size=16*16*16).next_float32()
      noise_tex = Texture3f(TensorXf(noise, shape=(16, 16, 16, 1)))

      def sdf(p: Array3f) -> Float:
          sdf_value = dr.norm(p) - 1
          sdf_value += noise_tex.eval_cubic(dr.fma(p, 0.5, 0.5))[0] * 0.1
          return sdf_value

      def trace(o: Array3f, d: Array3f) -> Array3f:
          i = UInt32(0)
          loop = Loop("Sphere tracing", lambda: (o, i))
          while loop(i < 10):
              o = dr.fma(d, sdf(o), o)
              i += 1
          return o

      def shade(p: Array3f, l: Array3f) -> Float:
          dr.enable_grad(p)
          value = sdf(p)
          dr.set_grad(p, l)
          dr.forward_to(value)
          return dr.maximum(0, dr.grad(value))

      x = dr.linspace(Float, -1, 1, 1000)
      x, y = dr.meshgrid(x, x)

      p = trace(o=Array3f(0, 0, -2),
                d=dr.normalize(Array3f(x, y, 1)))

      sh = shade(p, l=Array3f(0, -1, -1))
      sh[sdf(p) > .1] = 0

      img = Array3f(.1, .1, .2) + Array3f(.4, .4, .2) * sh
      img_flat = dr.ravel(img)

      img_t = TensorXf(img_flat, shape=(1000, 1000, 3))

      import matplotlib.pyplot as plt
      plt.imshow(img_t)
      plt.show()

Dr.Jit can propagate derivatives in forward mode (shown here) and in reverse mode; the latter is useful for gradient-based optimization of programs with many inputs.

This concludes the running example. For those interested in the nitty-gritty details and quality of the generated code, we include an example of the PTX output produced by Dr.Jit below.

.. admonition:: PTX intermediate representation produced by this example
   :class: dropdown

   .. 
code-block:: asm .version 6.0 .target sm_60 .address_size 64 .entry drjit_b6460a9f61ed83ee22ab62b4db19ee5b(.param .align 8 .b8 params[48]) { .reg.b8 %b <398>; .reg.b16 %w<398>; .reg.b32 %r<398>; .reg.b64 %rd<398>; .reg.f32 %f<398>; .reg.f64 %d<398>; .reg.pred %p <398>; mov.u32 %r0, %ctaid.x; mov.u32 %r1, %ntid.x; mov.u32 %r2, %tid.x; mad.lo.u32 %r0, %r0, %r1, %r2; ld.param.u32 %r2, [params]; setp.ge.u32 %p0, %r0, %r2; @%p0 bra done; mov.u32 %r3, %nctaid.x; mul.lo.u32 %r1, %r3, %r1; body: // sm_75 ld.param.u64 %rd0, [params+8]; ldu.global.u32 %r4, [%rd0]; ld.param.u64 %rd0, [params+16]; ldu.global.u32 %r5, [%rd0]; ld.param.u64 %rd0, [params+24]; ldu.global.u32 %r6, [%rd0]; ld.param.u64 %rd7, [params+32]; mov.b32 %f8, 0x3b033405; mov.b32 %f9, 0xbf800000; mov.u32 %r10, %r0; mov.b32 %r11, 0x624dd30; mul.hi.u32 %r12, %r11, %r10; sub.u32 %r13, %r10, %r12; mov.b32 %r14, 0x1; shr.b32 %r15, %r13, %r14; add.u32 %r16, %r15, %r12; mov.b32 %r17, 0x9; shr.b32 %r18, %r16, %r17; neg.s32 %r19, %r18; mov.b32 %r20, 0x3e8; mad.lo.u32 %r21, %r19, %r20, %r10; cvt.rn.f32.u32 %f22, %r18; fma.rn.ftz.f32 %f23, %f22, %f8, %f9; cvt.rn.f32.u32 %f24, %r21; fma.rn.ftz.f32 %f25, %f24, %f8, %f9; mov.b32 %f26, 0x0; mov.b32 %f27, 0xc0000000; mov.b32 %f28, 0x3f800000; mul.ftz.f32 %f29, %f25, %f25; fma.rn.ftz.f32 %f30, %f23, %f23, %f29; add.ftz.f32 %f31, %f28, %f30; rsqrt.approx.ftz.f32 %f32, %f31; mul.ftz.f32 %f33, %f25, %f32; mul.ftz.f32 %f34, %f23, %f32; mov.b32 %r35, 0x0; // Loop (Sphere tracing) [in 0, cond] mov.f32 %f36, %f26; // Loop (Sphere tracing) [in 1, cond] mov.f32 %f37, %f26; // Loop (Sphere tracing) [in 2, cond] mov.f32 %f38, %f27; // Loop (Sphere tracing) [in 3, cond] mov.u32 %r39, %r35; l_40_cond: // Loop (Sphere tracing) mov.b32 %r41, 0xa; setp.lo.u32 %p42, %r39, %r41; @!%p42 bra l_40_done; l_40_body: // Loop (Sphere tracing) [in 0, body] mov.f32 %f44, %f36; // Loop (Sphere tracing) [in 1, body] mov.f32 %f45, %f37; // Loop (Sphere tracing) [in 2, body] mov.f32 %f46, %f38; // Loop 
(Sphere tracing) [in 3, body] mov.u32 %r47, %r39; mul.ftz.f32 %f48, %f44, %f44; fma.rn.ftz.f32 %f49, %f45, %f45, %f48; fma.rn.ftz.f32 %f50, %f46, %f46, %f49; sqrt.approx.ftz.f32 %f51, %f50; mov.b32 %f52, 0x3f800000; sub.ftz.f32 %f53, %f51, %f52; mov.b32 %f54, 0x3f000000; fma.rn.ftz.f32 %f55, %f44, %f54, %f54; fma.rn.ftz.f32 %f56, %f45, %f54, %f54; fma.rn.ftz.f32 %f57, %f46, %f54, %f54; mov.pred %p58, 0x1; cvt.rn.f32.u32 %f59, %r6; cvt.rn.f32.u32 %f60, %r5; cvt.rn.f32.u32 %f61, %r4; mov.b32 %f62, 0xbf000000; fma.rn.ftz.f32 %f63, %f55, %f59, %f62; fma.rn.ftz.f32 %f64, %f56, %f60, %f62; fma.rn.ftz.f32 %f65, %f57, %f61, %f62; cvt.rmi.f32.f32 %f66, %f63; cvt.rzi.s32.f32 %r67, %f66; cvt.rmi.f32.f32 %f68, %f64; cvt.rzi.s32.f32 %r69, %f68; cvt.rmi.f32.f32 %f70, %f65; cvt.rzi.s32.f32 %r71, %f70; cvt.rn.f32.s32 %f72, %r67; cvt.rn.f32.s32 %f73, %r69; cvt.rn.f32.s32 %f74, %r71; sub.ftz.f32 %f75, %f63, %f72; sub.ftz.f32 %f76, %f64, %f73; sub.ftz.f32 %f77, %f65, %f74; rcp.approx.ftz.f32 %f78, %f59; rcp.approx.ftz.f32 %f79, %f60; rcp.approx.ftz.f32 %f80, %f61; mul.ftz.f32 %f81, %f75, %f75; mul.ftz.f32 %f82, %f81, %f75; mov.b32 %f83, 0x3e2aaaab; neg.ftz.f32 %f84, %f82; mov.b32 %f85, 0x40400000; mul.ftz.f32 %f86, %f85, %f81; add.ftz.f32 %f87, %f84, %f86; mul.ftz.f32 %f88, %f85, %f75; sub.ftz.f32 %f89, %f87, %f88; add.ftz.f32 %f90, %f89, %f52; mul.ftz.f32 %f91, %f90, %f83; mul.ftz.f32 %f92, %f85, %f82; mov.b32 %f93, 0x40c00000; mul.ftz.f32 %f94, %f93, %f81; sub.ftz.f32 %f95, %f92, %f94; mov.b32 %f96, 0x40800000; add.ftz.f32 %f97, %f95, %f96; mul.ftz.f32 %f98, %f97, %f83; mul.ftz.f32 %f99, %f82, %f83; add.ftz.f32 %f100, %f91, %f98; sub.ftz.f32 %f101, %f52, %f100; sub.ftz.f32 %f102, %f72, %f54; div.approx.ftz.f32 %f103, %f98, %f100; add.ftz.f32 %f104, %f102, %f103; mul.ftz.f32 %f105, %f104, %f78; mov.b32 %f106, 0x3fc00000; add.ftz.f32 %f107, %f72, %f106; div.approx.ftz.f32 %f108, %f99, %f101; add.ftz.f32 %f109, %f107, %f108; mul.ftz.f32 %f110, %f109, %f78; mul.ftz.f32 %f111, %f76, 
        %f76;
        mul.ftz.f32 %f112, %f111, %f76;
        neg.ftz.f32 %f113, %f112;
        mul.ftz.f32 %f114, %f85, %f111;
        add.ftz.f32 %f115, %f113, %f114;
        mul.ftz.f32 %f116, %f85, %f76;
        sub.ftz.f32 %f117, %f115, %f116;
        add.ftz.f32 %f118, %f117, %f52;
        mul.ftz.f32 %f119, %f118, %f83;
        mul.ftz.f32 %f120, %f85, %f112;
        mul.ftz.f32 %f121, %f93, %f111;
        sub.ftz.f32 %f122, %f120, %f121;
        add.ftz.f32 %f123, %f122, %f96;
        mul.ftz.f32 %f124, %f123, %f83;
        mul.ftz.f32 %f125, %f112, %f83;
        add.ftz.f32 %f126, %f119, %f124;
        sub.ftz.f32 %f127, %f52, %f126;
        sub.ftz.f32 %f128, %f73, %f54;
        div.approx.ftz.f32 %f129, %f124, %f126;
        add.ftz.f32 %f130, %f128, %f129;
        mul.ftz.f32 %f131, %f130, %f79;
        add.ftz.f32 %f132, %f73, %f106;
        div.approx.ftz.f32 %f133, %f125, %f127;
        add.ftz.f32 %f134, %f132, %f133;
        mul.ftz.f32 %f135, %f134, %f79;
        mul.ftz.f32 %f136, %f77, %f77;
        mul.ftz.f32 %f137, %f136, %f77;
        neg.ftz.f32 %f138, %f137;
        mul.ftz.f32 %f139, %f85, %f136;
        add.ftz.f32 %f140, %f138, %f139;
        mul.ftz.f32 %f141, %f85, %f77;
        sub.ftz.f32 %f142, %f140, %f141;
        add.ftz.f32 %f143, %f142, %f52;
        mul.ftz.f32 %f144, %f143, %f83;
        mul.ftz.f32 %f145, %f85, %f137;
        mul.ftz.f32 %f146, %f93, %f136;
        sub.ftz.f32 %f147, %f145, %f146;
        add.ftz.f32 %f148, %f147, %f96;
        mul.ftz.f32 %f149, %f148, %f83;
        mul.ftz.f32 %f150, %f137, %f83;
        add.ftz.f32 %f151, %f144, %f149;
        sub.ftz.f32 %f152, %f52, %f151;
        sub.ftz.f32 %f153, %f74, %f54;
        div.approx.ftz.f32 %f154, %f149, %f151;
        add.ftz.f32 %f155, %f153, %f154;
        mul.ftz.f32 %f156, %f155, %f80;
        add.ftz.f32 %f157, %f74, %f106;
        div.approx.ftz.f32 %f158, %f150, %f152;
        add.ftz.f32 %f159, %f157, %f158;
        mul.ftz.f32 %f160, %f159, %f80;
        .reg.v4.f32 %u161;
        mov.v4.f32 %u161, { %f105, %f131, %f156, %f156 };
        .reg.v4.f32 %u162;
        @%p58 tex.3d.v4.f32.f32 %u162, [%rd7, %u161];
        @!%p58 mov.v4.f32 %u162, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f163, %u162.r;
        .reg.v4.f32 %u164;
        mov.v4.f32 %u164, { %f105, %f131, %f160, %f160 };
        .reg.v4.f32 %u165;
        @%p58 tex.3d.v4.f32.f32 %u165, [%rd7, %u164];
        @!%p58 mov.v4.f32 %u165, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f166, %u165.r;
        .reg.v4.f32 %u167;
        mov.v4.f32 %u167, { %f105, %f135, %f156, %f156 };
        .reg.v4.f32 %u168;
        @%p58 tex.3d.v4.f32.f32 %u168, [%rd7, %u167];
        @!%p58 mov.v4.f32 %u168, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f169, %u168.r;
        .reg.v4.f32 %u170;
        mov.v4.f32 %u170, { %f105, %f135, %f160, %f160 };
        .reg.v4.f32 %u171;
        @%p58 tex.3d.v4.f32.f32 %u171, [%rd7, %u170];
        @!%p58 mov.v4.f32 %u171, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f172, %u171.r;
        .reg.v4.f32 %u173;
        mov.v4.f32 %u173, { %f110, %f131, %f156, %f156 };
        .reg.v4.f32 %u174;
        @%p58 tex.3d.v4.f32.f32 %u174, [%rd7, %u173];
        @!%p58 mov.v4.f32 %u174, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f175, %u174.r;
        .reg.v4.f32 %u176;
        mov.v4.f32 %u176, { %f110, %f131, %f160, %f160 };
        .reg.v4.f32 %u177;
        @%p58 tex.3d.v4.f32.f32 %u177, [%rd7, %u176];
        @!%p58 mov.v4.f32 %u177, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f178, %u177.r;
        .reg.v4.f32 %u179;
        mov.v4.f32 %u179, { %f110, %f135, %f156, %f156 };
        .reg.v4.f32 %u180;
        @%p58 tex.3d.v4.f32.f32 %u180, [%rd7, %u179];
        @!%p58 mov.v4.f32 %u180, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f181, %u180.r;
        .reg.v4.f32 %u182;
        mov.v4.f32 %u182, { %f110, %f135, %f160, %f160 };
        .reg.v4.f32 %u183;
        @%p58 tex.3d.v4.f32.f32 %u183, [%rd7, %u182];
        @!%p58 mov.v4.f32 %u183, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f184, %u183.r;
        neg.ftz.f32 %f185, %f151;
        fma.rn.ftz.f32 %f186, %f166, %f185, %f166;
        fma.rn.ftz.f32 %f187, %f163, %f151, %f186;
        fma.rn.ftz.f32 %f188, %f172, %f185, %f172;
        fma.rn.ftz.f32 %f189, %f169, %f151, %f188;
        fma.rn.ftz.f32 %f190, %f178, %f185, %f178;
        fma.rn.ftz.f32 %f191, %f175, %f151, %f190;
        fma.rn.ftz.f32 %f192, %f184, %f185, %f184;
        fma.rn.ftz.f32 %f193, %f181, %f151, %f192;
        neg.ftz.f32 %f194, %f126;
        fma.rn.ftz.f32 %f195, %f189, %f194, %f189;
        fma.rn.ftz.f32 %f196, %f187, %f126, %f195;
        fma.rn.ftz.f32 %f197, %f193, %f194, %f193;
        fma.rn.ftz.f32 %f198, %f191, %f126, %f197;
        neg.ftz.f32 %f199, %f100;
        fma.rn.ftz.f32 %f200, %f198, %f199, %f198;
        fma.rn.ftz.f32 %f201, %f196, %f100, %f200;
        mov.b32 %f202, 0x3dcccccd;
        mul.ftz.f32 %f203, %f201, %f202;
        add.ftz.f32 %f204, %f53, %f203;
        fma.rn.ftz.f32 %f205, %f33, %f204, %f44;
        fma.rn.ftz.f32 %f206, %f34, %f204, %f45;
        fma.rn.ftz.f32 %f207, %f32, %f204, %f46;
        mov.b32 %r208, 0x1;
        add.u32 %r209, %r47, %r208;
        mov.f32 %f36, %f205;
        mov.f32 %f37, %f206;
        mov.f32 %f38, %f207;
        mov.u32 %r39, %r209;
        bra l_40_cond;

    l_40_done:
        // Loop (Sphere tracing) [out 0]
        mov.f32 %f211, %f36;
        // Loop (Sphere tracing) [out 1]
        mov.f32 %f212, %f37;
        // Loop (Sphere tracing) [out 2]
        mov.f32 %f213, %f38;
        mul.ftz.f32 %f214, %f211, %f211;
        fma.rn.ftz.f32 %f215, %f212, %f212, %f214;
        fma.rn.ftz.f32 %f216, %f213, %f213, %f215;
        sqrt.approx.ftz.f32 %f217, %f216;
        rcp.approx.ftz.f32 %f218, %f217;
        mov.b32 %f219, 0x3f000000;
        mul.ftz.f32 %f220, %f219, %f218;
        mul.ftz.f32 %f221, %f9, %f212;
        fma.rn.ftz.f32 %f222, %f9, %f212, %f221;
        mul.ftz.f32 %f223, %f9, %f213;
        fma.rn.ftz.f32 %f224, %f9, %f213, %f223;
        add.ftz.f32 %f225, %f222, %f224;
        setp.eq.f32 %p226, %f225, %f26;
        selp.f32 %f227, %f26, %f220, %p226;
        mul.ftz.f32 %f228, %f225, %f227;
        max.ftz.f32 %f229, %f26, %f228;
        sub.ftz.f32 %f230, %f217, %f28;
        fma.rn.ftz.f32 %f231, %f211, %f219, %f219;
        fma.rn.ftz.f32 %f232, %f212, %f219, %f219;
        fma.rn.ftz.f32 %f233, %f213, %f219, %f219;
        mov.pred %p234, 0x1;
        cvt.rn.f32.u32 %f235, %r6;
        cvt.rn.f32.u32 %f236, %r5;
        cvt.rn.f32.u32 %f237, %r4;
        mov.b32 %f238, 0xbf000000;
        fma.rn.ftz.f32 %f239, %f231, %f235, %f238;
        fma.rn.ftz.f32 %f240, %f232, %f236, %f238;
        fma.rn.ftz.f32 %f241, %f233, %f237, %f238;
        cvt.rmi.f32.f32 %f242, %f239;
        cvt.rzi.s32.f32 %r243, %f242;
        cvt.rmi.f32.f32 %f244, %f240;
        cvt.rzi.s32.f32 %r245, %f244;
        cvt.rmi.f32.f32 %f246, %f241;
        cvt.rzi.s32.f32 %r247, %f246;
        cvt.rn.f32.s32 %f248, %r243;
        cvt.rn.f32.s32 %f249, %r245;
        cvt.rn.f32.s32 %f250, %r247;
        sub.ftz.f32 %f251, %f239, %f248;
        sub.ftz.f32 %f252, %f240, %f249;
        sub.ftz.f32 %f253, %f241, %f250;
        rcp.approx.ftz.f32 %f254, %f235;
        rcp.approx.ftz.f32 %f255, %f236;
        rcp.approx.ftz.f32 %f256, %f237;
        mul.ftz.f32 %f257, %f251, %f251;
        mul.ftz.f32 %f258, %f257, %f251;
        mov.b32 %f259, 0x3e2aaaab;
        neg.ftz.f32 %f260, %f258;
        mov.b32 %f261, 0x40400000;
        mul.ftz.f32 %f262, %f261, %f257;
        add.ftz.f32 %f263, %f260, %f262;
        mul.ftz.f32 %f264, %f261, %f251;
        sub.ftz.f32 %f265, %f263, %f264;
        add.ftz.f32 %f266, %f265, %f28;
        mul.ftz.f32 %f267, %f266, %f259;
        mul.ftz.f32 %f268, %f261, %f258;
        mov.b32 %f269, 0x40c00000;
        mul.ftz.f32 %f270, %f269, %f257;
        sub.ftz.f32 %f271, %f268, %f270;
        mov.b32 %f272, 0x40800000;
        add.ftz.f32 %f273, %f271, %f272;
        mul.ftz.f32 %f274, %f273, %f259;
        mul.ftz.f32 %f275, %f258, %f259;
        add.ftz.f32 %f276, %f267, %f274;
        sub.ftz.f32 %f277, %f28, %f276;
        sub.ftz.f32 %f278, %f248, %f219;
        div.approx.ftz.f32 %f279, %f274, %f276;
        add.ftz.f32 %f280, %f278, %f279;
        mul.ftz.f32 %f281, %f280, %f254;
        mov.b32 %f282, 0x3fc00000;
        add.ftz.f32 %f283, %f248, %f282;
        div.approx.ftz.f32 %f284, %f275, %f277;
        add.ftz.f32 %f285, %f283, %f284;
        mul.ftz.f32 %f286, %f285, %f254;
        mul.ftz.f32 %f287, %f252, %f252;
        mul.ftz.f32 %f288, %f287, %f252;
        neg.ftz.f32 %f289, %f288;
        mul.ftz.f32 %f290, %f261, %f287;
        add.ftz.f32 %f291, %f289, %f290;
        mul.ftz.f32 %f292, %f261, %f252;
        sub.ftz.f32 %f293, %f291, %f292;
        add.ftz.f32 %f294, %f293, %f28;
        mul.ftz.f32 %f295, %f294, %f259;
        mul.ftz.f32 %f296, %f261, %f288;
        mul.ftz.f32 %f297, %f269, %f287;
        sub.ftz.f32 %f298, %f296, %f297;
        add.ftz.f32 %f299, %f298, %f272;
        mul.ftz.f32 %f300, %f299, %f259;
        mul.ftz.f32 %f301, %f288, %f259;
        add.ftz.f32 %f302, %f295, %f300;
        sub.ftz.f32 %f303, %f28, %f302;
        sub.ftz.f32 %f304, %f249, %f219;
        div.approx.ftz.f32 %f305, %f300, %f302;
        add.ftz.f32 %f306, %f304, %f305;
        mul.ftz.f32 %f307, %f306, %f255;
        add.ftz.f32 %f308, %f249, %f282;
        div.approx.ftz.f32 %f309, %f301, %f303;
        add.ftz.f32 %f310, %f308, %f309;
        mul.ftz.f32 %f311, %f310, %f255;
        mul.ftz.f32 %f312, %f253, %f253;
        mul.ftz.f32 %f313, %f312, %f253;
        neg.ftz.f32 %f314, %f313;
        mul.ftz.f32 %f315, %f261, %f312;
        add.ftz.f32 %f316, %f314, %f315;
        mul.ftz.f32 %f317, %f261, %f253;
        sub.ftz.f32 %f318, %f316, %f317;
        add.ftz.f32 %f319, %f318, %f28;
        mul.ftz.f32 %f320, %f319, %f259;
        mul.ftz.f32 %f321, %f261, %f313;
        mul.ftz.f32 %f322, %f269, %f312;
        sub.ftz.f32 %f323, %f321, %f322;
        add.ftz.f32 %f324, %f323, %f272;
        mul.ftz.f32 %f325, %f324, %f259;
        mul.ftz.f32 %f326, %f313, %f259;
        add.ftz.f32 %f327, %f320, %f325;
        sub.ftz.f32 %f328, %f28, %f327;
        sub.ftz.f32 %f329, %f250, %f219;
        div.approx.ftz.f32 %f330, %f325, %f327;
        add.ftz.f32 %f331, %f329, %f330;
        mul.ftz.f32 %f332, %f331, %f256;
        add.ftz.f32 %f333, %f250, %f282;
        div.approx.ftz.f32 %f334, %f326, %f328;
        add.ftz.f32 %f335, %f333, %f334;
        mul.ftz.f32 %f336, %f335, %f256;
        .reg.v4.f32 %u337;
        mov.v4.f32 %u337, { %f281, %f307, %f332, %f332 };
        .reg.v4.f32 %u338;
        @%p234 tex.3d.v4.f32.f32 %u338, [%rd7, %u337];
        @!%p234 mov.v4.f32 %u338, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f339, %u338.r;
        .reg.v4.f32 %u340;
        mov.v4.f32 %u340, { %f281, %f307, %f336, %f336 };
        .reg.v4.f32 %u341;
        @%p234 tex.3d.v4.f32.f32 %u341, [%rd7, %u340];
        @!%p234 mov.v4.f32 %u341, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f342, %u341.r;
        .reg.v4.f32 %u343;
        mov.v4.f32 %u343, { %f281, %f311, %f332, %f332 };
        .reg.v4.f32 %u344;
        @%p234 tex.3d.v4.f32.f32 %u344, [%rd7, %u343];
        @!%p234 mov.v4.f32 %u344, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f345, %u344.r;
        .reg.v4.f32 %u346;
        mov.v4.f32 %u346, { %f281, %f311, %f336, %f336 };
        .reg.v4.f32 %u347;
        @%p234 tex.3d.v4.f32.f32 %u347, [%rd7, %u346];
        @!%p234 mov.v4.f32 %u347, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f348, %u347.r;
        .reg.v4.f32 %u349;
        mov.v4.f32 %u349, { %f286, %f307, %f332, %f332 };
        .reg.v4.f32 %u350;
        @%p234 tex.3d.v4.f32.f32 %u350, [%rd7, %u349];
        @!%p234 mov.v4.f32 %u350, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f351, %u350.r;
        .reg.v4.f32 %u352;
        mov.v4.f32 %u352, { %f286, %f307, %f336, %f336 };
        .reg.v4.f32 %u353;
        @%p234 tex.3d.v4.f32.f32 %u353, [%rd7, %u352];
        @!%p234 mov.v4.f32 %u353, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f354, %u353.r;
        .reg.v4.f32 %u355;
        mov.v4.f32 %u355, { %f286, %f311, %f332, %f332 };
        .reg.v4.f32 %u356;
        @%p234 tex.3d.v4.f32.f32 %u356, [%rd7, %u355];
        @!%p234 mov.v4.f32 %u356, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f357, %u356.r;
        .reg.v4.f32 %u358;
        mov.v4.f32 %u358, { %f286, %f311, %f336, %f336 };
        .reg.v4.f32 %u359;
        @%p234 tex.3d.v4.f32.f32 %u359, [%rd7, %u358];
        @!%p234 mov.v4.f32 %u359, {0.0, 0.0, 0.0, 0.0};
        mov.f32 %f360, %u359.r;
        neg.ftz.f32 %f361, %f327;
        fma.rn.ftz.f32 %f362, %f342, %f361, %f342;
        fma.rn.ftz.f32 %f363, %f339, %f327, %f362;
        fma.rn.ftz.f32 %f364, %f348, %f361, %f348;
        fma.rn.ftz.f32 %f365, %f345, %f327, %f364;
        fma.rn.ftz.f32 %f366, %f354, %f361, %f354;
        fma.rn.ftz.f32 %f367, %f351, %f327, %f366;
        fma.rn.ftz.f32 %f368, %f360, %f361, %f360;
        fma.rn.ftz.f32 %f369, %f357, %f327, %f368;
        neg.ftz.f32 %f370, %f302;
        fma.rn.ftz.f32 %f371, %f365, %f370, %f365;
        fma.rn.ftz.f32 %f372, %f363, %f302, %f371;
        fma.rn.ftz.f32 %f373, %f369, %f370, %f369;
        fma.rn.ftz.f32 %f374, %f367, %f302, %f373;
        neg.ftz.f32 %f375, %f276;
        fma.rn.ftz.f32 %f376, %f374, %f375, %f374;
        fma.rn.ftz.f32 %f377, %f372, %f276, %f376;
        mov.b32 %f378, 0x3dcccccd;
        mul.ftz.f32 %f379, %f377, %f378;
        add.ftz.f32 %f380, %f230, %f379;
        setp.gt.f32 %p381, %f380, %f378;
        selp.f32 %f382, %f26, %f229, %p381;
        mov.b32 %f383, 0x3e4ccccd;
        mov.b32 %f384, 0x3ecccccd;
        mul.ftz.f32 %f385, %f384, %f382;
        mul.ftz.f32 %f386, %f383, %f382;
        add.ftz.f32 %f387, %f378, %f385;
        add.ftz.f32 %f388, %f383, %f386;
        mov.b32 %r389, 0x3;
        mul.lo.u32 %r390, %r10, %r389;
        add.u32 %r391, %r390, %r14;
        mov.b32 %r392, 0x2;
        add.u32 %r393, %r390, %r392;
        ld.param.u64 %rd394, [params+40];
        mad.wide.u32 %rd3, %r390, 4, %rd394;
        st.global.f32 [%rd3], %f387;
        mad.wide.u32 %rd3, %r391, 4, %rd394;
        st.global.f32 [%rd3], %f387;
        mad.wide.u32 %rd3, %r393, 4, %rd394;
        st.global.f32 [%rd3], %f388;
        add.u32 %r0, %r0, %r1;
        setp.ge.u32 %p0, %r0, %r2;
        @!%p0 bra body;

    done:
        ret;
    }

Features
--------

Many features weren't covered in this basic tutorial. Dr.Jit also

- supports polymorphic/virtual function calls, in which a program jumps to one
  of many locations. It can efficiently trace and differentiate such
  indirection.
- provides a library of transcendental functions (ordinary and hyperbolic trig
  functions, exponentials, logarithms, elliptic integrals, etc.).

- provides types for complex arithmetic, quaternions, and small
  (:math:`\le 4\times 4`) matrices.

- provides efficient code for evaluating spherical harmonics.