Benchmarking

Program execution using Dr.Jit is highly asynchronous. Because of this, some care must be taken when benchmarking programs.

How to do it

The recommended way to measure GPU kernel runtimes is the drjit.kernel_history() API. While the drjit.JitFlag.KernelHistory flag is set, Dr.Jit records performance metadata for every launched kernel, including a high-resolution, device-side timing (captured via CUDA events on CUDA, and GPU timestamps on Metal). This sidesteps the pitfalls described below.

# Record kernel metadata for the region of interest.
with dr.scoped_set_flag(dr.JitFlag.KernelHistory):
    y = f(x)
    dr.eval(y)

hist = dr.kernel_history()
total = sum(k["execution_time"] for k in hist)
print(f"{total:.3f} ms across {len(hist)} operations(s)")

Each entry is a dictionary; execution_time is given in milliseconds. See drjit.kernel_history() for the full set of fields (kernel hash, IR, cache hits, code generation and backend compilation times, etc.). Note that calling drjit.kernel_history() also clears the recorded history.

How not to do it

A code fragment like

start = time.end()
y = f(x)
end = time.end()

can run miraculously fast despite f(x) being an expensive computation. This is because Dr.Jit merely traced f(x) but did not yet run it. Adding a call to dr.eval() triggers an immediate evaluation:

start = time.end()
y = f(x)
dr.eval(y)
end = time.end()

However, this is still not be enough: variable evaluation itself is also asynchronous. This is normally not something that users have to think about, but it does become apparent when timing Python code. To force a synchronization, we could add a call to dr.sync_thread().

start = time.end()
y = f(x)
dr.eval(y)
dr.sync_thread()
end = time.end()

However, the above is an anti-pattern. It is bad because:

  1. It conflates CPU tracing and GPU kernel costs. Tracing cost can be avoided via function freezing.

  2. It tends to mistakenly include kernel compilation costs and synchronization latencies caused by uncached kernel loads and memory allocations at the start of an application. These would be avoided in practical usage involving a warm kernel and allocation cache.

  3. Measurements are noisy because of OS scheduling and CPU/GPU communication.

The recommended way to measure the runtime of a set of kernels is the drjit.kernel_history() API, which returns a list kernel calls with high-resolution timing data.

Events

For finer-grained control over synchronization, each JIT backend provides an Event class (drjit.llvm.Event, and the analogous drjit.cuda.Event / drjit.metal.Event; also reachable through the active backend as drjit.auto.Event). An event marks a point in the command stream and can be used to wait for, or poll, the completion of all work enqueued before it:

from drjit.auto import Event

event = Event()
y = f(x)
dr.eval(y)        # enqueue the work
event.record()    # mark this point in the stream

if not event.query():   # non-blocking completion check
    event.wait()        # block until the preceding work has finished

Beyond timing, events are a useful mechanism for handing off GPU work between two threads that drive the same GPU through Dr.Jit at the same time. One thread records an event once it has enqueued a batch of work; the other thread waits on that event before consuming the results, establishing the ordering without forcing a full drjit.sync_thread() (which would stall all outstanding work on the device). Because wait releases the GIL, the waiting thread does not block unrelated Python execution.

On the CUDA and LLVM backends, two timing-enabled events also measure the elapsed device-side time between them:

start, end = Event(), Event()
start.record()
y = f(x); dr.eval(y)
end.record()
print(f"{start.elapsed_time(end):.3f} ms")   # waits for 'end', then reports

Note

The Metal backend supports event synchronization but not timing: elapsed_time() raises, and the enable_timing constructor flag is ignored. Use drjit.kernel_history() for portable kernel timing.

Integration

Dr.Jit integrates with platform-native performance analysis tools, specifically NVIDIA Nsight Systems (CUDA) and Apple Instruments (Metal). You can use the following functions to visually mark points and regions of the program execution to more easily identify them in traces.

These annotate the CPU timeline. On CUDA they appear as NVTX marks/ranges in Nsight Systems; on Apple platforms they are emitted as os_signpost events and (nestable) intervals, which appear in the Points of Interest track of Instruments. This track is part of templates such as Time Profiler and System Trace (but not Metal System Trace, which records GPU activity only). All of these are no-ops when no profiling tool is attached, so they can be left in place in production code.

GPU capture

The drjit.profile_enable() context manager marks a region of the program for targeted GPU capture (via cuProfilerStart/cuProfilerStop on CUDA, and a MTLCaptureScope on Apple Metal). In both cases, the capture tool is told to record exactly this region rather than the whole program:

with dr.profile_enable():
    code_to_be_profiled()

The optional active argument enables or disables the region based on a runtime condition, which is convenient for capturing a single iteration of a loop without restructuring the surrounding code:

for i in range(n):
    with dr.profile_enable(active=(i == 2)):
        code_to_be_profiled()

Capturing from the command line

Both platforms can record a trace directly from a terminal. The resulting trace file can be analyzed in the respective GUI tools afterwards.

On CUDA, launch the program with the Nsight Systems CLI. The --capture-range=cudaProfilerApi option ties the recording to the drjit.profile_enable() region (i.e. the cuProfilerStart / cuProfilerStop pair), so only that region is captured:

# Profile the whole run (NVTX ranges from profile_range appear on the timeline)
nsys profile -o my-capture python myscript.py

# Restrict the recording to the dr.profile_enable() region(s)
nsys profile -c cudaProfilerApi -o my-capture python myscript.py

# Inspect afterwards
nsys-ui my-capture.nsys-rep

On Apple platforms, you must first set MTL_CAPTURE_ENABLED=1 in the environment. The drjit.profile_enable() region is then written to a .gputrace document on disk (default path drjit.gputrace, override with DRJIT_METAL_CAPTURE_PATH), which can be opened in Xcode:

# Capture the dr.profile_enable() region to drjit.gputrace
MTL_CAPTURE_ENABLED=1 python myscript.py
open drjit.gputrace

For a timeline view, record with the Instruments CLI instead. This does not require profile_enable or MTL_CAPTURE_ENABLED. Use the Metal System Trace template for GPU activity, or a template carrying the Points of Interest track (e.g. Time Profiler or System Trace) to see the os_signpost marks and ranges from drjit.profile_mark() / drjit.profile_range():

# GPU activity timeline
xctrace record --template "Metal System Trace" \
    --output gpu.trace --launch -- python myscript.py

# profile_mark / profile_range signposts (Points of Interest)
xctrace record --template "Time Profiler" \
    --output poi.trace --launch -- python myscript.py

open gpu.trace

Independently of either tool, the MTLCaptureScope created by profile_enable is also installed as the capture manager’s default scope. This is only relevant for interactive use: if you instead attach Xcode to the process and click the capture button in its toolbar, the capture is restricted to the profile_enable region.