Benchmarking¶
Program execution using Dr.Jit is highly asynchronous. Because of this, some care must be taken when benchmarking programs.
How to do it¶
The recommended way to measure GPU kernel runtimes is the
drjit.kernel_history() API. While the drjit.JitFlag.KernelHistory
flag is set, Dr.Jit records performance metadata for every launched kernel,
including a high-resolution, device-side timing (captured via CUDA events on
CUDA, and GPU timestamps on Metal). This sidesteps the pitfalls described below.
# Record kernel metadata for the region of interest.
with dr.scoped_set_flag(dr.JitFlag.KernelHistory):
y = f(x)
dr.eval(y)
hist = dr.kernel_history()
total = sum(k["execution_time"] for k in hist)
print(f"{total:.3f} ms across {len(hist)} operations(s)")
Each entry is a dictionary; execution_time is given in milliseconds. See
drjit.kernel_history() for the full set of fields (kernel hash, IR,
cache hits, code generation and backend compilation times, etc.). Note that
calling drjit.kernel_history() also clears the recorded history.
How not to do it¶
A code fragment like
start = time.end()
y = f(x)
end = time.end()
can run miraculously fast despite f(x) being an expensive computation. This
is because Dr.Jit merely traced f(x) but did not yet run it. Adding a call
to dr.eval() triggers an immediate evaluation:
start = time.end()
y = f(x)
dr.eval(y)
end = time.end()
However, this is still not be enough: variable evaluation itself is also
asynchronous. This is normally not something that users have to think about,
but it does become apparent when timing Python code. To force a synchronization,
we could add a call to dr.sync_thread().
start = time.end()
y = f(x)
dr.eval(y)
dr.sync_thread()
end = time.end()
However, the above is an anti-pattern. It is bad because:
It conflates CPU tracing and GPU kernel costs. Tracing cost can be avoided via function freezing.
It tends to mistakenly include kernel compilation costs and synchronization latencies caused by uncached kernel loads and memory allocations at the start of an application. These would be avoided in practical usage involving a warm kernel and allocation cache.
Measurements are noisy because of OS scheduling and CPU/GPU communication.
The recommended way to measure the runtime of a set of kernels is the
drjit.kernel_history() API, which returns a list kernel calls with
high-resolution timing data.
Events¶
For finer-grained control over synchronization, each JIT backend provides an
Event class (drjit.llvm.Event, and the analogous
drjit.cuda.Event / drjit.metal.Event; also reachable through the active
backend as drjit.auto.Event). An event marks a point in the command stream
and can be used to wait for, or poll, the completion of all work enqueued before
it:
from drjit.auto import Event
event = Event()
y = f(x)
dr.eval(y) # enqueue the work
event.record() # mark this point in the stream
if not event.query(): # non-blocking completion check
event.wait() # block until the preceding work has finished
Beyond timing, events are a useful mechanism for handing off GPU work between
two threads that drive the same GPU through Dr.Jit at the same time. One thread
records an event once it has enqueued a batch of work; the other thread waits on
that event before consuming the results, establishing the ordering without
forcing a full drjit.sync_thread() (which would stall all outstanding
work on the device). Because wait releases the
GIL, the waiting thread does not block unrelated Python execution.
On the CUDA and LLVM backends, two timing-enabled events also measure the elapsed device-side time between them:
start, end = Event(), Event()
start.record()
y = f(x); dr.eval(y)
end.record()
print(f"{start.elapsed_time(end):.3f} ms") # waits for 'end', then reports
Note
The Metal backend supports event synchronization but not timing:
elapsed_time() raises, and the enable_timing constructor flag is
ignored. Use drjit.kernel_history() for portable kernel timing.
Integration¶
Dr.Jit integrates with platform-native performance analysis tools, specifically NVIDIA Nsight Systems (CUDA) and Apple Instruments (Metal). You can use the following functions to visually mark points and regions of the program execution to more easily identify them in traces.
These annotate the CPU timeline. On CUDA they appear as NVTX marks/ranges in Nsight Systems; on Apple
platforms they are emitted as os_signpost events and (nestable) intervals,
which appear in the Points of Interest track of Instruments. This track is
part of templates such as Time Profiler and System Trace (but not Metal
System Trace, which records GPU activity only). All of these are no-ops when no
profiling tool is attached, so they can be left in place in production code.
GPU capture¶
The drjit.profile_enable() context manager marks a region of the
program for targeted GPU capture (via cuProfilerStart/cuProfilerStop
on CUDA, and a MTLCaptureScope on Apple Metal). In both cases, the capture
tool is told to record exactly this region rather than the whole program:
with dr.profile_enable():
code_to_be_profiled()
The optional active argument enables or disables the region based on a
runtime condition, which is convenient for capturing a single iteration of a
loop without restructuring the surrounding code:
for i in range(n):
with dr.profile_enable(active=(i == 2)):
code_to_be_profiled()
Capturing from the command line¶
Both platforms can record a trace directly from a terminal. The resulting trace file can be analyzed in the respective GUI tools afterwards.
On CUDA, launch the program with the Nsight Systems CLI. The
--capture-range=cudaProfilerApi option ties the recording to the
drjit.profile_enable() region (i.e. the cuProfilerStart /
cuProfilerStop pair), so only that region is captured:
# Profile the whole run (NVTX ranges from profile_range appear on the timeline)
nsys profile -o my-capture python myscript.py
# Restrict the recording to the dr.profile_enable() region(s)
nsys profile -c cudaProfilerApi -o my-capture python myscript.py
# Inspect afterwards
nsys-ui my-capture.nsys-rep
On Apple platforms, you must first set MTL_CAPTURE_ENABLED=1 in the
environment. The drjit.profile_enable() region is then written to a
.gputrace document on disk (default path drjit.gputrace, override with
DRJIT_METAL_CAPTURE_PATH), which can be opened in Xcode:
# Capture the dr.profile_enable() region to drjit.gputrace
MTL_CAPTURE_ENABLED=1 python myscript.py
open drjit.gputrace
For a timeline view, record with the Instruments CLI instead. This does not
require profile_enable or MTL_CAPTURE_ENABLED. Use the Metal System
Trace template for GPU activity, or a template carrying the Points of Interest
track (e.g. Time Profiler or System Trace) to see the os_signpost marks
and ranges from drjit.profile_mark() / drjit.profile_range():
# GPU activity timeline
xctrace record --template "Metal System Trace" \
--output gpu.trace --launch -- python myscript.py
# profile_mark / profile_range signposts (Points of Interest)
xctrace record --template "Time Profiler" \
--output poi.trace --launch -- python myscript.py
open gpu.trace
Independently of either tool, the MTLCaptureScope created by
profile_enable is also installed as the capture manager’s default scope.
This is only relevant for interactive use: if you instead attach Xcode to the
process and click the capture button in its toolbar, the capture is restricted
to the profile_enable region.