Lightweight profiling utilities for identifying bottlenecks and timing program parts in your python application.
Also supports async profiling for cuda.
- new install:
pip install kappaprofiler
- uprade to new version:
pip install kappaprofiler --upgrade
import kappaprofiler as kp
import time
@kp.profile
def main():
time.sleep(0.3) # simulate some operation
some_method()
@kp.profile
def some_method():
time.sleep(0.5) # simulate some operation
if __name__ == "__main__":
main()
print(kp.profiler.to_string())
The result will be (time.sleep calls are not 100% accurate)
0.82 main
0.51 main.some_method
import kappaprofiler as kp
import time
def main():
with kp.named_profile("main"):
time.sleep(0.3) # simulate some operation
with kp.named_profile("method"):
some_method()
with kp.named_profile("main2"):
time.sleep(0.2) # simulate some operation
def some_method():
time.sleep(0.5) # simulate some operation
if __name__ == "__main__":
main()
print(kp.profiler.to_string())
The result will be (time.sleep calls are not 100% accurate)
0.82 main
0.51 main.method
0.20 main2
Each profiling entry is represented by a node from which detailed information can be retrieved
query = "main.some_method"
node = kp.profiler.get_node(query)
print(f"{query} was called {node.count} time and took {node.to_string()} seconds in total")
main.some_method was called 1 time and took 0.51 seconds in total
import kappaprofiler as kp
with kp.Stopwatch() as sw:
# some operation
...
print(f"operation took {sw.elapsed_milliseconds} milliseconds")
print(f"operation took {sw.elapsed_seconds} seconds")
import kappaprofiler as kp
import time
sw1 = kp.Stopwatch()
sw2 = kp.Stopwatch()
for i in range(1, 3):
with sw1:
# operation1
time.sleep(0.1 * i)
with sw2:
# operation2
time.sleep(0.2 * i)
print(f"operation1 took {sw1.elapsed_seconds:.2f} seconds (average {sw1.average_lap_time:.2f})")
print(f"operation2 took {sw2.elapsed_seconds:.2f} seconds (average {sw2.average_lap_time:.2f})")
operation1 took 0.32 seconds (average 0.16)
operation2 took 0.61 seconds (average 0.30)
Showcase: timing cuda operations in pytorch
Asynchronous operations can only be timed properly when the asynchronous call is awaited or a synchronization point is created after the timing should end. Natively in pytorch this would look something like this:
# submit a start event to the event stream
start_event = torch.cuda.Event(enable_timing=True)
start_event.record()
# submit a async operation to the event stream
...
# submit a end event to the event stream
end_event = torch.cuda.Event(enable_timing=True)
end_event.record()
# synchronize
torch.cuda.synchronize()
print(start_event.elapsed_time(end_event))
which is quite a lot of boilerplate for timing one operation.
With kappaprofiler it looks like this:
import kappaprofiler as kp
import torch
def main():
device = torch.device("cuda")
x = torch.randn(15000, 15000, device=device)
with kp.named_profile("matmul_wrong"):
# matrix multiplication (@) is asynchronous
_ = x @ x
# the timing for "matmul_wrong" is only the time it took to
# submit the x @ x operation to the cuda event stream
# not the actual time the x @ x operation took
with kp.named_profile_async("matmul_right"):
_ = x @ x
matmul_method(x)
@kp.profile_async
def matmul_method(x):
_ = x @ x
def start_async():
start_event = torch.cuda.Event(enable_timing=True)
start_event.record()
return start_event
def end_async(start_event):
end_event = torch.cuda.Event(enable_timing=True)
end_event.record()
torch.cuda.synchronize()
# torch.cuda.Event.elapsed_time returns milliseconds but kappaprofiler expects seconds
return start_event.elapsed_time(end_event) / 1000
if __name__ == "__main__":
kp.setup_async(start_async, end_async)
main()
print(kp.profiler.to_string())
0.56 matmul_wrong
4.69 matmul_right
4.72 matmul_method
NOTE: Synchronization points slow down overall program execution, so they should only be used for investigating bottlenecks/runtimes
To remove all synchronization points in your program either:
- remove the
kp.setup_async
call ->kp.named_profile_async
/kp.profile_async
will default to a noop (NOTE: this removes the node completely, so it's also not possible to query it) - replace the
kp.setup_async
call withkp.setup_async_as_sync
to make the asynchronous calls behave just like the synchronous calls. This will make the async times wrong (likematmul_wrong
above) but still creates a node for the operation (e.g. for querying how often it was called).
Only synchronizing cuda operations is not sufficient when multiple processes are used (e.g. for multi-gpu training). In addition to cuda synchronization, the processes have to be synced up.
import torch.distributed as dist
def end_async(start_event):
if dist.is_available() and dist.is_initialized():
torch.cuda.synchronize()
dist.barrier()
end_event = torch.cuda.Event(enable_timing=True)
end_event.record()
torch.cuda.synchronize()
# torch.cuda.Event.elapsed_time returns milliseconds but kappaprofiler expects seconds
return start_event.elapsed_time(end_event) / 1000