-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathpresentation_notes_old.txt
39 lines (29 loc) · 2.35 KB
/
presentation_notes_old.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
dk:
bezier curves
linear interpolation
motivation
ks
cuda
one curve does not stress hardware that much, can easily render whole canvas pixel by pixel
few cuda kernel launches
certain operations can be slow do to unoptimized memory/cache access patterns
description:
naive
naive w/ cuda - massively increase performance, slightly increased if lerps are done on cpu
These three basically do minimal cpu processing, sending a giant tensor to cuda, launches a few kernels
and return
shrunk is the opposite end of the spectrum, each point gets sent separately and the gradient is calculated in a square around each point. This destroys performance since so much time is spent launching CUDA kernels for relatively cheap operations.
in order to increase performance, we can simply send less pixels, resulting in fewer FLOPs
to start, we find the minimum and maximum of the curve to draw a bounding box, only send that to the gpu, and insert it back in.
this actually decreases performance because, again, the GPU is more than capable of rendering the whole thing, and the additional indexing and processing can be expensive.
also tried drawing a tight bounding box by rotating the curve so the points align.
the multiplies and translations required hurt more than help
HalfTensors (float16) further increases performance, broadcasting replacing expand() also increases performace
expand tends to have very poor at certain tensor sizes due to cache access patterns
in order to furthur improve resource usage, we introduced the 'tiled' method
the full area is broken into N*N square tiles
each of these can be evaluated asynchronously using CUDA streams (the V100 has 80 effective streaming multiprocessors, and we can enqueue our elementwise multiplies on this to execute in order but overlapped
launch each kernel, execute, wait for sync
we also do an initial pass to collect the points that (together with its normal dist) cross the boundary of the tile
this way, we can ignore the tiles that do not contain any points
and each operation in the tile results in a tile_size * tile_size * points_in_tile operation instead of the original res * res * total_points