All changes we make to the assignment code or PDF will be documented in this file.
- handout: clarify that `ddp_bucketed_benchmarking` doesn't require the full grid of runs.
- code: remove try-finally blocks in DDP tests.
- handout: remove outdated mention of a problem that doesn't exist in the assignment.
- handout: fix Slurm environment variables in examples.
- handout: clarify assumptions in `ddp_bucketed_benchmarking` (b).
- code: remove `humanfriendly` from requirements.txt, add `matplotlib`.
- handout: modify problem `distributed_communication_multi_node` to specify that multi-node measurements should be 2x1, 2x2, and 2x3.
- handout: clarify that `torch.cuda.synchronize()` is necessary for timing collective communication ops, even when they are called with `async_op=False` (see the timing sketch after this list).
- handout: fixed cut-off text in problem `memory_profiling` (a)
- handout: fixed mismatch between Slurm config and description text in section 3.2
- code: fix `ToyModelWithTiedWeights` to actually tie weights (see the weight-tying sketch after this list).
- handout: fix typo in bucketed DDP test command; it should be `pytest tests/test_ddp.py`.
- handout: fix deliverable of `ddp_overlap_individual_parameters_benchmarking` (a) to not ask for communication time, only end-to-end step time.
- handout: clarify analysis in `optimizer_state_sharding_accounting` (a).
- handout: added a short question about variability to problem `benchmarking_script`
- handout: fixed typo in problem `triton_rmsnorm_forward`. The adapters should return the classes, not the `.apply` attribute (see the adapter sketch after this list).
- code: added `-e` flag to the install command: `pip install -e ./cs336-systems/'[test]'`
- handout: clarified recommendation about the `timeit` module
- handout: clarified question about the kernel with the highest CUDA total
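The synchronization note above deserves a concrete illustration. Below is a minimal sketch of the timing pattern, assuming a process group has already been set up with `torch.distributed.init_process_group` and the process is bound to its CUDA device; the helper `time_all_reduce` and its defaults are illustrative, not part of the assignment code.

```python
import timeit

import torch
import torch.distributed as dist


def time_all_reduce(tensor: torch.Tensor, iters: int = 10) -> float:
    """Return average seconds per all-reduce (hypothetical helper)."""
    # Warm up so one-time initialization costs don't skew the measurement.
    for _ in range(3):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = timeit.default_timer()
    for _ in range(iters):
        # With the NCCL backend, async_op=False returns once the collective
        # is enqueued on the CUDA stream, not once it finishes on the GPU,
        # so a synchronize is still required before reading the clock.
        dist.all_reduce(tensor, async_op=False)
    torch.cuda.synchronize()
    return (timeit.default_timer() - start) / iters
```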
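For the `ToyModelWithTiedWeights` fix, here is a minimal sketch of how weight tying is normally done in PyTorch; `TiedToyModel` and its dimensions are illustrative stand-ins for the assignment's actual model.

```python
import torch
import torch.nn as nn


class TiedToyModel(nn.Module):
    """Illustrative model whose two linear layers share one weight matrix."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim, bias=False)
        self.fc2 = nn.Linear(dim, dim, bias=False)
        # Tie by assigning the same nn.Parameter object to both layers.
        # Merely copying values would leave two independent parameters
        # that drift apart during training.
        self.fc2.weight = self.fc1.weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))
```

Because the parameter is shared, it appears only once in `model.parameters()`.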
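Finally, to illustrate the `triton_rmsnorm_forward` adapter fix: a `torch.autograd.Function` is invoked through its `apply` attribute, but the adapter should return the class itself. The names below are hypothetical, not the assignment's identifiers.

```python
import torch


class RMSNormFunc(torch.autograd.Function):
    """Illustrative autograd.Function wrapping an RMSNorm forward pass."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(x, weight)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-5)
        return x / rms * weight

    @staticmethod
    def backward(ctx, grad_out):
        raise NotImplementedError  # elided; only the forward pass matters here


def get_rmsnorm_autograd_function():
    # Return the class, so callers can write RMSNormFunc.apply(x, weight);
    # returning RMSNormFunc.apply instead was the typo the handout fixed.
    return RMSNormFunc
```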
Initial release.