This simulator for cache systems that are part of High-Throughput Computing clusters was developed as part of my master's thesis. As is custom for such a project, the code is severely underdocumented and undertested. Some modules are buggy and did not make it into the evaluation. However, the most relevant concepts as well as the underlying assumptions are described in the thesis itself; I recommend reading (parts of) it before looking at the code here.
I have collected some open issues and ideas in the `TODO.md` file, which I would personally look at first if I were to continue working on the project.
Evaluation/analysis code lives in a separate repository (`simulator-analysis`).
Performs the workload generation phase of the simulator and writes an access sequence (trace) file.
Computes various extended statistics over an access sequence and writes this information to CSV files.
This command has very high memory usage when all output stats are enabled. A full index of the access sequence constitutes a large part of this: it requires `(4 + 2 * length(parts)) * 8` bytes of memory for each access. C_0 averages 2.55 parts per access, leading to almost 73 bytes of memory per access. However, because re-uses of files occur within a limited time interval (about 12 weeks for C_0), swapping to disk is feasible.
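As a back-of-the-envelope check (a minimal sketch, not code from the repository), the 73-byte figure follows directly from the formula above; `index_bytes_per_access` and `mean_parts_c0` are hypothetical names, with the latter set to the average measured for C_0:

```python
def index_bytes_per_access(num_parts: float) -> float:
    """Approximate index memory per access: (4 + 2 * length(parts)) * 8 bytes."""
    return (4 + 2 * num_parts) * 8

mean_parts_c0 = 2.55  # average number of parts per access in C_0
print(index_bytes_per_access(mean_parts_c0))  # 72.8 -> "almost 73 bytes"
```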
Performs the cache policy simulation phase of the simulator. It reads an access sequence from a file and simulates one or multiple cache processors according to a specification passed to the command.
Using `--seed` or eliminating randomness through parameter choices (e.g. setting `sigma = 0`) allows for reproducible runs. Because randomness only occurs while computing the workflow schedule, any two parameter sets with the same seed are comparable as long as the schedule-affecting parameters are left unchanged.
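The sketch below is not the simulator's actual scheduling code; it only illustrates the principle under the stated assumption that all randomness is confined to schedule generation. `make_schedule`, its parameters, and the Gaussian jitter are hypothetical stand-ins:

```python
import random

def make_schedule(seed: int, n_workflows: int, mean: float, sigma: float) -> list[float]:
    """Draw workflow submission times from a seeded RNG (illustrative only)."""
    rng = random.Random(seed)
    # With sigma = 0, gauss() degenerates to the mean, so the schedule is
    # deterministic even without fixing a seed.
    return [rng.gauss(mean, sigma) for _ in range(n_workflows)]

a = make_schedule(seed=42, n_workflows=5, mean=60.0, sigma=10.0)
b = make_schedule(seed=42, n_workflows=5, mean=60.0, sigma=10.0)
assert a == b  # same seed, same schedule parameters -> identical schedule
```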
Other aspects nevertheless differ between executions. Notably, the file names (or file keys) generated by the `DataSet` class use the memory address of the `DataSet` instance as part of the name, which differs on each execution. Insignificant variations in statistics have been observed as well, most likely due to follow-on effects of the different file names or due to non-deterministic behaviour of the Python interpreter.
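A minimal sketch of the naming issue (the real `DataSet` class is more involved; `address_name` and `stable_name` are hypothetical): embedding `id(self)` in a name ties it to the instance's memory address, which varies between interpreter runs, whereas a process-wide counter would be stable across runs:

```python
import itertools

class DataSet:
    """Toy stand-in for the simulator's DataSet class (illustrative only)."""
    _counter = itertools.count()

    def __init__(self) -> None:
        # Name derived from the memory address: differs on every execution.
        self.address_name = f"dataset-{id(self):x}"
        # Deterministic alternative: name derived from creation order.
        self.stable_name = f"dataset-{next(DataSet._counter)}"

ds = DataSet()
print(ds.address_name)  # e.g. "dataset-7f8e3c0a1d90" -- varies between runs
print(ds.stable_name)   # "dataset-0" -- identical across runs
```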