This repository contains notebooks for experimenting with the computation of vector space embeddings and data maps of telemetry, with the aim of supporting cyber defense processes. This code was used to generate the results and images shown in the eponymous SciPy 2024 talk (slides). It is provided here for curious folks to try these experiments themselves, and perhaps even apply them to their own data.
I use Conda to put together the requisite computing environment. Create it from the included environment file by running:
conda env create
If you intend to run this out of the included JupyterLab instance, you are good to go. If your workstation is instead a JupyterHub server, you may need to install the environment as a bespoke kernel:
conda activate acme3-mapping
python -m ipykernel install --user --name acme3-mapping --display-name "ACME3 data maps"
It takes a few seconds for ACME3 data maps to show up as a kernel option you can select when starting a new notebook. From there, when you open any notebook from this repository for the first time, change its kernel to ACME3 data maps.
It is highly recommended to run the notebooks in numerical order, as each one may rely on results computed in an earlier one.
- Gather and engineer dataset: this notebook guides you in downloading and filtering the stdview summary of the ACME3 dataset we embed and map.
- Command lines - Bags of words: we compute a simple bag-of-words embedding of command lines attached to processes, and produce an interactive map of the result.
- Command lines - Wasserstein embedding: we introduce the more sophisticated Wasserstein vector space embedding method, and apply it to command lines, resulting in an improved data map.
- Processes as bags of code images: this takes a different lens on process instances, examining them through patterns of similarity induced by loading similar sets of code images.
- Hosts as bags of processes over time: we pivot from comparative process analysis, and use the process representation to construct and map an embedding of hosts running these processes, with a temporal component.
- Comparing data maps: this presents a methodology for visually appraising the differences between data maps induced by embedding methods.
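To give a feel for the bag-of-words idea behind the second notebook before opening it, here is a minimal, standard-library-only sketch. It is not the code used in the notebooks: the sample command lines are made up, and the `tokenize`, `bag_of_words`, and `cosine_similarity` helpers are hypothetical names chosen for this illustration. The notebooks may well tokenize and weight terms differently.

```python
import math
import re
from collections import Counter

# Hypothetical command lines, made up for illustration (not taken from ACME3).
command_lines = [
    r"C:\Windows\System32\cmd.exe /c whoami",
    r"C:\Windows\System32\cmd.exe /c ipconfig /all",
    r"powershell.exe -NoProfile -Command Get-Process",
]


def tokenize(command_line):
    """Lowercase a command line and split it on whitespace and path separators."""
    return [t for t in re.split(r"[\s\\/]+", command_line.lower()) if t]


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocabulary = sorted({token for doc in documents for token in doc})
    index = {token: i for i, token in enumerate(vocabulary)}
    vectors = []
    for doc in documents:
        vector = [0] * len(vocabulary)
        for token, count in Counter(doc).items():
            vector[index[token]] = count
        vectors.append(vector)
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocabulary, vectors = bag_of_words([tokenize(c) for c in command_lines])
sim_cmd_cmd = cosine_similarity(vectors[0], vectors[1])
sim_cmd_ps = cosine_similarity(vectors[0], vectors[2])
print(f"cmd vs cmd: {sim_cmd_cmd:.2f}, cmd vs powershell: {sim_cmd_ps:.2f}")
```

The two `cmd.exe` invocations share most of their tokens and so land close together in the vector space, while the PowerShell command line shares none with them. A data map is, roughly, a 2D layout that tries to preserve this kind of proximity.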
You may discuss all this publicly by opening an issue. We reserve the right to ask you to submit a PR in support of your arguments. ;-)