TutteInstitute/acme3-mapping

Vector space embeddings and data maps for cyber defense

This repository contains notebooks for experimenting with the computation of vector space embeddings and data maps of telemetry, with the aim of supporting cyber defense processes. This code was used to generate the results and images shown in the eponymous SciPy 2024 talk (slides). It is provided here so that curious folks can try these experiments themselves, and perhaps even apply them to their own data.

Setup

I use Conda to put together the requisite computing environment. Create it from the included environment file by running:

conda env create

If you intend to run the notebooks from the JupyterLab instance included in that environment, you are good to go. If instead you work from a JupyterHub server, you may need to install the environment as a bespoke kernel:

conda activate acme3-mapping
python -m ipykernel install --user --name acme3-mapping --display-name "ACME3 data maps"

It takes a few seconds for ACME3 data maps to show up as a kernel option you can select when starting a new notebook. From there, the first time you open any notebook from this repository, change its kernel to ACME3 data maps.

Notebook index

It is highly recommended to run the notebooks in numerical order, as any notebook may rely on results computed in a previous one.

  1. Gather and engineer dataset: this notebook guides you in downloading and filtering the stdview summary of the ACME3 dataset we embed and map.
  2. Command lines - Bags of words: we compute a simple bag-of-words embedding of command lines attached to processes, and produce an interactive map of the result.
  3. Command lines - Wasserstein embedding: we introduce the more sophisticated Wasserstein vector space embedding method, and apply it to command lines, resulting in an improved data map.
  4. Processes as bags of code images: this takes a different lens on process instances, examining them through the patterns of similarity induced by loading similar sets of code images.
  5. Hosts as bags of processes over time: we pivot from comparative process analysis, and use the process representation to construct and map an embedding of the hosts running these processes, with a temporal component.
  6. Comparing data maps: this presents a methodology for visually appraising the differences between data maps induced by different embedding methods.
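
To give a rough idea of what a bag-of-words embedding of command lines (notebook 2) looks like, here is a minimal standard-library sketch. It is not the code from the notebooks: the tokenization rule, the function names (`tokenize`, `bag_of_words`, `cosine_similarity`) and the three example command lines are all assumptions made up for illustration, and none of them come from the ACME3 dataset.

```python
import math
import re
from collections import Counter


def tokenize(command_line):
    """Lowercase a command line and split it on whitespace and a few
    separators (backslash, slash, '=', ',', ';', ':').
    This splitting rule is an illustrative assumption."""
    return [t for t in re.split(r"[\s\\/=,;:]+", command_line.lower()) if t]


def bag_of_words(command_lines):
    """Return (vocabulary, vectors), where each vector holds the token
    counts of one command line, indexed by the sorted vocabulary."""
    vocabulary = sorted({tok for cl in command_lines for tok in tokenize(cl)})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vectors = []
    for cl in command_lines:
        vec = [0] * len(vocabulary)
        for tok, count in Counter(tokenize(cl)).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical command lines, invented for this example.
examples = [
    "powershell.exe -NoProfile -Command Get-Process",
    "powershell.exe -NoProfile -Command Get-Service",
    "C:\\Windows\\System32\\svchost.exe -k netsvcs",
]

vocabulary, vectors = bag_of_words(examples)
print(cosine_similarity(vectors[0], vectors[1]))  # the two PowerShell commands share 3 of 4 tokens
print(cosine_similarity(vectors[0], vectors[2]))  # no tokens in common
```

Command lines that share many tokens end up close together in this vector space, which is the kind of structure a data map then lays out in two dimensions. The notebooks themselves work at a much larger scale and may tokenize and weight tokens differently.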

Issues and comments

You may discuss all this publicly by opening an issue. We reserve the right to ask you to submit a PR in support of your arguments. ;-)
