This repository contains utilities to construct large LLVM IR datasets from multiple sources.
To get started with the dataset construction utilities, we suggest using the packaged pipenv or poetry setup to isolate the Python environment from your system installation and from other environments.
To get started with pipenv, run
pipenv install
or, if you want to use the packaged lockfile,
pipenv sync
After that, you are ready to activate the environment and install the dataset construction utilities into it:
pipenv shell && pip install .
If you want to develop the package, install it in editable mode instead:
pipenv shell && pip install -e .
To get started with poetry, run
poetry install
which takes the exact software versions from the packaged lockfile and installs an editable version of the dataset construction utilities into the environment. To install only the dependencies, you can run
poetry install --no-root
To develop inside poetry's virtual environment, launch a shell with
poetry shell
To create your first small batch of IR data, run the following from the root directory of the package:
python3 ./llvm_ir_dataset_utils/tools/corpus_from_description.py \
--source_dir=/path/to/store/dataset/to/source \
--corpus_dir=/path/to/store/dataset/to/corpus \
--build_dir=/path/to/store/dataset/to/build \
--corpus_description=./corpus_descriptions_test/manual_tree.json
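When scripting several runs, the same invocation can be assembled from Python. The following is a minimal sketch and not part of the package: the helper name build_corpus_command and the layout of one sub-directory per flag under a common base directory are assumptions made for illustration; only the script path and the four flags are taken from the command above. Creating the directories up front is also an assumption and may not be required by the tool.

```python
import sys
from pathlib import Path


def build_corpus_command(base_dir, corpus_description):
    """Return the argv list for corpus_from_description.py (hypothetical helper)."""
    base = Path(base_dir)
    # Assumed layout: <base>/source, <base>/corpus and <base>/build.
    dirs = {name: base / name for name in ("source", "corpus", "build")}
    for path in dirs.values():
        # Assumption: create the directories ahead of time.
        path.mkdir(parents=True, exist_ok=True)
    return [
        sys.executable,
        "./llvm_ir_dataset_utils/tools/corpus_from_description.py",
        f"--source_dir={dirs['source']}",
        f"--corpus_dir={dirs['corpus']}",
        f"--build_dir={dirs['build']}",
        f"--corpus_description={corpus_description}",
    ]
```

From the root directory of the package, the returned list can be passed to subprocess.run(..., check=True) inside the activated environment.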
Beware! You'll need to have a version of llvm-objcopy on your $PATH. If you are missing llvm-objcopy, an easy way to obtain it is to install an LLVM release from your preferred package channel, such as apt, dnf or pacman, or to build LLVM from source, where only the LLVM project itself needs to be enabled during the build, i.e. -DLLVM_ENABLE_PROJECTS="llvm".
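To check this requirement before starting a long build, a short pre-flight test can help. This is a minimal sketch using only the Python standard library; the function name find_tool is hypothetical and not part of the package.

```python
import shutil


def find_tool(name="llvm-objcopy"):
    """Return the absolute path of `name` on $PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("llvm-objcopy not found on $PATH; install LLVM or build it from source")
    else:
        print(f"using llvm-objcopy at {path}")
```

The same helper can be pointed at other LLVM tools, for example find_tool("llvm-dis").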
You'll then find a set of .bc files in /path/to/store/dataset/to/corpus/tree, which you can convert into textual LLVM IR with llvm-dis, e.g. from inside that folder:
llvm-dis *.bc
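The conversion can also be driven from Python, which is convenient when it should run right after corpus construction. The sketch below is an illustration under stated assumptions, not the package's own tooling: the function names llvm_dis_commands and disassemble are hypothetical, and llvm-dis is invoked once per file so that each .bc file gets a .ll file written next to it.

```python
import shutil
import subprocess
from pathlib import Path


def llvm_dis_commands(corpus_dir):
    """Build one llvm-dis command per .bc file found directly in corpus_dir."""
    return [["llvm-dis", str(bc)] for bc in sorted(Path(corpus_dir).glob("*.bc"))]


def disassemble(corpus_dir):
    """Run llvm-dis on every .bc file and return the number of files converted."""
    if shutil.which("llvm-dis") is None:
        raise FileNotFoundError("llvm-dis not found on $PATH")
    commands = llvm_dis_commands(corpus_dir)
    for cmd in commands:
        # check=True raises CalledProcessError if llvm-dis rejects a file.
        subprocess.run(cmd, check=True)
    return len(commands)
```

For the example above, disassemble("/path/to/store/dataset/to/corpus/tree") would process the bitcode files produced by the corpus build.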
Last steps into the dataloader to be described here.
Basics of the corpus description to be outlined here to easily enable someone to point the package at a new source.
The package contains a number of builders that target LLVM-based languages and build systems to extract IR:
- Individual projects (C/C++)
- Rust crates
- Spack packages
- Autoconf
- CMake
- Julia packages
- Swift packages