The goal of this project is to evaluate a code2vec-based approach to authorship identification and to explore and address issues in existing datasets for source code authorship attribution.
- De-anonymizing Programmers via Code Stylometry | source code
- Source Code Authorship Attribution using LSTM Based Networks | source code
- Authorship attribution of source code by using back propagation neural network based on particle swarm optimization | source code not available
- Google Code Jam submissions, C/C++/Python
- 40 authors, Java
- Projects mined from GitHub with a new data collection approach
The Java, C++, and Python datasets are also available here.
The data extraction pipeline consists of two modules: Gitminer, written in Python, and Pathminer, written in Kotlin.
- Gitminer processes the history of a Git repository to extract all blobs containing Java code (see the sketch after this list).
- Pathminer uses GumTree to parse Java code and track method changes through the repository's history.
- To extract data from GitHub projects, store the names and links of the GitHub projects in `projects` and `git_projects`, respectively. Then go to the `runner` directory and run `run.py`.
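For intuition, here is a minimal sketch of the blob-extraction step that Gitminer performs. This is not the actual Gitminer implementation: it shells out to plain `git`, and the repository path and output handling are assumptions.

```python
import subprocess

def java_blobs(repo_path):
    """Yield (blob_hash, path) pairs for every Java blob reachable
    from any commit in the repository's history."""
    # Enumerate every commit in the repository.
    commits = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    seen = set()
    for commit in commits:
        # List all blobs in this commit's tree (recursively).
        tree = subprocess.run(
            ["git", "-C", repo_path, "ls-tree", "-r", commit],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in tree.splitlines():
            # Each line looks like: "<mode> blob <hash>\t<path>"
            meta, path = line.split("\t", 1)
            blob_hash = meta.split()[2]
            if path.endswith(".java") and blob_hash not in seen:
                seen.add(blob_hash)
                yield blob_hash, path

if __name__ == "__main__":
    for blob_hash, path in java_blobs("."):
        print(blob_hash, path)
```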
The models and all code for training and evaluation are located in the `authorship_pipeline` directory. To run experiments:
- Create a configuration file manually (for examples, see the `configs` directory and the sketch after this list), or edit and run `generate_configs.py`.
- Run:

  ```
  python run_classification.py configs/path/to/your/config.yaml
  ```

- To draw graphs for the evaluation on your project, run:

  ```
  draw_graphs.py --project your_project_name
  ```
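The configuration schema is defined by the examples in `configs`; every key in the sketch below is a hypothetical placeholder rather than the real schema. It only illustrates, in the spirit of `generate_configs.py`, how such a YAML file could be produced programmatically:

```python
import yaml  # requires PyYAML

# All keys are hypothetical placeholders; consult the examples
# in the configs/ directory for the actual field names.
config = {
    "dataset": "datasetName",           # dataset to evaluate on
    "classifier": "nearest_neighbors",  # hypothetical model choice
    "n_folds": 10,                      # hypothetical cross-validation setting
    "seed": 42,                         # hypothetical RNG seed
}

with open("configs/datasetName.yaml", "w") as f:
    yaml.safe_dump(config, f)
```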
To run cross-validation on new data:
- Put the source code files in `datasets/datasetName/{author}/{files}`, making sure all files of each author are in a single directory (see the layout sketch at the end of this section).
- Run data extraction to mine path contexts from the source files:
  ```
  java -jar attribution/pathminer/extract-path-contexts.jar snapshot \
      --project datasets/datasetName/ \
      --output processed/datasetName/ \
      --java-parser antlr \
      --maxContexts 2000 --maxH 8 --maxW 3
  ```
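Here, `--maxContexts` caps the number of path contexts kept per extracted sample, while `--maxH` and `--maxW` bound the height and width of the mined AST paths, the standard limits used in path-based code representations such as code2vec.

For reference, the layout expected by the first step looks like this (all directory and file names are illustrative):

```
datasets/
└── datasetName/
    ├── alice/
    │   ├── Foo.java
    │   └── Bar.java
    └── bob/
        └── Baz.java
```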