The goal of this project is to evaluate a code2vec-based approach to authorship identification and to explore and address issues in existing datasets for source code authorship attribution.
- De-anonymizing Programmers via Code Stylometry | source code
- Source Code Authorship Attribution using LSTM Based Networks | source code
- Authorship attribution of source code by using back propagation neural network based on particle swarm optimization | source code not available
- Google Code Jam submissions, C/C++/Python
- 40 authors, Java
- Projects mined from GitHub with a new data collection approach
The Java, C++, and Python datasets are also available here.
The data extraction pipeline consists of two modules: Gitminer, written in Python, and Pathminer, written in Kotlin.
- Gitminer processes the history of a Git repository to extract all blobs containing Java code (see the sketch after this list).
- Pathminer uses GumTree to parse Java code and track method changes through the repository's history.
- To extract data from GitHub projects, store the names and links of the GitHub projects in `projects` and `git_projects`, respectively. Then go to the `runner` directory and run `run.py`.
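For intuition, here is a minimal sketch of the blob-extraction step that Gitminer performs. This is not the actual Gitminer implementation: it shells out to plain `git`, and the repository path and output handling are assumptions.

```python
import subprocess

def java_blobs(repo_path):
    """Yield (blob_hash, path) pairs for every Java blob reachable
    from any commit in the repository's history."""
    # Enumerate every commit in the repository.
    commits = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    seen = set()
    for commit in commits:
        # List all blobs in this commit's tree (recursively).
        tree = subprocess.run(
            ["git", "-C", repo_path, "ls-tree", "-r", commit],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in tree.splitlines():
            # Each line looks like: "<mode> blob <hash>\t<path>"
            meta, path = line.split("\t", 1)
            blob_hash = meta.split()[2]
            if path.endswith(".java") and blob_hash not in seen:
                seen.add(blob_hash)
                yield blob_hash, path

if __name__ == "__main__":
    for blob_hash, path in java_blobs("."):
        print(blob_hash, path)
```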
The models and all code for training and evaluation are located in the `authorship_pipeline` directory. To run experiments:
- Create a configuration file manually (for examples, see the `configs` directory and the sketch after this list), or edit and run `generate_configs.py`.
- Run:

  ```
  python run_classification.py configs/path/to/your/config.yaml
  ```

- To draw graphs for the evaluation on your project, run:

  ```
  draw_graphs.py --project your_project_name
  ```
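The configuration schema is defined by the examples in `configs`; every key in the sketch below is a hypothetical placeholder rather than the real schema. It only illustrates, in the spirit of `generate_configs.py`, how such a YAML file could be produced programmatically:

```python
import yaml  # requires PyYAML

# All keys are hypothetical placeholders; consult the examples
# in the configs/ directory for the actual field names.
config = {
    "dataset": "datasetName",           # dataset to evaluate on
    "classifier": "nearest_neighbors",  # hypothetical model choice
    "n_folds": 10,                      # hypothetical cross-validation setting
    "seed": 42,                         # hypothetical RNG seed
}

with open("configs/datasetName.yaml", "w") as f:
    yaml.safe_dump(config, f)
```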
To run cross-validation on new data:
- Put the source code files in `datasets/datasetName/{author}/{files}`, making sure all files of each author are in a single directory (see the layout sketch at the end of this section).
- Run data extraction to mine path contexts from the source files:
  ```
  java -jar attribution/pathminer/extract-path-contexts.jar snapshot \
      --project datasets/datasetName/ \
      --output processed/datasetName/ \
      --java-parser antlr \
      --maxContexts 2000 --maxH 8 --maxW 3
  ```
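Here, `--maxContexts` caps the number of path contexts kept per extracted sample, while `--maxH` and `--maxW` bound the height and width of the mined AST paths, the standard limits used in path-based code representations such as code2vec.

For reference, the layout expected by the first step looks like this (all directory and file names are illustrative):

```
datasets/
└── datasetName/
    ├── alice/
    │   ├── Foo.java
    │   └── Bar.java
    └── bob/
        └── Baz.java
```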