Why is there a need to generate a dataset?
- To implement any machine learning/deep learning algorithm, we need a bigger and better dataset of SPDX licenses. Unfortunately, no such dataset of open-source licenses exists on the web, and due to this lack of data all 10 algorithms tested on Atarashi so far are restricted to 59% accuracy.
- Advanced architectures and algorithms such as LSTMs, GRUs, BERT, WordNet, etc. require large volumes of data before they can outperform even traditional approaches such as TF-IDF and n-gram models.
- Licenses differ from traditional text corpora: any two licenses share roughly 50-60% of their keywords, and if two licenses have the same heading but different versions, they are around 90% similar.
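To make the overlap claim concrete, here is a minimal sketch of how keyword similarity between two license texts could be measured. The word-level tokenization and the Jaccard measure are illustrative choices only, not part of the project scripts:

```python
import re

def keyword_set(text):
    """Lowercase the text and keep its unique word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_overlap(license_a, license_b):
    """Jaccard similarity between the keyword sets of two license texts."""
    a, b = keyword_set(license_a), keyword_set(license_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Two versions of the same license typically score close to the ~90%
# mentioned above, while unrelated licenses still share much boilerplate.
```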
SPDX recent release : SPDX
python ./Download-licenses-Script/spdx.py
SPDX-exceptions recent release : SPDX-exceptions
python ./Download-licenses-Script/exceptions.py
Licenses in FOSSology Database : licenseRef
python ./Download-licenses-Script/database-foss.py
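For illustration, a minimal sketch of one way to pull the SPDX license texts from the public license-list JSON index. The field names (`licenses`, `licenseId`, `detailsUrl`, `licenseText`) are taken from the published SPDX license-list data; this is not necessarily how spdx.py is implemented:

```python
import json
import os
import urllib.request

INDEX_URL = "https://spdx.org/licenses/licenses.json"

def download_spdx_licenses(out_dir="licenses"):
    """Fetch every license text listed in the SPDX license-list index."""
    os.makedirs(out_dir, exist_ok=True)
    with urllib.request.urlopen(INDEX_URL) as resp:
        index = json.load(resp)
    for entry in index["licenses"]:
        # Each entry's detailsUrl points to a JSON file with the full text.
        with urllib.request.urlopen(entry["detailsUrl"]) as resp:
            details = json.load(resp)
        path = os.path.join(out_dir, entry["licenseId"] + ".txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(details["licenseText"])

if __name__ == "__main__":
    download_spdx_licenses()
```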
The basic idea is to n-gram licenses while maintaining a sliding window, i.e. for a license with 4 paragraphs, the files to generate are: para1, para2, para3, para4, para1+para2, para2+para3, para3+para4, para1+para2+para3, para2+para3+para4, and para1+para2+para3+para4. Combinations such as para1+para3 or para1+para3+para4 are excluded, because the structure of the licenses needs to be maintained: only contiguous runs of paragraphs are joined. A minimal sketch of this windowing follows the file references below.
python ./Script-Initial-Split/initial_split.py
Script : initial_split
Files : SPDX
Files : FOSSologyDatabase
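The sketch below shows the contiguous windowing idea. Splitting paragraphs on blank lines is an assumption here; initial_split.py may delimit paragraphs differently:

```python
def sliding_window_splits(license_text):
    """Yield every run of consecutive paragraphs from a license text.

    For a 4-paragraph license this yields para1, para2, ..., para1+para2,
    ..., para1+para2+para3+para4, but never non-contiguous combinations
    such as para1+para3, so the structure of the license is preserved.
    """
    paragraphs = [p.strip() for p in license_text.split("\n\n") if p.strip()]
    n = len(paragraphs)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield "\n\n".join(paragraphs[start:end])

# Example: a 4-paragraph license produces 4 + 3 + 2 + 1 = 10 files.
```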
Regexes from the STRINGS.in file are added to the split files. Regex expansion is done through free and open-source libraries such as xeger and intxeger.
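A minimal sketch of expanding a regex into matching text with xeger; the patterns shown are made-up placeholders, not actual STRINGS.in entries:

```python
from xeger import Xeger

# limit caps the length generated for unbounded quantifiers such as .* or .+
x = Xeger(limit=16)

# Hypothetical patterns of the kind found in STRINGS.in
patterns = [r"Copyright \(c\) [0-9]{4}", r"(the )?author(s)?"]

for pattern in patterns:
    # Produce one random string that matches the pattern
    print(pattern, "->", x.xeger(pattern))
```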
To handle expansions such as .{1,32} and .{1,64}, two algorithms are being considered (a minimal sketch follows the list below):
A. NGRAM
(basically a set of co-occurring words within a given window)
B. MARKOV
(As an extension of Naive Bayes to sequential data, the Hidden Markov Model provides a joint distribution over the letters/tags, assuming dependencies only between adjacent tags in the sequence.)
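As an illustration of the second approach, here is a minimal Markov-chain sketch that learns word transitions from existing license text and generates a short filler phrase for a .{1,32}-style gap. It is a toy example, not the project's markov_licenses.py implementation:

```python
import random
from collections import defaultdict

def build_chain(corpus_text):
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus_text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate_filler(chain, max_chars=32, seed_word=None):
    """Walk the chain to produce filler text no longer than max_chars."""
    word = seed_word or random.choice(list(chain))
    out = word
    while chain.get(word):
        word = random.choice(chain[word])
        if len(out) + 1 + len(word) > max_chars:
            break
        out += " " + word
    return out

corpus = ("Permission is hereby granted free of charge to any person "
          "obtaining a copy of this software")
chain = build_chain(corpus)
print(generate_filler(chain, max_chars=32))
```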
Added "Multiprocessing" to the Script to speed up the process of data generation.
Codebase : Ngram
To generate licenses with ngram expansion:
python ./ngram/licenses.py
Codebase : Markov
To generate licenses with markov expansion:
python ./markov/markov_licenses.py
Using Nomos to validate the generated files. This is a baseline regex-based validation of the text files generated by both algorithms. The terminal command to run it is:
sudo nomos -J -d <folder_with_files>
And to use multiple cores to validate the files (here 3 cores are used):
sudo nomos -J -d <folder_with_files> -n 3
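For convenience, a minimal sketch of invoking Nomos over several output folders from Python, using only the flags shown above. The folder names are placeholders, and interpreting the JSON output afterwards is left out:

```python
import subprocess

# Folders produced by the ngram and markov generators (names are placeholders)
folders = ["./ngram-output", "./markov-output"]

for folder in folders:
    # Same command as above: JSON output, directory scan, 3 worker cores.
    # Prefix with sudo if Nomos requires elevated permissions on your setup.
    subprocess.run(["nomos", "-J", "-d", folder, "-n", "3"], check=True)
```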