A full report on my Google Summer of Code 2020 work with FOSSology
Project: "Accelerating Atarashi" 👨💻
A Python library for Comments and Source Code Extraction
- Codebase: GitHub
- Library: PyPI
- Documentation: Nirjas-Wiki
To scan a file for various open-source licenses, one of the crucial steps is to extract the comments out of the code so that the base algorithms (agents) can detect the license. The license texts live in the comment section, which keeps them separate from the actual source code.
Kaushlendra and I worked on developing a fully dedicated Python library from scratch for these tasks and managed to publish the initial version on PyPI before the first evaluation.
Nirjas is live on PyPI and can be installed using `pip install nirjas`.
The major task was to classify the different types of comments and write separate logic for each of them (illustrated in the example after this list). The types are:
- Single-line comments
- Multi-line comments
- Continuous single-line comments (consecutive lines, each commented out using the single-line syntax)
- Inline comments (comments written after the code on the same line)
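For illustration, here is how the four categories look in a Python source file:

```python
# A single-line comment

'''
A multi-line comment: Python uses a triple-quoted string
spanning several lines as its block-comment convention.
'''

# A continuous single-line comment:
# several consecutive lines, each commented out
# with the single-line syntax.

answer = 42  # An inline comment, written after code on the same line
```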
The library can extract comments as well as code from files in more than 20 popular programming languages. It also provides all the required metadata about your code, comments, and file(s). The library is available for public use and can be used in projects across various domains.
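A minimal usage sketch, assuming the `extract()` entry point of the published package (see the Nirjas wiki for the exact API and the shape of the returned data):

```python
import nirjas

# Assumption: extract() takes the path of a supported source file and returns the
# comments, the stripped source code and the file metadata; check the wiki for details.
result = nirjas.extract("example.py")
print(result)
```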
- feat(extractor): add language identifier #1
- Complete development branch into master #2
- feat(nirjas): Add continuous single line as multiline & bug fixes #6
- Add Text File Support and Bug Fix #8
The complete list of Open and Closed PRs can be found at Nirjas/Pull requests
The next task was to replace the existing code comment extractor in Atarashi with Nirjas. At this point, the existing extractor was not working at all and threw an error whenever an agent was called, so creating our own code comment extractor was the right decision.
Nirjas currently supports almost all of the major programming languages and will be continuously developed and maintained by FOSSology itself.
The integration is done in such a way that it extracts and passes on only those comments which contain a license statement. Nirjas's comment classification played a big role here, followed by our customized list of tokens that helps us find the actual license comment among all the other comments. Earlier, the comment extractor used to pass all the comments, which made the input string a little noisier for detection.
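A rough sketch of the filtering idea, assuming a hypothetical token list and helper function (these names are mine, not Atarashi's actual code):

```python
# Hypothetical token list; the actual customized list lives in the integration code.
LICENSE_TOKENS = {"license", "licence", "copyright", "spdx", "permission", "warranty"}

def license_comments(comments):
    """Keep only the extracted comments that look like license statements."""
    selected = []
    for comment in comments:
        words = set(comment.lower().split())
        if words & LICENSE_TOKENS:  # at least one license-related token present
            selected.append(comment)
    return selected
```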
A small change was also made in the Evaluator: the testing files were zipped and the existing code was improved.
The main idea was to create an inverted index for all the license texts and then use TF-IDF scores to detect the licenses. This was supposed to decrease the detection time drastically and make the agents faster.
The inverted index is created in the form:

```
{
  "keyword1": [
    ["doc1", <TF-IDF score>],
    ["doc2", <TF-IDF score>]
  ],
  "keyword2": [
    ["doc3", <TF-IDF score>],
    ["doc2", <TF-IDF score>],
    ["docN", <TF-IDF score>]
  ]
}
```
Then, for every input comment, we extract the keywords and compare their TF-IDF scores with the postings inside the inverted index file. The documents with the closest TF-IDF scores are ranked in order, and the top result is returned as the detected license.
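A minimal sketch of the lookup step, assuming the index layout shown above; the names (`query_license`, `inverted_index`) are illustrative and not the agent's actual API, and the sketch ignores refinements such as normalizing by the number of matched keywords:

```python
from collections import defaultdict

def query_license(query_scores, inverted_index):
    """query_scores maps every keyword extracted from the input comment to its
    TF-IDF score; documents whose posted scores are closest to these rank first."""
    distances = defaultdict(float)
    for keyword, query_score in query_scores.items():
        for doc, posted_score in inverted_index.get(keyword, []):
            # Accumulate how far each candidate document's score is from the query's score.
            distances[doc] += abs(posted_score - query_score)
    # Smallest accumulated distance = closest TF-IDF profile = detected license.
    return min(distances, key=distances.get) if distances else None
```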
Although the algorithm succeeded in decreasing the scanning time from around 1200 seconds to 260 seconds (for 100 files), unfortunately we were not able to improve the accuracy. After applying various searching techniques, the maximum accuracy we achieved was 50%, which is lower than that of the original TF-IDF agent (59%).
In my opinion, the two main factors that affect the performance of the algorithm (in terms of accuracy) are:
1. Irregularity in the size of license texts
The license texts are of very different sizes, and there is a huge difference in their keyword counts, which distorts the postings of the keywords. Longer texts contain most of the unique keywords, which undermines the distinctiveness of keywords in the shorter texts. As a result, the longer license texts dominate the output. For better results, the texts should be normalized, giving all license texts an equal opportunity (one possible approach is sketched after the next point).
2. License texts differ from traditional text corpora
In a traditional corpus, the documents usually differ from one another to some extent, and that is what distinguishes them. In license texts, however, most of the tokens appear in the majority of the texts, with only slight variations in how they are combined into a license statement. This makes sense, because they are all open-source licenses talking about open-source software and permissions. These similarities between licenses make them hard to differentiate with any traditional information retrieval algorithm.
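Regarding the first factor, one possible direction (my suggestion, not something the current agent does) is to length-normalize the TF-IDF vectors, sketched here with scikit-learn's TfidfVectorizer:

```python
# Sketch of length normalization (a suggestion, not part of the current agent):
# with norm="l2", every license text maps to a unit-length TF-IDF vector, so long
# texts no longer dominate the scoring simply because they contain more keywords.
from sklearn.feature_extraction.text import TfidfVectorizer

license_texts = ["<full text of license A>", "<full text of license B>"]  # placeholders
vectorizer = TfidfVectorizer(norm="l2", stop_words="english")
tfidf_matrix = vectorizer.fit_transform(license_texts)  # each row has unit L2 norm
```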
- GitHub Repo: TFIDF-Invterted-Index
To implement any machine learning/deep learning algorithm we need a better and bigger dataset of SPDX licenses. Unfortunately, no such dataset for open-source licenses exists on the web.
To generate the dataset, the base approach we used is to treat the paragraphs of a license text as n-grams and generate different combinations of them. Suppose a license text has 5 paragraphs [1,2,3,4,5] in order. To create the dataset we include subsets like [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and likewise all combinations starting from 2, 3, 4 and 5, each one carrying the same label.
Using this technique, we were able to generate more than 1 million files from 447 SPDX license files.
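A hedged sketch of one reading of this scheme (names are illustrative, not the repository's actual code): every contiguous run of paragraphs, beginning at each position, becomes one sample with the license's label.

```python
def paragraph_combinations(paragraphs):
    """Yield every contiguous slice: [1], [1,2], ..., [1..n], [2], [2,3], ..."""
    for start in range(len(paragraphs)):
        for end in range(start + 1, len(paragraphs) + 1):
            yield paragraphs[start:end]

paragraphs = ["para1", "para2", "para3", "para4", "para5"]
# Each generated combination keeps the same SPDX label as the original license text.
samples = [("\n\n".join(combo), "MIT") for combo in paragraph_combinations(paragraphs)]
print(len(samples))  # 15 samples from one 5-paragraph license text
```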
However, not all paragraphs are equally important, and most of them will add a lot of noise to the dataset. To resolve this, we will pick only the paragraphs with high relevance and then repeat the same process.
A few updates we still need to make:
- Shifting from txt files to SPDX JSON endpoint
- Differentiating License Header from Full Text
- Adding FOSSology Nomos agent STRINGS.in regex in dataset creation
- GitHub Repo: SPDX OSS Dataset
During the GSoC period, I also got time to create and organize documentation for both Atarashi and Nirjas. The documentation contains all the user and developer information for the projects and is organized so that it is easily accessible to everyone.
The Documentation can be found at:
- Atarashi - Atarashi GitHub Wiki
- Nirjas - Nirjas GitHub Wiki
Tasks | Planned | Completed | Remarks |
---|---|---|---|
Creating Nirjas | Yes | ✔️ | Beta version is live; the project will be developed and maintained continuously |
Publishing to PyPI | Yes | ✔️ | Nirjas is live and can be installed and used in projects |
Integrating Nirjas with Atarashi | Yes | ✔️ | We can select the specific license comment from all extracted comments |
Implementing an inverted index with TF-IDF | Yes | ✔️ | The desired accuracy could not be achieved with this algorithm |
Creating the SPDX OSS Dataset | No | ✔️ | The dataset can be improved further and development is ongoing |
Implementing BERT (OPTIONAL) | Yes, but OPTIONAL | ❌ | Can only be implemented after the dataset creation |
- Implement complete regexes in Nirjas, covering most of the boundary cases
- Improve the created SPDX OSS Dataset
- Continue developing Nirjas and Atarashi
- Maintain Nirjas and Atarashi
- Search for other methods that could be implemented for license scanning
- Learned about various NLP techniques by studying, testing, and implementing them
- Learned about various open-source licenses and their importance in code, projects, and software
- Learned to develop a complete library from scratch
- Learned how Python projects are packaged and maintained
- Sharpened my Git skills
- Learned various information retrieval algorithms and traditional searching techniques
- Learned to create a better and cleaner dataset
- Improved my knowledge of data science
- Learned the importance of time management as well as of quality deliverables
- Improved my documentation skills
- Improved my communication and presentation skills