Copyright False Positive Detection using ML @ FOSSology
Project Details | Contributions | Deliverables | Future Goals | Key Takeaways | Acknowledgements
In the current scenario, most projects carry copyright notices defining the terms of usage for their product. Software like FOSSology uses rule-based approaches for copyright detection and scanning: agents like nomos use regex-based approaches to extract statements from a project, a regex-based agent then shows the results with the several copyright statements found in the project, and a further set of agents deactivates the copyrights that are false. Still, a lot of statements are left in the agent's findings, and a user then has to get involved in manual review. It had become a two-step process, which was not ideal.
My proposed ideas and objectives revolved entirely around FOSSology: introducing a Natural Language Processing based approach for pre-processing, and then recognising the pattern that separates a false copyright statement from a true one with the help of NLP and automation. Another planned functionality was removing the clutter from the original extracted copyright statement. The overall goal of the proposed ideas was to introduce these new functionalities into FOSSology.
A Python-based approach to analyse copyright statements
- Codebase: GitHub
- Documentation: FalsePositiveDetection-repo
One thing about copyright statements is very intriguing: they look predictable, yet there are millions of variations in how they look and how many things they can contain, despite that predictable architecture. The first task therefore revolved around understanding this architecture through a close look at the text itself, i.e. the types of named entities and the parts of speech in our case.
From there, I decided to predict a specific structure that is followed by most copyright statements despite the variations in how they look. Two lists were then hypothesised: one of named entities and one of POS tags. These hypothesised lists served as a benchmark for the ideal structure, and they clarified the further understanding and outline of the complete project.
According to NER, the structure looked like:
Statement: "Copyright (c) 2021, Kaushlendra Pratap (kaushlendra@xyz.com)"
Probable NER Entity looks Like: ['DATE', 'PERSON', 'CARDINAL', 'ORG']
Probable POS Tags looks Like: ['NOUN', 'NUM', 'PROPN', 'PROPN']
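The benchmark check can be sketched in Python. The two hypothesised lists come straight from the example above; the overlap-based matching rule is my assumption for illustration, not necessarily the script's exact logic.

```python
# Hypothesised "ideal" structure for a true copyright statement,
# taken from the example above.
EXPECTED_ENTITIES = {"DATE", "PERSON", "CARDINAL", "ORG"}
EXPECTED_POS = {"NOUN", "NUM", "PROPN"}

def matches_benchmark(entities, pos_tags):
    """Return True when the extracted tags overlap both benchmark lists."""
    has_entity = bool(set(entities) & EXPECTED_ENTITIES)
    has_pos = bool(set(pos_tags) & EXPECTED_POS)
    return has_entity and has_pos

# Tags as an NER/POS tagger might return them for the example statement
ents = ["DATE", "PERSON", "CARDINAL", "ORG"]
tags = ["NOUN", "NUM", "PROPN", "PROPN"]
print(matches_benchmark(ents, tags))  # True
```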
After testing the predicted architecture and getting good accuracy in recognising most of the copyright statements from the provided datasets, writing the script started. Its working looked like this:
The task was divided into three sections:
- Text pre-processing, to make the input data more accurate and less cluttered.
- A function that calculates the NER entities and POS tags for each statement, iteratively.
- A two-stage filtering if-else ladder that marks "T" if a match is found and "F" if not; a new column, "is_copyright", was introduced in the CSV to hold this result.
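The three steps above can be sketched together. The column names, the regex pre-processing, and the stand-in `tag()` function (which fakes the NER pass so the sketch stays runnable without an NLP library) are all assumptions for illustration, not the script's real internals.

```python
import re

COPYRIGHT_ENTITIES = {"PERSON", "ORG", "DATE"}

def preprocess(text):
    # Step 1: strip leading comment markers and collapse whitespace.
    text = re.sub(r"^[#/*\s]+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def tag(text):
    # Step 2: stand-in for the real NER call -- it just looks for a
    # year and a capitalised word so the sketch stays self-contained.
    ents = set()
    if re.search(r"\b(19|20)\d{2}\b", text):
        ents.add("DATE")
    if re.search(r"\b[A-Z][a-z]+\b", text):
        ents.add("PERSON")
    return ents

def classify(rows):
    # Step 3: two-stage filter -- mark "t" on a match, "f" otherwise,
    # writing the result into a new "is_copyright" field per row.
    for row in rows:
        text = preprocess(row["content"])
        ents = tag(text)
        if "copyright" in text.lower() and ents & COPYRIGHT_ENTITIES:
            row["is_copyright"] = "t"
        else:
            row["is_copyright"] = "f"
    return rows

rows = [
    {"content": "# Copyright (c) 2021, Kaushlendra Pratap"},
    {"content": "// see the README for details"},
]
print([r["is_copyright"] for r in classify(rows)])  # ['t', 'f']
```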
Accuracy Calculation:
The accuracy calculation was done with the help of several datasets that were manually tagged by a human. Iterating over the CSV:
IF ManualTag == AlgorithmTag:
    counter += 1
accuracy_score = (counter / total_occurrence) * 100
The accuracy was divided into: FP_accuracy, TP_accuracy, TN_accuracy and FN_accuracy.
Overall accuracy = (TP + TN)/(TP + FP + TN + FN)
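The accuracy loop can be sketched as follows, assuming the CSV rows expose `manual_tag` and `algorithm_tag` fields (the field names are hypothetical):

```python
from collections import Counter

def confusion(rows):
    """Count TP/FP/TN/FN by comparing the human tag with the script's tag."""
    c = Counter()
    for r in rows:
        if r["manual_tag"] == "t":
            c["TP" if r["algorithm_tag"] == "t" else "FN"] += 1
        else:
            c["FP" if r["algorithm_tag"] == "t" else "TN"] += 1
    return c

# A tiny hand-made sample standing in for the manually tagged datasets
rows = [
    {"manual_tag": "t", "algorithm_tag": "t"},  # TP
    {"manual_tag": "f", "algorithm_tag": "t"},  # FP
    {"manual_tag": "f", "algorithm_tag": "f"},  # TN
    {"manual_tag": "t", "algorithm_tag": "t"},  # TP
]
c = confusion(rows)
accuracy = (c["TP"] + c["TN"]) / sum(c.values()) * 100
print(accuracy)  # 75.0
```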
Copyright statements do not always come with a clean, direct structure; license statements are often appended to them at the end.
Normal Copyright with clutter:
Copyright (c) 2021, Kaushlendra Pratap Singh. Distributed Under the MIT license ....
Copyright with clutter removal:
Copyright (c) 2021, Kaushlendra Pratap Singh
The approach taken was:
IF is_copyright == "t":
    string = copyrightStatement
    IF 'ORG' or 'PERSON' in NER_LIST:
        cleaned = string[0 : string.index(org_name) + len(org_name)]  # same way for person_name
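The clutter-removal step can be sketched as keeping everything up to the end of the last ORG/PERSON entity and dropping the trailing license text. Here the entity texts are passed in directly for illustration; the real script obtains them from its NER pass.

```python
def remove_clutter(statement, entity_texts):
    """Keep the statement up to the end of the last known entity mention."""
    end = 0
    for name in entity_texts:
        idx = statement.find(name)
        if idx != -1:
            end = max(end, idx + len(name))
    # If no entity was found, return the statement unchanged.
    return statement[:end] if end else statement

stmt = "Copyright (c) 2021, Kaushlendra Pratap Singh. Distributed Under the MIT license"
print(remove_clutter(stmt, ["Kaushlendra Pratap Singh"]))
# Copyright (c) 2021, Kaushlendra Pratap Singh
```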
INTEGRATION WITH FOSSOLOGY:
FOSSology has a list of several agents like Nomos, Monk, Ninka, Decider etc. The main goal was to introduce the Python script into the PHP code and then use it from the Decider agent.
The tasks in hand were:
- To create two flags on the UI: Copyright Deactivation, and Copyright Deactivation with Clutter Removal.
- To create two rules and two separate functions in `DeciderAgent.php` to call the Python script and then update the database with the true and deactivated copyright statements, differentiating the functionality of both functions and providing the required `$uploadID`, `$content`, `$action` and `$hash`.
- To add changes in the Makefile so the script is installed with `make install`.
- To create a `mod_deps` file to introduce and install the dependencies required to run the script.
Each task was accomplished and the agent was completely integrated.
RESULTS
The results of this work:
- The pull request with the script, integration changes, UI change and database update code: Check the PR from here
- The progress has been regularly recorded every week and kept in a separate wiki: Check WPRs from here
- The setup and user documentation of the script in the Decider agent: Check documentation
- The installation and user documentation for the Jupyter Notebook: README
Tasks | Planned | Completed | Remarks |
---|---|---|---|
Introducing NER and POS tagging for Copyright Statements | Yes | ✔️ | This was like the POC for the idea. |
Implementing the Hypothesis as a working product. | Yes | ✔️ | The working of the script is efficient but can be improved further. |
Accuracy Score calculation and Testing | Yes | ✔️ | The accuracy is acceptable but can be improved with more checks involved |
Integrating the Script with Fossology | Yes | ✔️ | Integration is done and can be used with fossology installation |
Documenting the working of Script | Yes | ✔️ | NONE |
- Implementing further layers of checks to cover the edge cases.
- Going through other NLP techniques to understand other perspectives on copyright statements.
- Maintaining the agent and working towards further accuracy in the clutter-removal techniques.
- Staying with the FOSSology community as a contributor and helping future developers get started with FOSSology, Atarashi and Nirjas.
- Continuing to maintain Atarashi and Nirjas.
- Learnt the art of collaboration and working on real-time software development.
- Improved programming skills, including OOP concepts and Modular Programming.
- Learnt a lot about NLP techniques for pre-processing text.
- Learnt about the importance of open-source copyrights and their detailed analysis.
- Improved Git skills.
- Learnt how a full-fledged system like FOSSology functions from a Model-View-Controller perspective.
- Got better at analysing code and debugging it more easily.
- Learnt the importance of a well-equipped dataset, and created one from scratch for training our own NER model.
- Punctuality and adaptability according to time and situation.
- Communicating properly, presenting code and asking doubts without hesitation.
This year's Google Summer of Code came with extra fun because it was my second time participating with FOSSology, and it ends with a little sadness because it was my last time participating as a student developer. There are several people to whom I want to extend my regards.
I want to thank and appreciate my mentors Michael C. Jaeger, Anupam Ghosh, Gaurav Mishra, Vasudev Maduri, Ayush Bharadwaj and Shaheem Azmal M MD. Without their help and support, all this would not have been possible.
Now, I would like to extend my regards to two very important figures who helped me steer through all the challenges (PS: not just GSoC :P), Ayush Bharadwaj and Sahil Jha.
Finally, I am glad to have met all the fellow developers along the way. You guys are awesome; keep up the great work.