This page serves as an online appendix for the research paper "Information Detection in Business Analysis Transcripts; an Ontology Approach" (currently unpublished), written by Tjerk Spijkman, Boris Winter, Sid Bansidhar and Sjaak Brinkkemper. In addition to providing extra information, we provide public access to the code used, for both transparency and data sharing purposes.
The NLP tool extracts both known and unknown terms from a given transcript. The extraction process starts with some initial text processing, in which the timestamps are removed from the transcript document. Subsequently, two operations are executed simultaneously: one for the known words and one for the unknown words.
For the known words, punctuation marks and stop words are first removed from the text. Next, the transcript is split up according to the different speakers that are present in it. For each speaker snippet, the script iterates through each word and matches it against the ontology. If the concept is found in the ontology, it is added to the table for that speaker. All tables are then merged into one final table containing all the ontology matches.
For the unknown words, the extraction process has some minor differences. Instead of matching the text to an ontology, the noun phrase property of the TextBlob package is used. A noun phrase is a combination of words that occur together in a sentence and that are headed by a noun. Noun phrases can be useful for indicating what is being discussed in a transcript and can possibly indicate a requirement. Before the text is processed with the TextBlob package, ontology matches are removed from the transcript in order to prevent false positives. The resulting noun phrases are added to the table for the associated speaker. Once again, all tables are merged for the final result.
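The known-word steps described above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the transcript layout (`[hh:mm:ss] Speaker: utterance`), the tiny stop-word set, and all function names are assumptions made for the example. Because matching is word by word, multi-word ontology concepts would not be found by this sketch.

```python
import re
import string
from collections import Counter

# Assumed (hypothetical) transcript layout: "[hh:mm:ss] Speaker: utterance".
TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]\s*")
# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "we", "in", "for"}


def split_by_speaker(transcript):
    """Strip timestamps and join all utterances of each speaker."""
    snippets = {}
    for line in transcript.splitlines():
        line = TIMESTAMP.sub("", line).strip()
        if ":" not in line:
            continue  # skip lines without a "Speaker:" prefix
        speaker, text = line.split(":", 1)
        snippets.setdefault(speaker.strip(), []).append(text.strip())
    return {speaker: " ".join(parts) for speaker, parts in snippets.items()}


def tokenize(text):
    """Lowercase the text, then drop punctuation and stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def match_known_terms(transcript, ontology):
    """Return one frequency table per speaker and the merged table."""
    per_speaker = {}
    for speaker, text in split_by_speaker(transcript).items():
        per_speaker[speaker] = Counter(
            word for word in tokenize(text) if word in ontology
        )
    merged = sum(per_speaker.values(), Counter())
    return per_speaker, merged
```

Here `ontology` is assumed to be a set of lowercase single-word terms, for example one loaded line by line from a text file.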
The figure below shows a BPMN process model of the NLP tool.
The NLP tool was written in Python as Jupyter notebooks (.ipynb), which can be found at: https://github.com/Bowis/keyextractor. It consists of 4 files:
- splitter.ipynb: splits the transcript into multiple sentences, allowing further tokenization.
- nltk.ipynb: computes some basic statistics of the entered transcript.
- known.ipynb: matches and counts the known terms (those present in the ontology/ontology.txt file) that occur in the entered transcript.
- unknown.ipynb: finds possibly unknown terms in the transcript with the use of the TextBlob package (https://textblob.readthedocs.io/en/dev/).
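As a rough illustration of the unknown-term step, the sketch below removes ontology matches from each speaker snippet and then collects noun phrases. It is not the code in unknown.ipynb: the function names and the shape of `snippets` (a dict mapping speaker to text) are assumptions. The only TextBlob feature used is the `noun_phrases` property, which requires the package and its corpora to be installed (`pip install textblob`, then `python -m textblob.download_corpora`).

```python
import re


def textblob_noun_phrases(text):
    """Extract noun phrases with TextBlob's `noun_phrases` property."""
    from textblob import TextBlob  # lazy import: only needed for this extractor
    return list(TextBlob(text).noun_phrases)


def remove_known_terms(text, ontology):
    """Blank out whole-word ontology matches so they are not reported as unknown."""
    if not ontology:
        return text
    # Longest terms first, so a longer term wins over a shorter one it contains.
    terms = sorted(ontology, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.sub(pattern, " ", text, flags=re.IGNORECASE)


def find_unknown_terms(snippets, ontology, extract=textblob_noun_phrases):
    """Map each speaker to the phrases found after ontology terms are removed."""
    return {
        speaker: extract(remove_known_terms(text, ontology))
        for speaker, text in snippets.items()
    }
```

The `extract` parameter is a design choice of this sketch: it lets another phrase extractor be swapped in, or a stub be used when TextBlob is not available.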
The dashboard is a possible extension of the NLP tool created during the research. It visualizes the generated results in a concise manner and allows business analysts and other users to quickly identify existing and unknown concepts in the uploaded transcript.
A mockup of this dashboard has been drafted during the research project, which will be elaborated on in the following section.
Bear in mind that the following screenshots are a mockup; they do not represent a final, working dashboard.
In the first screen of the dashboard, shown in the image above, the user can upload a transcript and an ontology file. The transcript is matched against the ontology file. Once both files have been uploaded, the user can start the transcription process.
Once the transcription process has been completed, the user is presented with the dashboard. In the current version of the mockup, the dashboard consists of four main elements:
- Transcript statistics: some general statistics of the processed transcript, such as the total number of words in the transcript.
- Possible transcript subjects: here the algorithm makes an assumption about the subject of the transcript, paired with its confidence level (shown as a percentage). The user can provide feedback to the algorithm by either confirming or rejecting the assumption.
- Found unknown words: the unknown words generated by the NLP tool, i.e., noun phrases that were not present in the ontology, are shown here.
- Found known words: the known words from the ontology are shown here, paired with their frequency.
For both the unknown and the known words, the user can click a specific word. This opens the interactive PDF reader (shown in the following section) at the point of occurrence of that word.
This page serves as an interactive PDF reader, where the user can quickly find the occurrences of the known and unknown words generated by the NLP tool. The found words are highlighted, so that they are easy to spot in lengthy transcripts. The user can then iterate through all the word occurrences with the previous and next buttons, which float over the transcript. A small selection of the unknown and known words is shown on the left, which can be expanded with the "More Words" button. The user can also execute custom searches with the search bar at the top of the page.
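Since the dashboard is only a mockup, no implementation of this reader exists. Purely as an illustration, the lookup behind the highlighting and the previous/next buttons could be sketched as follows. The names `find_occurrences` and `OccurrenceCursor` are hypothetical, and the wrap-around at the first and last occurrence is an assumption rather than something the mockup specifies.

```python
import re


def find_occurrences(text, term):
    """Return (start, end) character offsets of each whole-word match of `term`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [match.span() for match in pattern.finditer(text)]


class OccurrenceCursor:
    """Steps through a list of occurrences, wrapping around at either end."""

    def __init__(self, spans):
        self.spans = spans
        self.index = 0

    def current(self):
        """Return the current span, or None if the term never occurs."""
        return self.spans[self.index] if self.spans else None

    def next(self):
        """Move to the following occurrence (assumed to wrap to the first)."""
        if self.spans:
            self.index = (self.index + 1) % len(self.spans)
        return self.current()

    def previous(self):
        """Move to the preceding occurrence (assumed to wrap to the last)."""
        if self.spans:
            self.index = (self.index - 1) % len(self.spans)
        return self.current()
```

The returned offsets could serve both as highlight ranges and as scroll targets; the same function would also cover the custom search bar, given a user-entered term.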