More and more people exchange text messages through social media, and analyzing this information can yield statistics on people's behavior and psychology. Using Natural Language Processing (NLP), we can extract key words from each message that allow us to achieve these goals. This paper discusses the development of an Automatic Information Extraction system for English-language text messages using the spaCy library, which provides a set of pre-trained models for Named Entity Recognition (NER). In this case, the model considered is RoBERTa, which we analyze in the following paragraphs.
The first step was to choose the dataset for the task introduced in the previous paragraph. We used SMS-NER-Dataset-165-Annotations, found on Kaggle. The dataset was then cleaned to ensure a uniform data representation, split into a training set and a test set, and converted to the `.spacy` format so that the chosen model could process it. Next, the `config.cfg` file was generated: a configuration file holding all the hyperparameters and settings that the model must follow. Finally, the training set was given as input to the pre-trained model, and two models were saved as output:

- `model-last`: the model from the last training iteration (it can be used to resume training at a later time);
- `model-best`: the model that scored highest on the test dataset.
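The split step described above can be sketched as follows. This is only an illustration, not the project's actual script: the function name `split_annotations`, the 80/20 ratio, and the fixed seed are assumptions, since the source does not state how the split was made.

```python
import random


def split_annotations(annotations, test_fraction=0.2, seed=42):
    """Shuffle a copy of the annotated messages and split it into (train, test).

    test_fraction and seed are hypothetical defaults chosen for this sketch.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(annotations)            # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data standing in for the cleaned dataset (made up for this example).
messages = [(f"message {i}", {"entities": []}) for i in range(10)]
train, test = split_annotations(messages)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the training and test sets stable across runs, which makes scores from different training runs comparable.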
This section covers the implementation, in particular the structure of the dataset and the configuration files.
The dataset is in JSON format and is structured as follows:

- `"classes"`: contains the list of tags to be identified within the messages: "MONEY", "TITLE", "OTP", "TRANSAC", "TIME", "PURPOSE";
- `"annotations"`: contains the message list and entity class for each message;
- `"entities"`: each entity is an array of tuples where each tuple has within it two integers and a tag (the integers are the coordinates of the tag associated with a specific phrase, e.g. `[19, 26, "TRANSAC"]`).
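To make the structure concrete, here is a minimal sketch that reads one record and maps each entity triple back to the text it covers. Several things here are assumptions rather than facts from the dataset: the sample message is invented, each annotation is assumed to be a `[text, {"entities": [...]}]` pair, the two integers are assumed to be character offsets with an exclusive end, and `resolve_entities` is a hypothetical helper name.

```python
import json

# Invented sample that mimics the layout described above.
SAMPLE = json.loads("""
{
  "classes": ["MONEY", "TITLE", "OTP", "TRANSAC", "TIME", "PURPOSE"],
  "annotations": [
    ["Your account was debited by Rs 500 on 12 May",
     {"entities": [[17, 24, "TRANSAC"], [28, 34, "MONEY"], [38, 44, "TIME"]]}]
  ]
}
""")


def resolve_entities(text, entities, classes):
    """Return (covered_text, label) pairs after validating every triple."""
    resolved = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if label not in classes:
            raise ValueError(f"unknown label {label!r}")
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the message")
        if start < previous_end:
            # Overlapping spans are rejected here; this is a choice of the sketch.
            raise ValueError(f"span ({start}, {end}) overlaps a previous entity")
        resolved.append((text[start:end], label))
        previous_end = end
    return resolved


text, meta = SAMPLE["annotations"][0]
pairs = resolve_entities(text, meta["entities"], SAMPLE["classes"])
print(pairs)
# [('debited', 'TRANSAC'), ('Rs 500', 'MONEY'), ('12 May', 'TIME')]
```

A check like this is a cheap way to catch off-by-one offsets or unknown tags during data cleaning, before the data is converted for training.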
Within the SMS-NER-Dataset-165-Annotations folder we find the `base_config.cfg` configuration file, used to set up the model that will be trained on the dataset described above.
To set up the model structure we run the command:
python -m spacy init fill-config dataset/SMS-NER-Dataset-165-Annotations/base_config.cfg config.cfg
After that, we start the training phase, followed by the testing phase, by running the command:
python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev test.spacy --gpu-id 0
To conclude, we print the metrics produced by the best model by running the command:
python -m spacy benchmark accuracy model/large/model-best model/large/test.spacy --output --code --gold-preproc --gpu-id 0 --displacy-path model/large
The report can be found at the following link: Report.
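For readers unfamiliar with NER metrics, the sketch below shows how entity-level precision, recall, and F1 are typically computed: a predicted entity counts as correct only when its start, end, and label all match a gold entity. spaCy reports these scores as `ents_p`, `ents_r`, and `ents_f`. The function `ner_prf` and the toy spans are hypothetical and are not taken from the project's report.

```python
def ner_prf(gold, predicted):
    """Compute exact-match precision, recall and F1 over entity spans.

    gold and predicted hold one collection of (start, end, label) triples
    per message, in the same message order.
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        g = {tuple(span) for span in gold_spans}
        p = {tuple(span) for span in pred_spans}
        tp += len(g & p)   # predicted and correct
        fp += len(p - g)   # predicted but not in the gold annotations
        fn += len(g - p)   # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the second MONEY span ends one character too early,
# so it counts as one false positive and one false negative.
gold = [[(17, 24, "TRANSAC"), (28, 34, "MONEY")], [(0, 6, "OTP")]]
pred = [[(17, 24, "TRANSAC"), (28, 33, "MONEY")], [(0, 6, "OTP")]]
precision, recall, f1 = ner_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.667 R=0.667 F1=0.667
```

Because matching is exact, a boundary error of a single character is penalized twice, which is worth keeping in mind when reading the scores.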
Name | Description |
---|---|
Alberto Montefusco | Developer - Alberto-00; Email - a.montefusco28@studenti.unisa.it; LinkedIn - Alberto Montefusco; My WebSite - alberto-00.github.io |
Alessandro Aquino | Developer - AlessandroUnisa; Email - a.aquino33@studenti.unisa.it; LinkedIn - Alessandro Aquino |
Mattia d'Argenio | Developer - mattiadarg; Email - m.dargenio5@studenti.unisa.it; LinkedIn - Mattia d'Argenio |