GitHub - Alberto-00/Estrazione-Automatica-di-Informazioni-da-Testi

1 Introduction

1.1 Problem

More and more people exchange text messages through social media, and analyzing these messages can yield statistics on people's behavior and psychology. Using Natural Language Processing (NLP), we can extract the key words from each message that make such an analysis possible. This paper discusses the development of an automatic information extraction system for English-language text messages, built with the spaCy library, which provides a set of pre-trained models for Named Entity Recognition (NER). In this case, the model considered is RoBERTa, which is analyzed in the following paragraphs.

1.2 Workflow

The first step was to choose a dataset for the task introduced in the previous paragraph. The dataset used is SMS-NER-Dataset-165-Annotations, available on Kaggle. Next, the data was cleaned to ensure a uniform representation. The cleaned dataset was then split into a training set and a test set and converted to the .spacy format so that it could be processed by the chosen model. After that, the config.cfg file was generated: a configuration file containing all the hyperparameters and settings that the model must follow. The training set was then given as input to the pre-trained model, and two models were saved as output:

model-last: the model from the last training iteration (it can be used to resume training at a later time);
model-best: the model that scored highest on the test dataset.

Finally, the Precision, Recall and F1-Score metrics were reported. To find the best setup for the information extraction task, three different pre-trained models were used and compared on their tag-prediction accuracy. The models used were: 1. en_core_web_sm; 2. en_core_web_md; 3. en_core_web

2 Approach

This section covers the implementation, in particular the structure of the dataset and the configuration files.

2.1 Dataset

The dataset is in JSON format and is structured as follows:

"classes": the list of tags to be identified within the messages: "MONEY", "TITLE", "OTP", "TRANSAC", "TIME", "PURPOSE";
"annotations": the list of messages, each with its entity list;

"entities": an array of tuples, where each tuple holds two integers and a tag (the integers are the start and end positions of the phrase the tag applies to, e.g. [19, 26, "TRANSAC"]).

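To make this layout concrete, here is a minimal sketch of how such a file could be read. The nesting of "annotations" as [text, {"entities": [...]}] pairs, the sample message, its offsets and the function name extract_entities are all assumptions made for illustration; they are not taken from the dataset. The end position is assumed to be exclusive, as in Python slicing.

```python
import json

# Hypothetical sample that follows the structure described above.
# The message text and its offsets are invented for illustration.
SAMPLE = """
{
  "classes": ["MONEY", "TITLE", "OTP", "TRANSAC", "TIME", "PURPOSE"],
  "annotations": [
    ["Your OTP is 482913 for payment of Rs 500",
     {"entities": [[12, 18, "OTP"], [34, 40, "MONEY"]]}]
  ]
}
"""


def extract_entities(dataset):
    """Return, for each message, the (tag, tagged text) pairs it contains."""
    allowed = set(dataset["classes"])
    results = []
    for text, meta in dataset["annotations"]:
        pairs = []
        for start, end, label in meta["entities"]:
            # Reject tags that are not declared in "classes".
            if label not in allowed:
                raise ValueError(f"unknown tag: {label}")
            # Assumption: the end position is exclusive.
            pairs.append((label, text[start:end]))
        results.append(pairs)
    return results


dataset = json.loads(SAMPLE)
print(extract_entities(dataset))
```

Slicing the message with the two integers is a quick way to check that an annotation really covers the phrase it is meant to tag.
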
Next, the dataset is divided into two parts: a training set and a test set. If the entity list of a message is empty, it is filled with the tuple [(0, 0, 'PEARSON')].
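
The split and the placeholder step could be sketched as follows. This is only an illustration: the function names, the 80/20 ratio, the fixed seed and the sample messages are assumptions, and the project may perform the two steps in a different order. Only the placeholder tuple is copied from the text above.

```python
import random

# Placeholder copied from the text above, used for messages without entities.
PLACEHOLDER = (0, 0, "PEARSON")


def fill_empty(annotations):
    """Give every message that has no entities the placeholder tuple."""
    filled = []
    for text, meta in annotations:
        entities = list(meta.get("entities", []))
        if not entities:
            entities = [PLACEHOLDER]
        filled.append((text, {"entities": entities}))
    return filled


def train_test_split(annotations, test_ratio=0.2, seed=42):
    """Shuffle deterministically, then cut the list into train and test parts.

    test_ratio and seed are illustrative values, not the project's settings.
    """
    items = list(annotations)
    random.Random(seed).shuffle(items)
    n_test = max(1, round(len(items) * test_ratio))
    return items[n_test:], items[:n_test]


# Hypothetical messages: every second one has no annotated entity.
messages = [
    ("msg %d" % i, {"entities": [] if i % 2 else [[0, 3, "TITLE"]]})
    for i in range(10)
]
train, test = train_test_split(fill_empty(messages))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the different models can be compared on the same test set.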

2.3 Configuration File

The SMS-NER-Dataset-165-Annotations folder contains the base_config.cfg configuration file, which is used to set up the model that will be trained on the dataset described above. To generate the full model configuration, we run the command:

python -m spacy init fill-config dataset/SMS-NER-Dataset-165-Annotations/base_config.cfg config.cfg

The training phase, followed by the testing phase, is then started by running the command:

python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev test.spacy --gpu-id 0

To conclude, we print the metrics produced by the best model by running the command:

python -m spacy benchmark accuracy model/large/model-best model/large/test.spacy --output --code --gold-preproc --gpu-id 0 --displacy-path model/large
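
For reference, the Precision, Recall and F1-Score of an entity recognizer can be computed as in the sketch below. It assumes that a predicted entity counts as correct only when its start, end and tag all match a gold annotation exactly; the function name prf and the sample spans are hypothetical, and the scores printed by the command above are aggregated over the whole test set rather than over a single message.

```python
def prf(gold, predicted):
    """Entity-level precision, recall and F1 with exact (start, end, tag) matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted entities for one message.
gold = [(12, 18, "OTP"), (34, 40, "MONEY"), (0, 4, "TITLE")]
pred = [(12, 18, "OTP"), (34, 40, "TIME")]

p, r, f = prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this example one of the two predictions is correct and one of the three gold entities is found, so precision is higher than recall.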

3 Report

The report can be found at the following link: Report.

4 Author & Contacts

Alberto Montefusco
Developer - Alberto-00
Email - a.montefusco28@studenti.unisa.it
LinkedIn - Alberto Montefusco
My WebSite - alberto-00.github.io

Alessandro Aquino
Developer - AlessandroUnisa
Email - a.aquino33@studenti.unisa.it
LinkedIn - Alessandro Aquino

Mattia d'Argenio
Developer - mattiadarg
Email - m.dargenio5@studenti.unisa.it
LinkedIn - Mattia d'Argenio
