Resources

In this project I produced several resources: a novel Portuguese corpus (SIGARRA News Corpus), NER models trained on the HAREM collection and on the SIGARRA News Corpus, and the HAREM collection converted into each tool's input format.

SIGARRA News Corpus

The SIGARRA News Corpus dataset, with entity annotations for 905 University of Porto news items. Check SIGARRA News Corpus for more info.

Ref: André Pires, Sérgio Nunes, José Devezas. SIGARRA News Corpus. https://rdm.inesctec.pt/dataset/cs-2017-004, jun 2017.

Trained NER models with HAREM

Pre-trained models for named entity recognition in Portuguese, using the categories, types and subtypes of the Second HAREM dataset as entity classes. Models are provided for Stanford CoreNLP, OpenNLP, spaCy and NLTK.

Ref: André Pires. HAREM NER Models for OpenNLP, Stanford CoreNLP, spaCy, NLTK. https://rdm.inesctec.pt/dataset/cs-2017-005, jun 2017.

Trained NER models with SIGARRA News Corpus

Pre-trained models for named entity recognition in Portuguese, using the following entity classes: Hora (Hour), Evento (Event), Organizacao (Organization), Curso (Course), Pessoa (Person), Localizacao (Location), Data (Date) and UnidadeOrganica (Organic Unit). Models are provided for Stanford CoreNLP, OpenNLP, spaCy and NLTK.

Ref: André Pires. SIGARRA News Corpus NER Models for OpenNLP, Stanford CoreNLP, spaCy, NLTK. https://rdm.inesctec.pt/dataset/cs-2017-006, jun 2017.
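
When post-processing the output of the SIGARRA models, it can help to map the Portuguese class names listed above to their English glosses. The snippet below is a minimal sketch, not part of the released resources: it assumes the models emit labels spelled exactly as in the list above, and both the `gloss_entities` helper and the example input are hypothetical.

```python
# Portuguese entity class -> English gloss, as listed in this section.
# Assumption: the trained models emit labels spelled exactly like these keys.
SIGARRA_LABELS = {
    "Hora": "Hour",
    "Evento": "Event",
    "Organizacao": "Organization",
    "Curso": "Course",
    "Pessoa": "Person",
    "Localizacao": "Location",
    "Data": "Date",
    "UnidadeOrganica": "Organic Unit",
}


def gloss_entities(entities):
    """Translate (text, label) pairs to (text, English gloss).

    Labels that are not in SIGARRA_LABELS are passed through unchanged.
    """
    return [(text, SIGARRA_LABELS.get(label, label)) for text, label in entities]


if __name__ == "__main__":
    # Hypothetical tagger output, for illustration only.
    example = [("Porto", "Localizacao"), ("Engenharia Informática", "Curso")]
    print(gloss_entities(example))
```

Passing unknown labels through unchanged keeps the helper usable with the HAREM models too, whose class inventory differs.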

HAREM collection in other formats

The HAREM collection converted into four formats, one for each tool (OpenNLP, Stanford CoreNLP, spaCy and NLTK), and divided into the three entity levels (categories, types and subtypes). Get them here.

Master's thesis

My master's thesis on this topic (NER for Portuguese), which explains all the development in detail.

Ref: André Pires (2017). Named entity extraction from Portuguese web text. Master's thesis, Faculty of Engineering, University of Porto. Retrieved from http://hdl.handle.net/10216/106094.