The exercise shows how we may extract elements such as names of companies, countries and similar objects from text.
Objectives (8 points):
- Read the documentation of the NER API in spaCy.
- Take 1,000 random passages from the FIQA-PL corpus.
- Use the lemmatized and sentence-split documents (using the spaCy API for lemmatization and sentence splitting) to identify the expressions that consist of consecutive
words starting with a capital letter (you will have to look at the inflected form of the word to check its
capitalization) that do not occupy the first position in a sentence. E.g. the sentence:
"Wczoraj w Krakowie miało miejsce spotkanie prezydentów Polski i Stanów Zjednoczonych."
should yield the following entries: Kraków, Polska, Stan Zjednoczony.
- Compute the frequency of each identified expression and print 50 results with the largest number of occurrences.
- Apply spaCy's NER to identify the named entities in the same set of documents.
- Plot the frequency (histogram) of the identified classes.
- Display the 10 most frequent named entities for each identified type.
- Display the 50 most frequent named entities, including their count and type.
- Display 5 sentences containing at least 2 recognized named entities with different types. Highlight the recognized spans with color. (For a demo application, Streamlit might be useful for displaying NER results.)
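
The capital-letter heuristic and the frequency count from the objectives above can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes each sentence has already been turned into a list of (inflected form, lemma) pairs, which in a spaCy pipeline would typically come from `token.text` and `token.lemma_` for the tokens of each sentence in `doc.sents`. The function name `capitalized_expressions` and the hand-written lemmas in `example` are hypothetical; a real lemmatizer may return different (for instance lowercased) lemmas.

```python
from collections import Counter


def capitalized_expressions(sentence):
    """Return lemmatized runs of consecutive capitalized tokens.

    `sentence` is a list of (inflected form, lemma) pairs for one sentence.
    The first token of the sentence is always skipped.
    """
    expressions, current = [], []
    for position, (text, lemma) in enumerate(sentence):
        # Capitalization is checked on the inflected form, not on the lemma.
        if position > 0 and text[:1].isupper():
            current.append(lemma)
        else:
            if current:
                expressions.append(" ".join(current))
            current = []
    if current:
        expressions.append(" ".join(current))
    return expressions


# Hand-annotated version of the example sentence (lemmas are assumptions).
example = [
    ("Wczoraj", "wczoraj"), ("w", "w"), ("Krakowie", "Kraków"),
    ("miało", "mieć"), ("miejsce", "miejsce"), ("spotkanie", "spotkanie"),
    ("prezydentów", "prezydent"), ("Polski", "Polska"), ("i", "i"),
    ("Stanów", "Stan"), ("Zjednoczonych", "Zjednoczony"), (".", "."),
]

counts = Counter(capitalized_expressions(example))
for expression, count in counts.most_common(50):
    print(expression, count)
```

Over the whole sample, the same `Counter` would be updated once per sentence, and `most_common(50)` would then give the 50 most frequent expressions.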
Answer the following questions (2 points):
- Which of the methods (counting expressions with capital letters vs. NER) worked better for the task of identifying proper names?
- What are the drawbacks of the method based on capital letters?
- What are the drawbacks of the method based on NER?
- Which of the coarse-grained NER groups has the best and which has the worst results? Try to justify this observation.
- Do you think NER is sufficient for identifying different occurrences of the same entity (e.g. consider "USA", "Stany Zjednoczone" and "Stany Zjednoczone Ameryki Północnej")? If not, can you suggest an algorithm or a tool that would be able to group such names together?
- Can you think of a real-world problem that would benefit the most from applying a Named Entity Recognition algorithm?
- Named entity recognition is a process aimed at identifying the entities mentioned in a text by determining their span and classifying each of them into one of a predefined set of types. The larger the number of types, the more difficult the problem is.
- Named entities are usually proper names and temporal expressions. They usually convey the most important information in text.
- The IOB format is typically used to tag named entities. The name (IOB) comes from the types of tokens (inside, outside, beginning).
The following example shows how the format works (one token and its tag per line):
W O
1776 B-TIME
niemiecki O
zoolog O
Peter B-PER
Simon I-PER
Pallas I-PER
dokonał O
formalnego O
...
- The set of classes used in NER is partially task-dependent. Some general classes, such as names of people or cities, are used universally, but categories such as references to legal regulations are specific to legal information systems.
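
To make the IOB format described above more concrete, the following sketch groups a sequence of (token, tag) pairs back into entity spans. It is only an illustration under stated assumptions: the function name `iob_to_spans` is hypothetical, and the handling of malformed input (an `I-` tag with no matching open span is ignored) is one possible design choice, not part of the format itself.

```python
def iob_to_spans(tagged_tokens):
    """Group (token, tag) pairs in IOB format into (entity text, type) spans."""
    spans, words, label = [], [], None
    for token, tag in tagged_tokens:
        # An I- tag of the same type continues the currently open span.
        if tag.startswith("I-") and words and tag[2:] == label:
            words.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if words:
            spans.append((" ".join(words), label))
            words, label = [], None
        # A B- tag opens a new span; O tags and stray I- tags are ignored.
        if tag.startswith("B-"):
            words, label = [token], tag[2:]
    if words:
        spans.append((" ".join(words), label))
    return spans


# The tagged fragment from the example above.
tagged = [
    ("W", "O"), ("1776", "B-TIME"), ("niemiecki", "O"), ("zoolog", "O"),
    ("Peter", "B-PER"), ("Simon", "I-PER"), ("Pallas", "I-PER"),
    ("dokonał", "O"), ("formalnego", "O"),
]

print(iob_to_spans(tagged))
```

The `B-` prefix is what allows two adjacent entities of the same type to be kept apart, which a plain inside/outside tagging could not express.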