All the results obtained with the material in this repository can be found in the Characters Extraction and Analytics section of the GOBBYKID Gitbook. For further information and other analyses, we suggest reading the following sections of this description or exploring the other pages of the website.
- Corpus: it contains two sub-directories (one for male authors and one for female authors) with the analyzed books.
- non_characters_csv: it contains the CSV files used to exclude names that are not actually characters' names.
The Python functions developed to extract the characters' names from the corpus of books are organized in two main files:
- charEx.py contains the functions used to process the text and to perform Named Entity Recognition. The characters' names are stored in a set, which is then processed by the functions in genderRec.py.
- genderRec.py contains the functions used to process the set of names in order to classify each one with a label expressing the character's gender.
The Natural Language Processing tasks needed for the characters' extraction have been performed with more than one library. In particular, a comparison of NLTK and spaCy showed that each library extracted characters that the other missed, and that each also returned different names that were not really characters. For this reason, both libraries have been used, and their results have been filtered by means of a more straightforward function that extracts proper nouns without performing a proper Named Entity Recognition.
- get_characters is the main function. It takes a book as input and preprocesses it by removing characters that can negatively affect the segmentation of the text into sentences. The book is then passed to three different functions that independently segment the text into sentences and tokens and extract characters' names. The result of each extraction is stored in a set. The union of the sets produced by NLTK (get_charaters_nltk) and spaCy (get_characters_spacy) is intersected with the set generated by the get_proper_nouns function. Before the intersection, however, the set generated by get_proper_nouns is filtered in two ways: 1) only the words that occur in the text with the first letter capitalized, and at least once not at the beginning of a sentence, are kept; 2) all the words that may refer to cities, nationalities, countries, and other similar categories are excluded by means of the check_names function.
- get_charaters_nltk and get_characters_spacy perform the Named Entity Recognition. Only the entities labeled as “PERSON” are included in the final set. The names of “animal characters” are also included under this label. Named Entity Recognition, however, is not a precise task, so the set also contains names of entities that are not characters. For the technical procedure used by the two libraries to extract the entities' names, we refer the reader to their documentation, linked above.
- get_proper_nouns aims at providing information about the probability of a word being a character's name or surname. In particular, each word of the text is stored in a dictionary together with three counts: how many times it occurs with the first letter capitalized, how many times it occurs in lowercase, and how many times it occurs capitalized but not at the beginning of a sentence. The keys of the dictionary are not always single words: when a full name is present in the text (e.g., John Doe), the full occurrence is stored instead. In this way, the names should match the ones extracted by the other functions. The segmentation of the text into sentences is done by means of the syntok_list_of_sentences function, in which the Syntok Segmenter is used.
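The way the three extractions are combined can be illustrated with a minimal sketch. It is not the code in charEx.py: the function names `count_capitalization`, `proper_noun_candidates`, and `combine_extractions` are hypothetical, sentences are assumed to be already segmented, and only single-word names are handled (the repository also stores full names such as John Doe as keys).

```python
import re
from collections import defaultdict


def count_capitalization(sentences):
    """Count, for each word, its capitalized, lowercase, and
    capitalized-but-not-sentence-initial occurrences.
    Keys are lowercased words (a simplification of this sketch)."""
    counts = defaultdict(
        lambda: {"capitalized": 0, "lowercase": 0, "capitalized_mid_sentence": 0}
    )
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z]+", sentence)
        for position, token in enumerate(tokens):
            entry = counts[token.lower()]
            if token[0].isupper():
                entry["capitalized"] += 1
                if position > 0:
                    entry["capitalized_mid_sentence"] += 1
            else:
                entry["lowercase"] += 1
    return counts


def proper_noun_candidates(sentences, excluded=frozenset()):
    """Keep words seen capitalized at least once away from the start of a
    sentence, minus an exclusion list (standing in for check_names)."""
    counts = count_capitalization(sentences)
    return {
        word.capitalize()
        for word, entry in counts.items()
        if entry["capitalized_mid_sentence"] >= 1
        and word.capitalize() not in excluded
    }


def combine_extractions(nltk_names, spacy_names, proper_nouns):
    """Union of the two NER results, intersected with the proper nouns."""
    return (set(nltk_names) | set(spacy_names)) & set(proper_nouns)


sentences = ["Alice met Bob in London.", "The rabbit saw Alice.", "Then Alice ran."]
candidates = proper_noun_candidates(sentences, excluded={"London"})
# Hypothetical NER outputs, including two false positives ("Then", "London").
characters = combine_extractions({"Alice", "Then"}, {"Bob", "London"}, candidates)
print(sorted(characters))
```

In this toy input, "Then" is discarded because it is capitalized only at the start of a sentence, and "London" is discarded by the exclusion list, so only names confirmed by both steps survive.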
Two different libraries have been used for the gender recognition as well: genderComputer and Genderize. Genderize is capable of determining the gender of some names that genderComputer is not able to classify. However, it works only through requests made to an API and has a daily limit of 1000 requests. Two solutions have been adopted to overcome this problem. The first consists in using genderComputer as the primary library and calling Genderize only for the names to which genderComputer cannot assign a label with certainty. With a few books, this solution is sufficient. However, with more than 200 texts, the daily limit of Genderize can still be exceeded. The second consists in extracting the names from the corpus more than once and, at each iteration of the code execution, storing the already classified names in two different CSV files (one for male and one for female characters). In the following iterations, these files are used to check whether the names extracted from a book have already been classified as female or male, without calling the methods provided by the libraries. At each iteration, the files are updated. Moreover, we decided to run the extraction on the split corpus (male authors' corpus and female authors' corpus) on two different computers in order to speed up the process (that is why there is more than one get_characters file in the repository). Each iteration was run at a different time (one per day for each sub-corpus), because the request limit resets on a daily basis.
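The CSV-based caching strategy can be sketched as follows. This is only an illustration under stated assumptions, not the repository's implementation: `load_cache`, `save_cache`, and `classify_with_cache` are hypothetical names, the CSV files are assumed to hold one name per row, and `classify` is a placeholder for whatever chain of library calls returns "female", "male", or None.

```python
import csv
from pathlib import Path


def load_cache(path):
    """Read one name per row from a CSV file; a missing file is an empty cache."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {row[0] for row in csv.reader(handle) if row}


def save_cache(path, names):
    """Write the names back, one per row, in a stable order."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for name in sorted(names):
            writer.writerow([name])


def classify_with_cache(names, female_path, male_path, classify, budget=1000):
    """Classify names, skipping cached ones and stopping at the request budget.

    `classify` is a hypothetical callable returning "female", "male", or None.
    Names left over when the budget runs out are returned as `pending`.
    """
    names = set(names)
    female, male = load_cache(female_path), load_cache(male_path)
    unknown, pending = set(), set()
    calls = 0
    for name in sorted(names):
        if name in female or name in male:
            continue  # already classified in a previous iteration
        if calls >= budget:
            pending.add(name)  # retry after the daily limit resets
            continue
        calls += 1
        label = classify(name)
        if label == "female":
            female.add(name)
        elif label == "male":
            male.add(name)
        else:
            unknown.add(name)
    save_cache(female_path, female)
    save_cache(male_path, male)
    return female & names, male & names, unknown, pending
```

Running the function again on the following day with the same files spends the budget only on the names still in `pending`, which mirrors the one-iteration-per-day schedule described above.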
- gender_recognition tries to identify the gender of each name contained in the set it takes as input (in this case, the one extracted by the functions of the previous file). In the set, names can be composed of a single word or of multiple words. However, while Genderize is able to work with full names, genderComputer works only with single-word names.
- Before the actual gender recognition, the names are divided into two lists (single-word and multiple-word names), and the single-word names that are most likely to be surnames are removed. This step is performed by the filter_surnames function, which compares the single-word names with the multiple-word ones: it removes the names that are composed of a single word but also appear as the last word of a multiple-word name. Then, a first attempt is made to recognize the gender of a character by looking for “gendered words” in the name, such as “Miss”, “Mr”, etc. If this is not possible, genderComputer tries to classify the single-word names, while the multiple-word ones are split into single tokens on which genderComputer operates iteratively until it provides a certain classification for at least one of them. When genderComputer does not provide any certain label for the name, Genderize takes its place.
- At the end, three lists are compiled: one for the female names, one for the male names, and one for the names whose gender has not been identified. Before a multiple-word name is appended to one of the lists, an attempt is made to avoid including similar occurrences of the same character's name. To this end, the function check_list_mwn is executed to check whether two names are equal once the first word of one or both of them is excluded. For example, John Doe, Mr. John Doe, and Dr. John Doe are likely to be the same character, so only one of these occurrences is kept.
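The surname filter, the “gendered words” check, and the duplicate check described in the list above can be sketched as follows. These are not the repository's filter_surnames and check_list_mwn functions: the names `drop_likely_surnames`, `gender_from_title`, and `deduplicate` are hypothetical, the title lists are an assumed, incomplete sample, and the rule that a first word is dropped only from names of three or more words is an assumption of this sketch (added so that John Doe and Jane Doe are not merged).

```python
# Assumed sample of gendered words; the repository's actual list may differ.
FEMALE_TITLES = {"miss", "mrs", "ms", "lady", "madam"}
MALE_TITLES = {"mr", "sir", "lord"}


def drop_likely_surnames(names):
    """Split names into single-word and multiple-word lists, removing
    single-word names that appear as the last word of a longer name."""
    multi = [name for name in names if len(name.split()) > 1]
    last_words = {name.split()[-1] for name in multi}
    single = [
        name for name in names
        if len(name.split()) == 1 and name not in last_words
    ]
    return single, multi


def gender_from_title(name):
    """Return "female" or "male" if the name contains a gendered word."""
    for word in name.split():
        token = word.rstrip(".").lower()
        if token in FEMALE_TITLES:
            return "female"
        if token in MALE_TITLES:
            return "male"
    return None


def name_variants(name):
    """The full name plus, for names of three or more words, the name
    without its first word (assumption of this sketch)."""
    words = name.split()
    variants = {tuple(words)}
    if len(words) > 2:
        variants.add(tuple(words[1:]))
    return variants


def deduplicate(names):
    """Keep only the first occurrence of names that share a variant."""
    kept = []
    for name in names:
        if not any(name_variants(name) & name_variants(other) for other in kept):
            kept.append(name)
    return kept


single, multi = drop_likely_surnames(["Doe", "Alice", "John Doe"])
print(single, multi)
print(gender_from_title("Miss Marple"), gender_from_title("John Doe"))
print(deduplicate(["John Doe", "Mr. John Doe", "Dr. John Doe", "Jane Doe"]))
```

Comparing tuples of words rather than raw strings avoids accidental matches on substrings, at the cost of missing pairs such as Mr. Doe and John Doe, which this sketch treats as distinct.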