Textual measures
- Stylistic
- Syntactic
- Readability
- Sentiment Analysis-based measures: Simple features, Complex measures
- EmotionArcs
- Perplexity
Quality proxies
Extra
Number of words in text.
Length of sentences in text measured in characters.
Mean Segmental Type-Token Ratio (MSTTR) is a measure of lexical richness. It divides the text into segments of a given size (here 100 words, often taken as standard), calculates the Type-Token Ratio for each segment, and then takes the average of the segment ratios across the whole text.
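A minimal sketch of the MSTTR computation described above, assuming a pre-tokenized text (the function name and the whitespace tokenization in the example are ours):

```python
def msttr(tokens, segment_size=100):
    """Mean Segmental Type-Token Ratio: average TTR over consecutive,
    non-overlapping segments of `segment_size` tokens."""
    n_segments = len(tokens) // segment_size  # only full segments are used
    ratios = []
    for i in range(n_segments):
        segment = tokens[i * segment_size:(i + 1) * segment_size]
        ratios.append(len(set(segment)) / len(segment))
    return sum(ratios) / len(ratios) if ratios else 0.0

# Example (whitespace tokenization for illustration only):
# msttr(open("novel.txt").read().lower().split())
```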
The type/token ratio of only the verbs in the text. To account for length, we computed the simple TTR over only the first 60,000 tokens of each text.
The type/token ratio of only the nouns in the text. To account for length, we computed the simple TTR over only the first 60,000 tokens of each text.
Compressibility of the text files, calculated by dividing the original bit-size of the text by the compressed bit-size (using bzip2 compression). We calculated the compression ratio (original bit-size / compressed bit-size) for the first 1,500 sentences of each text.
We use the bz2 module
Settings: bz2.compress(text.encode(), compresslevel=9)
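A minimal sketch of the ratio under the settings above (restricting the input to the first 1,500 sentences is omitted, and the function name is ours):

```python
import bz2

def compression_ratio(text):
    """Original bit-size divided by bz2-compressed bit-size (compresslevel=9)."""
    raw = text.encode()
    compressed = bz2.compress(raw, compresslevel=9)
    return (len(raw) * 8) / (len(compressed) * 8)
```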
Both word and bigram entropy were calculated using code adapted from Mark Algee-Hewitt's GitHub repository for the Stanford Literary Lab pamphlet 17. The code was modified to also assess entropy on the word basis (pamphlet 17 only includes the bigram basis).
Stopwords were not removed. Measures the "predictability"/amount of information of words or bigrams in the text.
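The exact implementation follows Algee-Hewitt's code; a minimal re-implementation of Shannon entropy over word and bigram distributions might look as follows (tokenization is assumed to have happened already, stopwords kept as stated above):

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def word_and_bigram_entropy(tokens):
    bigrams = list(zip(tokens, tokens[1:]))
    return shannon_entropy(tokens), shannon_entropy(bigrams)
```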
Range of tags extracted using the small spaCy model (en_core_web_sm); see the sketch after this list of tag-based features.
Adjective frequency of each text (not normalized, e.g., by wordcount)
Noun frequency of each text (not normalized)
Verb frequency of each text (not normalized)
Pronoun frequency of each text (not normalized)
Punctuation-mark frequency of each text (not normalized)
Stopword frequency of each text (not normalized)
Nominal subject frequency of each text (not normalized)
Passive auxiliary frequency of each text (not normalized)
Auxiliary frequency of each text (not normalized)
Relative clause modifier frequency of each text (not normalized)
Negation modifier frequency of each text (not normalized)
Simply the number of verbs divided by the number of nouns in a text (not normalized)
Simply the number of adverbs divided by the number of verbs in a text (not normalized)
The number of active verbs minus the number of passive verbs in a text (not normalized)
The number of passive verbs divided by the number of active verbs in a text (not normalized)
The number of adjectives plus the number of nouns, divided by the number of verbs in a text (not normalized) (see our recent paper for more on this metric for complexity estimation)
The frequency of personal pronouns (not normalized)
The frequency of function words (not normalized)
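A sketch of how the tag-based counts above can be extracted with spaCy. The mapping of feature names to part-of-speech and dependency labels (e.g. "auxpass", "relcl", "neg") is our assumption and may differ slightly between spaCy model versions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def tag_counts(text):
    """Raw (non-normalized) counts of POS and dependency tags in a text."""
    doc = nlp(text)
    pos = lambda label: sum(1 for t in doc if t.pos_ == label)
    dep = lambda label: sum(1 for t in doc if t.dep_ == label)
    return {
        "adjectives": pos("ADJ"),
        "nouns": pos("NOUN"),
        "verbs": pos("VERB"),
        "pronouns": pos("PRON"),
        "punctuation": pos("PUNCT"),
        "stopwords": sum(1 for t in doc if t.is_stop),
        "nominal_subjects": dep("nsubj"),
        "passive_auxiliaries": dep("auxpass"),
        "auxiliaries": dep("aux"),
        "relative_clause_modifiers": dep("relcl"),
        "negation_modifiers": dep("neg"),
    }
```

Ratios such as verbs/nouns or (adjectives + nouns)/verbs follow directly from these counts.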
A measure of readability based on the average sentence length (ASL) and the average number of syllables per word (word length, ASW), with a higher weight on the word length (Crossley et al., 2011). It should be noted that the weight on word length is higher in the Flesch Reading Ease score than in the Flesch-Kincaid Grade Level. It returns a readability score between 0 and 100, where higher scores mean easier text (Hartley, 2016). The formula is:
Flesch Reading Ease = 206.835 - (1.015 * sentence length) - (84.6 * word length)
Why it was selected: It is one of the most common scores and has in several publications been argued to be the best measure compared to other readability scores (see Hartley, 2016). It does not return a US grade (unlike other scores), which can be somewhat difficult to interpret, but instead returns a score of ease. In this measure, unlike all the other readability scores, higher means easier.
What to be aware of (also described in Hartley, 2016): The score might be outdated and has several issues, which also apply to other readability scores (Hartley, 2016): many syllables do not necessarily make a word more difficult to understand, the meaning of words is not taken into account, and there are individual differences between readers.
A revised version of the Flesch Reading Ease score. Like the former, it is based on the average sentence length (ASL) and the average number of syllables per word (ASW). It also weighs word length more than sentence length, but the weight is smaller than in the Flesch Reading Ease score. It returns a US grade level (Crossley et al., 2011). The formula is:
Flesch-Kincaid Grade Level = (0.39 * sentence length) + (11.8 * word length) - 15.59
Why it was selected: It is also one of the most common and traditional scores for assessing readability.
What to be aware of: See Flesch Reading Ease above. The score was initially developed for documents for the US Navy, so it might be questioned how well it applies to literature.
A readability score introduced by McLaughlin. It measures readability based on the average sentence length and the number of words with 3 or more syllables (polysyllables), and returns a US grade. Rather than using word length as a continuous measure, it defines all words with 3 or more syllables as polysyllables. It was developed as an easier (and more accurate) alternative to the Gunning Fog Index, and is based on the McCall-Crabbs Standard Test Lessons in Reading (Zhou et al., 2017). The formula is:
SMOG Index = 1.0430 * sqrt(number of polysyllables * (30 / number of sentences)) + 3.1291
Why it was selected: The main reason for selecting this measure was as a (better) alternative to the Gunning Fog Index, and as an alternative to the Flesch scores. The McCall-Crabbs Standard Test Lessons in Reading contain non-fiction as well as fiction texts, which might be relevant for the texts we are looking at.
What to be aware of: The SMOG Index is widely used for health documents, so it is unclear how accurate the score is when applied to literature. The McCall-Crabbs Standard Test Lessons in Reading have been revised multiple times, which means that the formula itself might also be inaccurate (Zhou et al., 2017).
A readability score based on the average sentence length and the number of characters per word (word length), which returns a US grade. Here, word length is not defined by the number of syllables but by the number of characters in the word. It was developed to test the readability of documents from the US Air Force, and was defined using 24 books and their associated grade levels (Zhou et al., 2017). The formula is:
ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
Why it was selected: It was mostly selected because it uses an alternative measure of word length compared to the Flesch scores and the SMOG Index.
What to be aware of: Since it was developed for rather technical documents, it may be debated how well it applies to literature.
A 1995 revision of the Dale-Chall readability score. It is based on the average sentence length (ASL) and the percentage of "difficult words" (PDW), defined as words that do not appear on the Dale-Chall word list of words that 80 percent of fifth-graders would know.
The Dale-Chall Readability Score also returns a US grade, but differs from all the other scores in that it does not determine the difficulty of words by their length but by a predefined list. The raw score is adjusted by adding 3.6365 if the share of difficult words (all words not on the list of familiar words) is above 5%. The formula for the raw score is as follows:
New Dale-Chall Readability Score = 0.1579 * (difficult words / words * 100) + 0.0496 * (words / sentences)
Why it was selected: This score was mainly selected because it addresses an issue shared by all the other scores, namely that long words are not necessarily difficult to understand (e.g. "interesting" is a long word, but may be familiar to many and thus easy to read).
What to be aware of: The list of familiar words may not apply to all students and genres of text. Since the list of familiar words is based on 5th-grade students, this index may be most relevant for that age group.
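The readability formulas above are available in off-the-shelf packages; a minimal sketch using the textstat package (an assumption for illustration, not necessarily the implementation used for the dataframe):

```python
import textstat

def readability_scores(text):
    """The five readability measures described above, as implemented by textstat."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "smog_index": textstat.smog_index(text),
        "automated_readability_index": textstat.automated_readability_index(text),
        "new_dale_chall": textstat.dale_chall_readability_score(text),
    }
```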
We extract the sentence-based sentiment arcs of the novels using the NLTK implementation of VADER, arguably one of the most widespread dictionary-based methods. We provide the full version of the arcs and their coarser-grained representation in twenty segments, as well as simpler and more complex features of the sentiment arcs of the novels.
These are based on simple scores of the VADER sentiment annotation for valence (a sketch follows the list of features below).
Mean sentiment of all sentences in text
SD of sentiment in text (sentence-based)
Mean sentiment of the last 10% of each text
Mean sentiment of the first 10% of each text
Difference in mean sentiment between the main chunk of the text and the last 10% of the text
List of sentiment valence means of each segment when splitting texts into 20 segments
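A sketch of how the sentence-based VADER arc and the simple features above can be computed with NLTK (the sentence tokenizer, the use of the compound score, and the handling of the 10% chunks are our assumptions; very short texts are not guarded against):

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()  # may require nltk.download("vader_lexicon") and "punkt"

def sentiment_features(text):
    """Sentence-based VADER arc plus the simple valence features listed above."""
    arc = np.array([sia.polarity_scores(s)["compound"] for s in sent_tokenize(text)])
    n10 = max(1, len(arc) // 10)  # number of sentences in a 10% chunk
    return {
        "mean_sent": arc.mean(),
        "std_sent": arc.std(),
        "end_sent": arc[-n10:].mean(),
        "begin_sent": arc[:n10].mean(),
        "difference_end_main": arc[:-n10].mean() - arc[-n10:].mean(),
        "arc_segments_20": [seg.mean() for seg in np.array_split(arc, 20)],
    }
```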
Compressibility of the sentiment arcs, calculated by dividing the original bit-size of the arcs by the compressed bit-size (using bzip2 compression).
We use the bz2 module
Settings: bz2.compress(text.encode(),compresslevel=9)
Hurst exponent of sentiment arcs, using Adaptive Filtering for detrending arcs. Details of the method are to be found in this 2021 paper and in a blogpost.
Same as above, but using the Syuzhet package (custom dictionary) to extract valence scores from sentences
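The adaptive-filtering detrending from the paper is not reproduced here; the following is only a rough stand-in sketch using the rescaled-range (R/S) Hurst estimator from the nolds package, which will generally not match the published values:

```python
import nolds

def hurst_of_arc(sentarc):
    """Rough stand-in: R/S Hurst exponent of a sentiment arc (list of floats)."""
    return nolds.hurst_rs(sentarc)
```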
Approximate entropy of the sentiment arcs, calculated per 2 sentences. Sentiment arcs were extracted with the VADER lexicon.
Approximate entropy is a technique used to quantify the amount of regularity and the unpredictability of fluctuations in time-series data.
We compute ApEn with NeuroKit2.
Settings: app_ent = nk.entropy_approximate(sentarc, dimension=2, tolerance='sd')
Same as above but using the Syuzhet package (custom dictionary) to extract valence scores from sentences
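A runnable version of the settings above (the VADER arc is extracted as in the sentiment sketch earlier; newer NeuroKit2 versions return a (value, info) tuple, which is unpacked defensively here):

```python
import neurokit2 as nk

def approximate_entropy(sentarc):
    """Approximate entropy of a sentiment arc with the settings listed above."""
    result = nk.entropy_approximate(sentarc, dimension=2, tolerance="sd")
    return result[0] if isinstance(result, tuple) else result
```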
Emotion arcs are available for the full corpus; they were extracted with a method combining the NRC emotion lexicon and word embeddings. The full dataset of arcs is available in our repository EmoArc.
Perplexity, for a well-trained model, can be used to approximate how surprising or complex a text is for humans. Our 2024 paper details the procedure for extracting perplexity scores and outlines possible applications of this measure for literary texts. See the paper for further details.
The perplexity as measured via the self-trained model (the smallest model, with controlled training - the texts of the corpus were definitely not in this model's training data)
The perplexity as measured via the small GPT-2 model
The perplexity as measured via the large GPT-2 model
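A minimal sketch of how perplexity can be computed with the small GPT-2 model via the Hugging Face transformers library. The actual procedure (chunking, stride, checkpoints, and the self-trained model) is described in the paper; the model id "gpt2" and the simple truncation below are assumptions for illustration:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text, max_tokens=1024):
    """Perplexity = exp(mean negative log-likelihood of the tokens under the model)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_tokens]
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()
```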
The quality metrics that we have collected belong to two main types: crowd-based metrics, representing the judgements of many unfiltered readers (scores, counts), and expert-based metrics, drawn from prestigious proxies curated by experts who are often institutionally affiliated (lists, series, etc.). It should be noted that this distinction is above all heuristic, as various metrics, such as translation counts, are subject both to expert choice and to the taste judgements of a larger readership.
These are all title-based (except for WIKI page rank)
“Libraries” corresponds to the number of library holdings as listed in WorldCat.
Number of ratings for title on Goodreads. Numbers retrieved in December 2022.
Average rating of title on Goodreads. Numbers retrieved in December 2022.
Average rating of title on Audible. From a large Audible dataset.
663 in Chicago
Number of ratings for title on Audible. From a large Audible dataset.
663 in Chicago
Category ("genre") assigned on Audible. From a large Audible dataset.
663 in Chicago
Number of translations for title as listed in Index Translationum, which lists translations in the period 1979-2019
5082 in Chicago > 0
NB. Author-based
An author's "PageRank Complete" at Wikipedia, based on data from the [World Literature group](https://arxiv.org/pdf/1701.00991.pdf), who used Wikipedia PageRanks. An author has a high PageRank if many other articles with a high PageRank link to it.
3558 in Chicago > 0
Distributions of ratings per book on GoodReads. Numbers retrieved in November 2023.
They are saved as a dictionary in each row, where, e.g., '5': 300 means that 300 ratings gave 5 stars, and likewise for '4', '3', and so on. Note: the keys are strings.
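Depending on how the dataframe is stored, these dictionaries may be read back as strings; a small example of parsing such a cell (that the cell is stored as a string literal is an assumption):

```python
import ast

def rating_distribution(cell):
    """Parse a stored distribution like "{'5': 300, '4': 120, ...}" and
    return the total number of ratings and the mean star value."""
    dist = ast.literal_eval(cell) if isinstance(cell, str) else cell
    total = sum(dist.values())
    mean = sum(int(stars) * count for stars, count in dist.items()) / total
    return total, mean
```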
Author-based Authors mentioned on the Goodreads-classics-list are marked 1.
62 in Chicago
Author-based Authors mentioned on The Best Books of the 20th Century list are marked 1.
44 in Chicago
Author-based Works that also appear in the top 1000 titles on the Opensyllabus list of English Literature are marked 1.
477 in Chicago
Author-based Authors mentioned in the 10th edition of the Norton Anthology of English Literature (British & American literature) are marked 1.
339 in Chicago
Author-based Authors mentioned in the 10th edition of the Norton Anthology of American Literature are marked 1.
62 in Chicago
Author-based Norton English and Norton American combined
401 in Chicago
Title-based Titles that have been published in the Penguin Classics series (1326 titles in total)
77 in Chicago
Author-based Authors that have been published in the Penguin Classics series (1326 titles in total)
335 in Chicago
Title-based
Extracted from: database of 20th-century American bestsellers via Publishers Weekly (1900-1999), collected by John Unsworth of the University of Illinois.
176 in Chicago
Title-based
Extracted from: database of New York Times Bestsellers (1931-2024) compiled by Hawes Publications.
154 in Chicago
Merged from the "PUBLISHERS_WEEKLY_BESTSELLERS" and the "NYT_BESTSELLERS"
228 in Chicago
Author-based Nobel Prize winners' works are marked 1.
85 in Chicago
Longlisted works Title-based Works shortlisted (winners) for the Pulitzer Prize are marked W, and works that were longlisted (finalists) are marked F.
53 in Chicago
Longlisted works for the National Book Award Title-based Works shortlisted (winners) for the NBA are marked W, and works that were longlisted (finalists) are marked F.
108 in Chicago
Longlisted works Title-based
(1953-2022) Works shortlisted (winners) for the Hugo Awards are marked W, and works that were longlisted (finalists) are marked F.
96 in Chicago
Shortlisted works (Scifi) Title-based Locus award for best scifi novel 1980-2022
12 in Chicago
Longlisted works (Scifi) Title-based
Nebula awards 1966-2022
92 in Chicago
Longlisted works (Scifi) Title-based
US Scifi award 1982-2022
4 in Chicago
Longlisted works (Scifi) Title-based
Scifi award 1973-2022
35 in Chicago
Longlisted works (Scifi) Title-based
US "libertarian" scifi award 1979-2022
20 in Chicago
Shortlisted works Title-based
5 in Chicago
Shortlisted works Title-based British Fantasy Awards (aka. the August Derleth Fantasy Award) 1972-2022
3 in Chicago
Longlisted works Title-based
Fantasy award 1975-2022
28 in Chicago
Longlisted works (Fantasy) Title-based
US fantasy award 1971-2022
5 in Chicago
Shortlisted works Title-based
Locus awards for horror fiction/dark fantasy (1989-2022)
5 in Chicago
Longlisted works Title-based
Award for dark & horror fiction (1987-2022)
14 in Chicago
Shortlisted works (Mystery (Crime, etc.))
10 in Chicago
Title-based
Combination of 'NEBULA', 'LOCUS_SCIFI', 'HUGO', 'PHILIP_K_DICK_AWARD', 'J_W_CAMPBELL_AWARD', 'PROMETHEUS_AWARD'
163 in Chicago
Title-based
Combination of 'BRAM_STOKER_AWARD', 'LOCUS_HORROR'
19 in Chicago
Title-based
Combination of 'LOCUS_FANTASY', 'BFA', 'WORLD_FANTASY_AWARD', 'MYTHOPOEIC_AWARDS'
40 in Chicago
Author-based
Combination of 'RITA_AWARDS_AUTH' or 'RONA_AWARDS_AUTH'
54 in Chicago
Consists of the Norton (both English and American), the OpenSyllabus, and the title-based Penguin Classics lists, indexing 1 for books in any of these three.
All literary prizes (i.e., Pulitzer and NBA)
All genre prizes taken together
Collected by using genderize
Just publication date by decade
IDs of the books as they are assigned on GoodReads. Retrieved in November 2023.
We extracted some measures that are not yet in the presented dataframe but that we plan to add soon. Here is an overview:
Most of the remaining textual measures delve into the semantic profile of the novels. To do an approximate semantic tagging of the novels, we used three different resources.
The Roget thesaurus is an old and rather prestigious division of the English vocabulary into semantic categories. Each word in the thesaurus is associated with one or more semantic classes, such as "dog" with ANIMAL, "happiness" with EMOTION, and so forth. In total, there are slightly more than 1,000 different classes. The Roget thesaurus was started at the beginning of the 19th century and is regularly updated. While there are several semantic labelling schemes and it is hard to decide which one is best - as they are all to an extent idiosyncratic - we chose the Roget due to its good performance in previous work on literary reception. Our experiments have also confirmed that the semantic profiles derived from the novels through this resource are quite valuable for regression and classification tasks.
We used WordNet to map each word in the novels to a set of hypernyms. WordNet is one of the best-known and most widely used resources for the semantic analysis of texts. It is based on cognitive-linguistic insights into how concepts might be organized in the mind and was built manually by expert lexicographers. It has been regularly updated and refined since the early 1990s. WordNet structures the lexicon in a tree-like architecture, where hypernyms (e.g. ANIMAL) work as "higher nodes" and have hyponyms (e.g. DOG) as their children.
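A minimal sketch of mapping a word to its hypernym chain with NLTK's WordNet interface (sense disambiguation is simplified here to the first, most frequent sense; the actual tagging procedure may differ):

```python
from nltk.corpus import wordnet as wn  # may require nltk.download("wordnet")

def hypernym_chain(word, pos=wn.NOUN):
    """Return the hypernym path (from root to word) for the most frequent sense."""
    synsets = wn.synsets(word, pos=pos)
    if not synsets:
        return []
    path = synsets[0].hypernym_paths()[0]  # walk from the root down to the synset
    return [s.name() for s in path]

# hypernym_chain("dog") -> ['entity.n.01', ..., 'canine.n.02', 'dog.n.01']
```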
We scraped the StoryGraph website to obtain a new wealth of information about the books' perception and reception by crowds of readers. The main elements we drew from StoryGraph are:
Just as on GoodReads, StoryGraph provides a synthetic scale of appreciation of the novel together with the number of people who rated it.
Each novel was manually assigned to one or more genres, allowing us to map all the titles to putative genre categories.
Readers tag the novels as slow, medium or fast-paced. It is possible to see the percentages of these three labels assigned to each novel.
In a similar fashion to Pace, readers tag the novels as more character- or story-driven. They also report on whether they found the character set diverse, whether the main character undergoes a substantial evolution, and whether they found the main characters likable.