Most novels are, in some way, a description of a social network. Bookworm ingests novels, builds a solid version of their implicit character network and spits out an intuitively understandable and deeply analysable graph.
- bookworm for the code itself.
- Notebooks including example usage (with a load of interwoven description of how the thing actually works), in jupyter notebook form. Start Here
- data for a description of how to get hold of data so that you can run bookworm yourself.
The bookworm('path/to/book.txt') function wraps the following steps into one simple command, allowing the entire analysis process to be run easily from the command line:
python run_bookworm.py --path 'path/to/book.txt'
- Add --d3 to format the output for interpretation by the d3.js force directed graph
- Add --threshold n, where n is an integer, to specify the minimum character interaction strength to be included in the output (default 2)
- Add --output_file 'path/to/file' to specify where the .json or .csv should be left
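As a rough illustration of how the flags above fit together, here is a minimal argparse sketch. It is not taken from run_bookworm.py; the function name build_parser, the help strings and the None default for --output_file are assumptions made for this example, and the real script may define its options differently.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags described above.
    # The actual run_bookworm.py may be implemented differently.
    parser = argparse.ArgumentParser(
        description="Build a character network from a novel"
    )
    parser.add_argument("--path", required=True,
                        help="path to the book's .txt file")
    parser.add_argument("--d3", action="store_true",
                        help="format output for a d3.js force directed graph")
    parser.add_argument("--threshold", type=int, default=2,
                        help="minimum interaction strength to include")
    parser.add_argument("--output_file", default=None,
                        help="where the .json or .csv should be left")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run anywhere.
args = build_parser().parse_args(
    ["--path", "path/to/book.txt", "--threshold", "3", "--d3"]
)
print(args.path, args.threshold, args.d3, args.output_file)
```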
Start by loading in a book
book = load_book('path/to/book.txt')
Split the book into individual sentences, sequences of n words, or sequences of n characters by respectively running
sequences = get_sentence_sequences(book)
sequences = get_word_sequences(book, n=50)
sequences = get_character_sequences(book, n=200)
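To make the three splitting strategies concrete, here is a small standalone sketch. These are not bookworm's own functions: the names sentence_sequences, word_sequences and character_sequences are hypothetical, the sentence splitter is deliberately naive, and the chunks are assumed to be consecutive and non-overlapping, which may not match what bookworm does.

```python
import re


def sentence_sequences(text):
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def word_sequences(text, n=50):
    # Consecutive, non-overlapping chunks of n whitespace-separated words.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


def character_sequences(text, n=200):
    # Consecutive, non-overlapping chunks of n characters.
    return [text[i:i + n] for i in range(0, len(text), n)]


sample = "Alice met Bob. Bob waved! Did Carol see them?"
print(sentence_sequences(sample))
print(word_sequences(sample, n=4))
print(character_sequences(sample, n=20))
```

Shorter sequences give a stricter notion of two characters being "together", while longer ones pick up looser associations at the cost of more noise.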
Manually input a list of character names or automatically extract a list of 'plausible' character names by respectively using
characters = load_characters('path/to/character_list.csv')
characters = extract_character_names(book)
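For a feel of what automatic extraction of 'plausible' names might involve, here is one very simple heuristic: count capitalised words that do not start a sentence. This is an assumption made for illustration, not a description of how extract_character_names() works; the function name plausible_names and the min_count parameter are hypothetical.

```python
import re
from collections import Counter


def plausible_names(text, min_count=2):
    # Capitalised words preceded by a lowercase letter, comma or semicolon
    # plus a space, i.e. words capitalised mid-sentence. Sentence-initial
    # words are skipped because they are capitalised regardless.
    candidates = re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", text)
    counts = Counter(candidates)
    # Keep only names seen at least min_count times, most frequent first.
    return [name for name, n in counts.most_common() if n >= min_count]


text = ("Then Alice met Bob. Later Bob saw Alice again, "
        "and Alice smiled at Carol.")
print(plausible_names(text))
```

A heuristic like this will also pick up place names and other proper nouns, so a manually curated character list is likely to give cleaner results.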
Find instances of each character in each sequence with find_connections(), enumerate their co-occurrences with calculate_cooccurence(), and transform that into a more easily interpretable format using get_interaction_df()
df = find_connections(sequences, characters)
cooccurence = calculate_cooccurence(df)
interaction_df = get_interaction_df(cooccurence, characters)
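The idea behind these three steps can be sketched in plain Python. This is not bookworm's implementation: count_cooccurrences and to_edge_list are hypothetical names, matching is done by simple case-insensitive substring search, and the "value" key for interaction strength is an assumption (only "source" and "target" appear in this README).

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sequences, characters):
    # For every sequence, find which characters are mentioned, then add
    # one to the count of every pair of characters found together.
    counts = Counter()
    for seq in sequences:
        lowered = seq.lower()
        # Substring matching is crude: 'Bob' would also match 'bobsled'.
        present = sorted(c for c in characters if c.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


def to_edge_list(counts, threshold=2):
    # Keep only pairs whose interaction strength is at least `threshold`.
    return [
        {"source": a, "target": b, "value": n}
        for (a, b), n in sorted(counts.items())
        if n >= threshold
    ]


sequences = ["Alice met Bob.", "Bob and Alice talked.",
             "Carol saw Bob.", "Alice slept."]
characters = ["Alice", "Bob", "Carol"]
counts = count_cooccurrences(sequences, characters)
edges = to_edge_list(counts, threshold=2)
print(edges)
```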
The resulting dataframe can be easily transformed into a networkx graph using
nx.from_pandas_dataframe(interaction_df,
source='source',
target='target')
From there, all sorts of interesting analysis can be done. See the project's associated jupyter notebooks and the networkx documentation for more details.
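As a small example of that kind of analysis, the sketch below builds a graph from a hand-made interaction dataframe and computes two simple measures. The data and the "value" column name are made up for illustration. Note that in networkx 2.1 and later, from_pandas_dataframe has been renamed to from_pandas_edgelist, which is what this sketch uses.

```python
import networkx as nx
import pandas as pd

# Toy interaction data; in practice this would come from get_interaction_df().
# The 'value' column name is an assumption made for this example.
interaction_df = pd.DataFrame({
    "source": ["Alice", "Alice", "Bob"],
    "target": ["Bob", "Carol", "Carol"],
    "value": [5, 2, 3],
})

G = nx.from_pandas_edgelist(interaction_df,
                            source="source",
                            target="target",
                            edge_attr="value")

# Degree centrality: the share of other characters each one interacts with.
centrality = nx.degree_centrality(G)

# The single strongest relationship in the book, as (source, target, value).
strongest = max(G.edges(data="value"), key=lambda edge: edge[2])

print(centrality)
print(strongest)
```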
I presented a bunch of this stuff at