A small side project to maybe find some interesting word usage patterns in the web serial Worm.

redoctopus/Worm-TFIDF

Worm-TFIDF

A small side project to maybe find some interesting word importance patterns in the web serial Worm.

Results over the full corpus are here.

There's some inexactness because of my naive filtering (accent marks are not processed correctly), but that shouldn't affect anything too much. I may or may not come back and fix it.

May contain spoilers for Worm.

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. In short, it's a way to measure how important a word is to a particular document (in this case, a chapter) relative to the whole corpus (in this case, all of Worm).

As an example, a few of the words with the highest tf-idf scores for the very first chapter, Gestation 1.1, are "madison," "juice," "lunch," "mr," and "gladly." On the other hand, the words with the highest tf-idf scores for Gestation 1.6 (when Armsmaster first shows up) are "armsmaster," "credit," "capture," "east," and "lung."

Slightly More Technical Details

Term frequency is calculated given a word and a document, such that tf(w,d) = (# of times w appears in d).

Inverse document frequency is calculated given a word, such that idf(w) = log(# of documents / # of documents containing w).

Then, tfidf(w,d) = tf(w,d) * idf(w).
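The three formulas above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from this repository: the function names, the list-of-tokens representation, and the toy documents are all assumptions made for the example. Returning 0.0 for a word that appears in no document is also an assumption, since the formula is undefined in that case.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """tf(w, d): raw count of how many times the word appears in the document."""
    return Counter(doc_tokens)[word]

def idf(word, docs):
    """idf(w): log(# of documents / # of documents containing w)."""
    containing = sum(1 for doc_tokens in docs if word in doc_tokens)
    if containing == 0:
        # Assumption: a word that appears nowhere gets 0.0 instead of a division error.
        return 0.0
    return math.log(len(docs) / containing)

def tfidf(word, doc_tokens, docs):
    """tfidf(w, d) = tf(w, d) * idf(w)."""
    return tf(word, doc_tokens) * idf(word, docs)

# Invented toy "chapters", already split into lowercase tokens.
docs = [
    ["the", "red", "the", "green"],
    ["the", "blue"],
    ["the", "green", "gray"],
]

print(tfidf("the", docs[0], docs))  # appears in every document, so idf is log(1)
print(tfidf("red", docs[0], docs))  # appears in one of three documents
```

Here "the" occurs twice in the first document but appears in all three, so its idf (and therefore its tf-idf) is zero, while "red" occurs once and only in that document.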

Words with the highest tf-idf scores in a chapter are the ones that appear disproportionately often in that chapter compared with the rest of Worm. Conveniently, words that are common across all chapters, such as stopwords (e.g. as, of, a, and), naturally rank low in the hierarchy: a word that appears in every chapter has idf = log(1) = 0, so it doesn't make it into the top 10.
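To make the ranking concrete, here is a rough sketch of picking the top-scoring words per chapter. It is not the repository's actual code: `top_words`, the dictionary layout, and the toy text are hypothetical, and ties are broken alphabetically purely so the output is deterministic.

```python
import math
from collections import Counter

def top_words(chapters, n=10):
    """Return {chapter name: [(word, score), ...]} with the n highest tf-idf words.

    `chapters` maps a chapter name to its list of tokens (an assumed layout).
    """
    num_docs = len(chapters)

    # Document frequency: in how many chapters does each word appear?
    doc_freq = Counter()
    for tokens in chapters.values():
        doc_freq.update(set(tokens))

    result = {}
    for name, tokens in chapters.items():
        counts = Counter(tokens)
        scores = {
            word: count * math.log(num_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; alphabetical order breaks ties (an arbitrary choice).
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[name] = ranked[:n]
    return result

# Invented toy text, not taken from Worm.
toy = {
    "1.1": "the juice and the lunch and the juice".split(),
    "1.2": "the armor and the halberd".split(),
    "1.3": "the lunch and the bus".split(),
}

top = top_words(toy, n=2)
for name, ranked in top.items():
    print(name, [word for word, _ in ranked])
```

In this toy corpus "the" and "and" appear in every chapter, so they score zero and fall below any word that is specific to one or two chapters.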

Results

For the tf-idf results per chapter, computed over the full corpus (such that each chapter is compared to every other chapter in Worm), click here.

Results of running on individual arcs (such that each chapter is only compared to other chapters in the same arc):

  1. Gestation
  2. Insinuation
  3. Agitation
  4. Shell
  5. Hive
  6. Tangle
  7. Buzz
  8. Extermination
  9. Sentinel
  10. Parasite
  11. Infestation
  12. Plague
  13. Snare
  14. Prey
  15. Colony
  16. Monarch
  17. Migration
  18. Queen
  19. Scourge
  20. Chrysalis
  21. Imago
  22. Cell
  23. Drone
  24. Crushed
  25. Scarab
  26. Sting
  27. Extinction
  28. Cockroaches
  29. Venom
  30. Speck
  31. Epilogue: Teneral
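The difference between the full-corpus run and the per-arc runs comes down to which set of chapters the idf is computed over. The sketch below illustrates that under some assumptions: chapters are keyed by hypothetical `(arc, chapter)` tuples, the helper names are made up for the example, and the tokens are invented placeholders rather than text from Worm.

```python
import math
from collections import Counter, defaultdict

def score_chapters(chapters):
    """tf-idf for every word in every chapter, with idf taken over `chapters` only."""
    num_docs = len(chapters)
    doc_freq = Counter()
    for tokens in chapters.values():
        doc_freq.update(set(tokens))
    return {
        key: {
            word: count * math.log(num_docs / doc_freq[word])
            for word, count in Counter(tokens).items()
        }
        for key, tokens in chapters.items()
    }

def score_per_arc(chapters):
    """Score each chapter against only the other chapters in the same arc."""
    by_arc = defaultdict(dict)
    for (arc, chapter), tokens in chapters.items():
        by_arc[arc][(arc, chapter)] = tokens
    scores = {}
    for arc_chapters in by_arc.values():
        scores.update(score_chapters(arc_chapters))
    return scores

# Invented placeholder tokens; only the arc names come from the list above.
toy = {
    ("Gestation", "1.1"): ["red", "green", "red"],
    ("Gestation", "1.2"): ["red", "blue"],
    ("Insinuation", "2.1"): ["gold", "blue"],
    ("Insinuation", "2.2"): ["gold", "gray"],
}

full_corpus = score_chapters(toy)
per_arc = score_per_arc(toy)

key = ("Gestation", "1.1")
print(full_corpus[key]["red"])  # "red" is in 2 of 4 chapters overall
print(per_arc[key]["red"])      # "red" is in 2 of 2 chapters within its arc
```

A word that runs through an entire arc can still stand out against the full corpus, but it scores zero in the per-arc run because every chapter in that arc contains it.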
