DataBased is a set of scripts that scrape hip-hop data (artists, songs, and lyrics) into a dataset you can use at your discretion.
If you do not want to build your own set, exported collections can be found in JSON format in the Raw_JSON
archive.
To build the set in MongoDB, see INSTRUCTIONS.md.

# Schema
- Artists
  - genres (array of strings)
  - related artists (array of artists with genres, names, and Spotify info; max 20)
  - Spotify ID (as "id")
  - ID on Genius
  - last.fm tags (count, URL to tag, tag name)
- Songs
  - title
  - URL to lyrics on Genius
  - Genius name of artist associated with song
  - Genius ID of song
- Lyrics
  - Genius ID of song
  - text
  - title of song
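
Since Songs and Lyrics both carry the Genius ID of the song, the two collections can be joined on it. Below is a minimal sketch of that join in Python. The field names `genius_id` and `text` are assumptions made for illustration (the list above describes the fields but does not give their exact keys), and `load_collection` and `join_songs_with_lyrics` are hypothetical helpers, not part of the DataBased scripts. The loader accepts either a JSON array or one JSON document per line, since the export layout is not specified here.

```python
import json


def load_collection(raw):
    """Parse an exported collection from a string.

    Accepts a JSON array or one JSON document per line
    (the actual export layout is an assumption).
    """
    raw = raw.strip()
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def join_songs_with_lyrics(songs, lyrics, key="genius_id"):
    """Attach lyric text to each song that has a matching lyrics document.

    `key` is a hypothetical field name for the Genius ID of the song;
    "text" is likewise an assumed key for the lyric text.
    """
    lyrics_by_id = {doc[key]: doc for doc in lyrics}
    joined = []
    for song in songs:
        match = lyrics_by_id.get(song[key])
        if match is not None:
            # Copy the song so the input documents are left unmodified.
            joined.append({**song, "text": match["text"]})
    return joined
```

Check the real key names in the exported files and pass the correct one as `key` before relying on this.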

In the Goodies folder, you will find word clouds generated using WordCloud.py, a graph of related artists generated in R, and samples of lyrics generated by neural networks trained on specific artists (using char-rnn).

# TO-DO:
- Scrape audio features for songs from Spotify
- Run my own analytics, including:
  - swearing metrics for songs/artists
  - unique word counts per artist for a fixed lyric sample size
  - references to places
  - etc.
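
As a rough illustration of the unique-word-count item above, here is a minimal Python sketch. It counts distinct words in the first `sample_size` words of an artist's combined lyrics, so that artists with different amounts of material are compared on the same footing. The function name, the tokenization rule, and the choice to return `None` for artists with too little text are all assumptions, not something this project has implemented.

```python
import re

# Lowercase words, optionally with one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def unique_words_in_sample(lyrics_text, sample_size):
    """Count distinct words among the first `sample_size` words.

    Returns None when the text has fewer than `sample_size` words,
    so short catalogs are skipped rather than compared unfairly.
    """
    words = WORD_RE.findall(lyrics_text.lower())
    if len(words) < sample_size:
        return None
    return len(set(words[:sample_size]))
```

To use it, concatenate the lyric text of all of an artist's songs and call the function with the same `sample_size` for every artist.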