Skip to content

micbuffa/WasabiDataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WASABI Song Corpus

This repository contains the files of the current version of the WASABI Song Corpus, the models we have built on it as well as updates.

Abstract

The WASABI Song Corpus is a large corpus of songs enriched with metadata extracted from music databases on the Web, and resulting from the processing of song lyrics and from audio analysis.

More specifically, given that lyrics encode an important part of the semantics of a song, we focus here on the description of the methods we proposed to extract relevant information from the lyrics, such as their structure segmentation, their topics, the explicitness of the lyrics content, the salient passages of a song and the emotions conveyed.

The corpus contains 1.73M songs with lyrics (1.41M unique lyrics) annotated at different levels with the output of the above mentioned methods. Such corpus labels and the provided methods can be exploited by music search engines and music professionals (e.g. journalists, radio presenters) to better handle large collections of lyrics, allowing an intelligent browsing, categorization and segmentation recommendation of songs.

Interactive explorer

The dataset can be explored using the WASABI Interactive Navigator. Beware that certain copyrighted data (ex: full length lyrics or full track audio files) are not accessible if you are not a member of the Wasabi project.

Overview

Dependencies

The packages present when successfully running the code are listed in the file pip list --local.

Usage examples

This Jupyter Notebook shows how to use the different resources.

Downloading lyrics

Initially, the songs lyrics were retrieved from the LyricsWikia service that helped bootstrap the WASABI project. However, we cannot redistribute the lyrics since these are copyrighted material.

As of 2020, LyricsWikia is no longer availabile. Some NLP researchers managed to use our ML models on full lyrics that they obtained from other sources, in particular the commercial MusixMatch service (that provides large parts of the lyrics for free) or other online web site from where they could scrap the lyrics.

We have plans to complete the dataset in the next three years (starting 2021), and we have already written scripts that use MusixMatch to perform lyrics analysis on the upcoming new content.

Citation

If you use our resource, please cite the following articles:

incollection{buffa:hal-03282619,
  TITLE = {{The WASABI Dataset: Cultural, Lyrics and Audio Analysis Metadata About 2 Million Popular Commercially Released Songs}},
  AUTHOR = {Buffa, Michel and Cabrio, Elena and Fell, Michael and Gandon, Fabien and Giboin, Alain and Hennequin, Romain and Michel, Franck and Pauwels, Johan and Pellerin, Guillaume and Tikat, Maroua and Winckler, Marco},
  URL = {https://hal.science/hal-03282619},
  BOOKTITLE = {{The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol 12731.}},
  PAGES = {515-531},
  YEAR = {2021},
  MONTH = May,
  DOI = {10.1007/978-3-030-77385-4\_31},
  KEYWORDS = {Music metadata ; Lyrics analysis ; Named entites ; Linked data},
  PDF = {https://hal.science/hal-03282619/file/camera_ready.pdf},
  HAL_ID = {hal-03282619},
  HAL_VERSION = {v1},
}

@article{fell2019love,
  title={Love Me, Love Me, Say (and Write!) that You Love Me: Enriching the WASABI Song Corpus with Lyrics Annotations},
  author={Michael Fell and Elena Cabrio and Elmahdi Korfed and Michel Buffa and Fabien Gandon},
  journal={arXiv},
  year={2019},
  volume={abs/1912.02477}
}

WASABI RDF Knowledge Graph

The WASABI RDF Knowledge Graph provides an RDF representation of songs, artists and albums, together with the information automatically extracted from lyrics and audio content. The dataset and ontology have the same root namespace: http://ns.inria.fr/wasabi/. All URIs are dereferenceable.

The dataset itslef is identified by URI http://ns.inria.fr/wasabi/wasabi-2-0. It comes with DCAT, VOID and SPARQL-SD descriptions.

It leverages the WASABI ontology that reuses classes and properties from other vocabularies. Not all the terms needed to describe resources were imported in the ontology. As a result, the resource descriptions use terms from multiple vocabularies whose namespaces and prefixes are given below.

@prefix af:      <http://purl.org/ontology/af/> .
@prefix chord:   <http://purl.org/ontology/chord/> .
@prefix dbo:     <http://dbpedia.org/ontology/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix mo:      <http://purl.org/ontology/mo/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema:  <http://schema.org/> .
@prefix wsb:     <http://ns.inria.fr/wasabi/ontology/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

Song metadata

Song URIs are formatted as http://ns.inria.fr/wasabi/song/song_id where song_id is the song's WASABI unique identifier.

Each song is linked to its artist and the album in which it appears as follows:

  • mo:performer <http://ns.inria.fr/wasabi/artist/artist_id>
  • schema:album <http://ns.inria.fr/wasabi/album/album_id>

Among the song metadata, we find:

  • title (dcterms:title)
  • detected language (wsb:language_detected)
  • sound gain (wsb:gain)
  • number of explicit lyrics in the song (wsb:explicit_lyrics_count)
  • chords sequence (wsb:chord_sequence)
  • chords sequence confidence (af:confidence)
  • bpm (mo:bpm)

The ontology folder provides an example of the RDF Turtle representation of "Bad" by Mickael Jackson.

Artist metadata

Artist URIs are formatted as http://ns.inria.fr/wasabi/artist/artist_id where artist_id is the artist's WASABI unique identifier.

We distinguish 4 types of artists in the dataset:

  • wsb:Artist_Person equivalent to class mo:SoloMusicArtist of the Music Ontology,
  • wsb:Artist_Group equivalent to class mo:MusicGroup of the Music Ontology,
  • wsb:Choir and wsb:Orchestra that are subclasses of mo:MusicArtist of the Music Ontology.

In the case where the artist is a group, it is made up of members that we represent in this way:

  • schema:members <http://ns.inria.fr/wasabi/artist/artist_id>

The ontology folder provides an example of the RDF Turtle representation of Mickael Jackson.

Album metadata

Album URIs are formatted as http://ns.inria.fr/wasabi/album/album_id where album_id is the album's WASABI unique identifier.

Each album is linked to its artist as follows:

  • mo:performer <http://ns.inria.fr/wasabi/artist/artist_id>

The ontology folder provides an example of the RDF Turtle representation of albnum "HIStory" by Mickael Jackson.

Downloading and SPARQL Querying

The dataset is downloadable, as an RDF dump (in Turtle syntax) and JSON dump, from Zenodo: DOI

It can also be queried through our Virtuoso OS SPARQL endpoint http://wasabi.inria.fr/sparql.

You may use the Faceted Browser to look up text or URIs.

The following named graphs can be queried from our SPARQL endpoint:

Named graph Description
http://ns.inria.fr/wasabi/ontology/ WASABI ontology
http://ns.inria.fr/wasabi/graph/metadata dataset description (DCAT, VOID, SPARQL SD)
http://ns.inria.fr/wasabi/graph/artists artists metadata (name, genre, record label, web pages etc.)
http://ns.inria.fr/wasabi/graph/albums albums metadata (title, publication date, length etc.)
http://ns.inria.fr/wasabi/graph/songs songs metadata (title, album, publication date, etc.), chords
http://ns.inria.fr/wasabi/graph/songs-extd songs extended information: topics, emotion tags, social tags, emotion valence and arrousal

License

Creative Commons License

The Wasabi dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The code used to produce the dataset, provided in folder src, is licensed under the Apache License, Version 2.0.