-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Havelovewilltravel is a project to extract, automatically aggregate (deduplication, data normalisation, ...) and manually clean ConcertAnnouncements from gigfinder platforms such as Facebook, SongKick, BandsInTown and Setlist.fm. Artists and their references to these gigfinder platforms are maintained via Musicbrainz.
The aforementioned gigfinder platforms contain ConcertAnnouncements in great amounts. Some of the announcements are duplicated across the platforms as artists, their management or the concert halls promote the event. Havelovewilltravel does a best effort to automatically de-duplicate and normalise such ConcertAnnouncements into Concerts, which may then refer to (multiple) ConcertAnnouncement(s).
Simultaneously, other information is also subjected to automatic best efforts to maintain data quality, i.e. Venue, Organisation and Location information. However, despite these automatic data quality rules, some manual quality assurance is needed.
The speed and quantity with which the information can be aggregated prohibits us from maintaining the data in spreadsheets. Therefore, we are developing a Data Management Tool for ConcertAnnouncements.
"hlwtadmin" is a Python on Django web application that handles
- the automatic extraction of concert announcements (via APIs and screenscraping) from gigfinder websites,
- a rule engine to improve data quality semi-automatically, and
- a dashboard and tooling for manual Quality Assurance.
Future developments should envisage
- decoupling the rule engine from the interface
- LOD endpoint
- JSON-LD schema in html pages
- REST-API
The model consists of two levels that are intertwined. At the bottom, we have a "low leve" data model that is as close as possible to the concertannouncement data that we can capture from the data providing websites such as Songkick or Bandsintown. At the top, we have a "high level" model that aims to model clean data.
To describe the raw data as it comes in from the gigfinder websites, we only need a very simple model, with at its core, these three entities:
- ConcertAnnouncement: a simplified schema for capturing the temporal information of a concert, and link it to an artist and a venue
- Artist: basically a copy of a MusicBrainz Artist ID
- Venue: a string representation of the raw venue information
Hereby, a ConcertAnnouncement is related to one Artist and one Venue, through a ForeignKey Relation.
To keep track of the source of the low-level ConcertAnnouncement, we need some additional models. The philosophy behind the higher level data model is inspired by Musicbrainz. We work with a number of core entities, and there are relations possible between all core entities.
The core entities are
- Artist: which is the same as the Artist model for the low-level data
- Concert: a thin model for holding temporal information about a concert, and which serves as a crossroads for the relations (see below)
- Organisation: a model to hold information about venues, festivals, arenas, etc. (these types of organisations are available as Organisation Types)
- Location: the geographical information. For reasons of normalisation, we also employ a Country model.
The core entities are among each other related via seperate tables:
- RelationArtistArtist: to express artist to artist relations, e.g. type "also performs as"
- RelationConcertArtist: to express concert to artist relations e.g. type "main artist"
- RelationConcertOrganisation: to express concert to organisation relations, e.g. type "was held at"
- RelationConcertConcert: to express concert to concert relations, e.g. type "support of"
Relations between Concerts and Artists/Organisations contain a field "credited as". This is useful for expressing that an Artist with an "official name" X performs at a certain concert as Y, e.g. "Hi Hawaii" performed a concert as "Geroezemoes".
These two levels do not live independently from each other. Several links exist, foremost:
- Artist: this model is shared between the two levels.
Then there are a number of ForeignKey relations:
- ConcertAnnouncement > Concert: each concertannouncement should be related to exactly one Concert through a foreign key.
- Venue > Organisation: a venue as reported on the gigfinder website should resolve to exactly one Organisation through a foreign key.
- Venue > Location: a venue as reported on the gigfinder website should also be related to exactly on Location through a Foreign Key.
- Organisation > Location: an organisation is located at exactly one Location. The shared relation between an Organisation and a Venue with a Location is helpful for the deduplication automatization, explained elsewhere.
Below, you find a simplified ER diagram of the application, which visualizes the structure explained above.
TODO: added until_date to concert announcement and concert, and a is_festival boolean to concertannouncement.
The code for this visualisation is as follows:
Table GigFinder {
id int [pk]
name varchar
base_url varchar
api_key varchar
}
Table Artist {
mbid varchar [pk]
name varchar
disambiguation int
include bool
genre_id int [ref: > Genre.id]
}
Table GigFinderUrl {
id int [pk]
gigfinder_id int [ref: > GigFinder.id]
artist_id int [ref: > Artist.mbid]
url varchar
last_confirmed_by_musicbrainz datetime
last_synchronized datetime
}
Table ConcertAnnouncement {
id int [pk]
title varchar
date date
time time
gigfinder_id int [ref: > GigFinder.id]
raw_venue_id int [ref: > Venue.id]
artist_id int [ref: > Artist.mbid]
gigfinder_concert_id int
concert_id int [ref: > Concert.id]
last_seen_on datetime
seen_count int
ignore bool
created_at datetime
updated_at datetime
latitude float
longitude float
}
Table Venue {
id int [pk]
raw_venue varchar
raw_location varchar
organisation_id int [ref: > Organisation.id]
non_assignable bool
}
Table Concert {
id int [pk]
title varchar
date date
time time
cancelled bool
verified bool
ignore bool
created_at datetime
updated_at datetime
genre_id int [ref: > Genre.id]
latitude float
longitude float
annotation varchar
}
Table Organisation {
id int [pk]
name varchar
disambiguation varchar
latitude float
longitude float
location_id int [ref: > Location.id]
organisation_type_id int [ref: > OrganisationType.id]
start_date datetime
start_date_precision int
end_date datetime
end_date_precision int
website varchar
verified bool
genre_id int [ref: > Genre.id]
active bool
capacity varchar
annotation varchar
address varchar
}
Table OrganisationType {
id int [pk]
name varchar
}
Table Location {
id int [pk]
city varchar
zipcode varchar
latitude float
longitude float
country_id int [ref: > Country.id]
subcountry varchar
verified bool
}
Table RelationConcertArtist {
id int [pk]
artist_id int [ref: > Artist.mbid]
artist_credited_as varchar
concert_id int [ref: > Concert.id]
relation_type_id int [ref: > RelationConcertArtistType.id]
}
Table RelationConcertArtistType {
id int [pk]
name varchar
}
Table RelationConcertOrganisation {
id int [pk]
relation_type_id int [ref: > RelationConcertOrganisationType.id]
concert_id int [ref: > Concert.id]
organisation_id int [ref: > Organisation.id]
organisation_credited_as varchar
verified bool
}
Table RelationConcertOrganisationType {
id int [pk]
name varchar
}
Table RelationArtistArtist {
artist_a int [ref: > Artist.mbid]
artist_b int [ref: > Artist.mbid]
start_date date
start_date_precision int
end_date date
end_date_precision int
relation_type_id int [ref: > RelationArtistArtistType.id]
}
Table RelationArtistArtistType {
id int [pk]
name varchar
}
Table RelationOrganisationOrganisation {
organisation_a int [ref: > Organisation.id]
organisation_b int [ref: > Organisation.id]
start_date date
start_date_precision int
end_date date
end_date_precision int
relation_type_id int [ref: > RelationOrganisationOrganisationType.id]
}
Table RelationOrganisationOrganisationType {
id int [pk]
name varchar
}
Table RelationConcertConcert {
concert_a int [ref: > Concert.id]
concert_b int [ref: > Concert.id]
start_date date
start_date_precision int
end_date date
end_date_precision int
relation_type_id int [ref: > RelationConcertConcertType.id]
}
Table RelationConcertConcertType {
id int [pk]
name varchar
}
Table Country {
id int [pk]
name varchar
iso_code varchar
}
Table Genre {
id int [pk]
name varchar
}
Data model
- Concert
- Artist
- ConcertAnnouncement
- Country
- Genre
- GigFinder
- GigFinderURL
- Location
- Organisation
- Organisation types model
- Venue
Relations
- Relation Concert - Artist
- Relation Concert - Organisation
- Relation Concert - Concert
- Relation Artist - Artist
- Relation Organisation - Organisation
- Relation Organisation - External Identifier
Semantics
- Organisation types semantics
- Concert - Organisation semantics
- Concert - Artist semantics
- Artist - Artist semantics
- Concert - Concert semantics
- Organisation - Organisation semantics
Merge functionalities
Automation
- Venue to Organisation
- Venue to Location
- Concert Announcement to Concert
- Delete concerts and ignore ConcertAnnouncements
- Delete concerts and delete ConcertAnnouncements
QA Lists and procedures
- Concerts without Organisations
- Concerts without Artists
- Organisations without Concerts
- Concerts with multiple organisations in different locations
- Organisations without latitude and longitude
- Organisations without genre
- Organisations without disambiguation
- Concerts without latitude and longitude
- Concerts without genre
- Concerts without titles
- Artists without genre
- Announcements wo/ concert
- Concerts wo/ announcements
- Venues wo/ organisation
- Organisations wo/ venues
- Organisations without locations
- Organisations without concerts
- Cities without countries
Batch operations
Development