This software is released under a Creative Commons "CC0" Public Domain Dedication; see https://creativecommons.org/publicdomain/zero/1.0/ for more info.
Discogs artist, label, release, and master release data is published monthly as huge XML files, which anyone can download from https://data.discogs.com/.
This Python script parses the Discogs release data using the ultrafast cElementTree API which comes with Python 2.5 and up. The script automatically handles compressed and uncompressed data, and the old style of dump which had no root element.
To try it out, get one of the release data dump files and just run the script, passing the dump file path as the only argument. For example, if the script and the dump file named discogs_20191101_releases.xml.gz are in the same directory:
python parse_discogs_dump.py discogs_20191101_releases.xml.gz
By default, a dot will be printed to the screen for every 1000 'release' elements read. At the end it tells you how much time it took. If you interrupt it, it tells you the last release ID it saw. The parsed data is mostly ignored; the idea is just to successfully read the XML, building a temporary tree for each 'release' element.
As of 2018, on my 3.1 GHz Intel Core i5-2400 system (using only 1 of 4 cores), it takes 61 minutes to plow through the 6.0 GB gzipped release data XML and print the dots, yet it only ever uses about 17 MB of memory. It could run faster if a temporary tree were not built and discarded for each 'release', but I feel it is a better benchmark this way.
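The low memory use comes from streaming the XML rather than loading it all at once. A minimal sketch of that general technique, assuming Python 3 and the standard xml.etree.ElementTree.iterparse API, might look like the following. It is not the code from parse_discogs_dump.py: the names open_dump and count_releases are made up for illustration, and unlike the real script it does not handle the old dumps that lack a root element.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def open_dump(path):
    """Open a dump file for reading bytes, gzipped or not (hypothetical helper)."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")


def count_releases(stream, report_every=1000):
    """Count 'release' elements, keeping only one of them in memory at a time."""
    count = 0
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "release":
            # elem is now a complete temporary tree for one release.
            count += 1
            if count % report_every == 0:
                print(".", end="", flush=True)
            root.clear()  # drop the finished release so memory use stays flat
    return count


# Tiny in-memory demo; a real run would pass open_dump(path) instead.
sample = io.BytesIO(
    b'<releases><release id="1"><title>A</title></release>'
    b'<release id="2"><title>B</title></release></releases>'
)
print(count_releases(sample))  # prints 2
```

Clearing the root after each finished 'release' is what keeps the tree from growing; without it, every parsed release would stay attached to the root until the end of the file.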
If you uncomment one line of code near the end, then instead of a dot, you can get a complete XML fragment for every 1000th release element read.
There is no need to modify parse_discogs_dump.py directly. You can write your own code to handle each 'release' element, which will be an instance of ElementTree.Element. Your code needs to just do the following:
- Import ElementProcessor and process_dump_file() from parse_discogs_dump.py.
- Define a subclass of ElementProcessor with, at a minimum, a process() method to handle each 'release' element (which will be an instance of ElementTree.Element) in whatever way you want.
- Pass the dump file path and an instance of your subclass to process_dump_file().
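Those three steps could be sketched roughly as follows, assuming Python 3. The class TitleCounter is hypothetical, and so are the details it relies on: that process() receives the element as its only argument, that ElementProcessor can be constructed with no arguments, and that each 'release' has an id attribute and a title child. The fallback definitions in the except branch are stand-ins that only let the sketch run without the real module; the real implementations may differ.

```python
import xml.etree.ElementTree as ET

try:
    # Step 1: import the two names from the real script.
    from parse_discogs_dump import ElementProcessor, process_dump_file
except ImportError:
    # Hypothetical stand-ins, used only when parse_discogs_dump.py is absent.
    class ElementProcessor:
        def process(self, elem):
            pass

    def process_dump_file(path, processor):
        for _event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "release":
                processor.process(elem)
                elem.clear()


# Step 2: subclass ElementProcessor and override process().
class TitleCounter(ElementProcessor):
    """Counts releases and records the IDs of those with a blank title."""

    def __init__(self):
        super().__init__()  # assumes the base class needs no arguments
        self.count = 0
        self.untitled_ids = []

    def process(self, elem):
        # elem is assumed to be one complete 'release' element.
        self.count += 1
        title = elem.findtext("title")
        if not title or not title.strip():
            self.untitled_ids.append(elem.get("id"))


# Step 3 (not run here): pass the dump path and an instance.
# counter = TitleCounter()
# process_dump_file("discogs_20191101_releases.xml.gz", counter)
# print(counter.count, counter.untitled_ids)
```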
For example: find_invalid_release_dates.py is a script which does exactly those things. It can be run like this:
python find_invalid_release_dates.py discogs_20191101_releases.xml.gz > report.txt
Every time it finds a non-empty release date which does not match the patterns #### or ####-##-## with a non-zero month value, it will print a dot to the screen, and the output file report.txt will get a line like this:
https://www.discogs.com/release/41748 - release date is "?"
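The date rule described above can be expressed as a single regular expression. This is only a sketch of the rule as stated here, assuming Python 3; the function name is_valid_release_date is made up, and the actual checks in find_invalid_release_dates.py may be stricter or looser.

```python
import re

# Four digits alone, or ####-##-## where the month is anything but "00".
VALID_DATE = re.compile(r"[0-9]{4}(-(?!00)[0-9]{2}-[0-9]{2})?")


def is_valid_release_date(text):
    """Return True if text is empty or matches one of the accepted patterns."""
    if not text:
        return True  # empty dates are not reported
    return VALID_DATE.fullmatch(text) is not None


for sample in ["1999", "1999-05-00", "1999-00-00", "?"]:
    print(sample, is_valid_release_date(sample))
```

Note that this sketch only rejects a month of 00; it does not check that the month or day falls within a calendar range.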
Some Discogs dump files contain errors in the XML. If you get an error message about the XML not being well-formed, you will have to fix the dump file. For example, you might need to remove the control characters which are forbidden in XML:
gzcat discogs_20080309_releases.xml.gz | tr -d '\1\2\3\4\5\6\7\10\13\14\16\17\20\21\22\23\24\25\26\27\30\31\32\33\34\35\36\37\177\200\201\202\203\204\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237' | gzip -9 > discogs_20080309_releases.fixed.xml.gz
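On systems without gzcat or tr, roughly the same cleanup could be done in Python 3. The sketch below deletes the same 60 byte values as the tr command above; the names strip_forbidden_bytes and fix_dump are made up for illustration. Like the tr command, it works on raw bytes, so it also removes bytes in the 0x80-0x9F range that belong to multi-byte UTF-8 characters.

```python
import gzip

# Same byte values as the tr command: octal 1-10, 13, 14, 16-37, 177,
# 200-204, and 206-237.
FORBIDDEN = bytes(
    list(range(0x01, 0x09))
    + [0x0B, 0x0C]
    + list(range(0x0E, 0x20))
    + [0x7F]
    + list(range(0x80, 0x85))
    + list(range(0x86, 0xA0))
)


def strip_forbidden_bytes(chunk):
    """Return chunk with every byte listed in FORBIDDEN removed."""
    return chunk.translate(None, FORBIDDEN)


def fix_dump(src_path, dst_path, chunk_size=1 << 20):
    """Copy a gzipped dump to a new gzipped file, filtering it chunk by chunk."""
    with gzip.open(src_path, "rb") as src, \
            gzip.open(dst_path, "wb", compresslevel=9) as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(strip_forbidden_bytes(chunk))


# Example call (not run here):
# fix_dump("discogs_20080309_releases.xml.gz",
#          "discogs_20080309_releases.fixed.xml.gz")
```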
I am user mjb on Discogs. Feel free to contact me there via private message, or in the API forum.