
2. Basic Usage

Liam Bindle edited this page Feb 17, 2022 · 14 revisions

In this section we will go through basic usage of the bashdatacatalog to download several data collections using a catalog file. Before we begin, here is an overview of some terminology:

data collection: A data directory. A data collection may have any number of files, any types of files, and have subdirectories.

catalog file: A file that groups data collections together and records some details about each one. A catalog file includes (1) the local paths to data collections, (2) the URLs of the data sources, and (3) boolean flags to enable/disable data collections.


Step 1: Download a catalog file

Download catalog1.csv. This file has the details of the data collections we will download in the following steps. You can download this file in your terminal like so:

$ # Download the example catalog file (catalog1.csv)
$ curl https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/catalog1.csv -o catalog1.csv
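
If you script this step, it can help to make curl fail loudly on HTTP errors and to confirm that the downloaded file is not empty. The following is a minimal sketch; `download_catalog` is a hypothetical helper name (it is not part of bashdatacatalog), while `-f`, `-sS`, `-L`, and `-o` are standard curl options.

```shell
# download_catalog URL DEST
# Hypothetical helper (not part of bashdatacatalog): download URL to DEST
# and fail if the transfer fails or the resulting file is empty.
download_catalog() {
    url=$1
    dest=$2
    # -f: fail on HTTP errors, -sS: quiet but still print errors,
    # -L: follow redirects, -o: write the response to DEST
    curl -fsSL "$url" -o "$dest" || return 1
    # -s is true only if DEST exists and is larger than zero bytes
    [ -s "$dest" ]
}
```

For example: `download_catalog https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/catalog1.csv catalog1.csv`.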

You might notice that this is a CSV file. You can open it with a text editor to see what it looks like. Column 1 holds the paths to the data collections we will soon download, column 2 holds the URLs of the collections' sources, and column 3 holds boolean flags that you can edit to enable or disable each collection.
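
To make the three-column layout concrete, here is a rough sketch that prints the local path of every enabled collection. It rests on assumptions this page does not confirm: that rows are comma-separated, that the flag is written as `1` or `0`, and that lines starting with `#` are comments. `enabled_collections` is a hypothetical helper name, so check the specification before relying on it.

```shell
# enabled_collections CATALOG
# Hypothetical helper: print column 1 (local path) of every row whose
# column 3 (enable flag) is 1.
# ASSUMPTIONS: comma-separated rows, 1/0 flags, '#' comment lines.
# The real catalog format may differ.
enabled_collections() {
    awk -F',' '
        /^[ \t]*#/ || NF < 3 { next }    # skip comment lines and short rows
        {
            # trim leading and trailing blanks from the three fields
            for (i = 1; i <= 3; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($3 == "1") print $1
        }
    ' "$1"
}
```

For example: `enabled_collections catalog1.csv`.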

Step 2: Fetch the metadata for each collection

Before you can run commands that list the data files that are missing, you need to download (fetch) the metadata for each collection. This is done with the bashdatacatalog-fetch command:

$ # Fetch the metadata for each collection
$ bashdatacatalog-fetch catalog1.csv
Expected output:
Fetching metadata from 'https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection1/'
Fetching metadata from 'https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/'

After running this command, you will notice that a directory has been created for each collection. These directories contain hidden files with the metadata needed to determine the differences between the local copies of the collections and the remote collections. See the specification for more details.
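
Because the metadata files are hidden, a plain `ls` will not show them. A small sketch for listing them follows; `list_hidden_files` is a hypothetical helper name, and it makes no assumption about what the metadata files are called.

```shell
# list_hidden_files DIR
# Hypothetical helper: print every hidden regular file (name starting
# with '.') found under DIR, in sorted order.
list_hidden_files() {
    find "$1" -type f -name '.*' | sort
}
```

For example: `list_hidden_files collection1`.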

(Optional) Step 2.1: List all files in the collections

You can list all the data files in the remote collections with:

$ # Lists all (-a) of the data files in catalog1.csv
$ bashdatacatalog-list -a catalog1.csv
Expected output:
./collection1/file1
./collection1/file2
./collection1/file3
./collection1/sub1/subfile1
./collection1/sub1/subfile2
./collection1/sub1/subfile3
./collection2/2018/file-20181005
./collection2/2018/file-20181105
./collection2/2018/file-20181205
./collection2/2019/file-20190203
./collection2/2019/file-20190403
./collection2/2019/file-20190803
./collection2/file1
./collection2/file2
./collection2/file3

(Optional) Step 2.2: List all the files that are missing locally

You can list the data files that are missing from your local collections with:

$ # Lists all (-a) data files that are missing (-m) locally
$ bashdatacatalog-list -am catalog1.csv
Expected output:
./collection1/file1
./collection1/file2
./collection1/file3
./collection1/sub1/subfile1
./collection1/sub1/subfile2
./collection1/sub1/subfile3
./collection2/2018/file-20181005
./collection2/2018/file-20181105
./collection2/2018/file-20181205
./collection2/2019/file-20190203
./collection2/2019/file-20190403
./collection2/2019/file-20190803
./collection2/file1
./collection2/file2
./collection2/file3

The output from Step 2.1 and Step 2.2 should be the same because we haven't downloaded any data yet (see Step 3).
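
If you would rather check this than compare the two listings by eye, you can save each listing to a file and compare them. A minimal sketch, assuming a hypothetical helper named `listings_match`:

```shell
# listings_match FILE_A FILE_B
# Hypothetical helper: succeed if both files contain the same lines,
# regardless of the order in which they appear.
listings_match() {
    a=$(sort "$1") || return 1
    b=$(sort "$2") || return 1
    [ "$a" = "$b" ]
}

# Example usage with the commands from Step 2.1 and Step 2.2:
#   bashdatacatalog-list -a  catalog1.csv > all.txt
#   bashdatacatalog-list -am catalog1.csv > missing.txt
#   listings_match all.txt missing.txt && echo "No data downloaded yet"
```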

Step 3: Download all missing files

The bashdatacatalog-list command is the workhorse of the bashdatacatalog. It generates a list of files, and its flags control both the list format (e.g., a list of paths or a list of URLs) and which files are included in the list (e.g., all missing files in a given date range or all files with invalid checksums). See bashdatacatalog-list -h for the available flags.

You can download all of the missing files in the data collections with:

$ bashdatacatalog-list -am -f xargs-curl catalog1.csv | xargs curl
Expected output:
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection1/file1 -> ./collection1/file1  [19 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection1/file2 -> ./collection1/file2  [26 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection1/file3 -> ./collection1/file3  [26 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection1/sub1/subfile1 -> ./collection1/sub1/subfile1  [38 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection1/sub1/subfile2 -> ./collection1/sub1/subfile2  [45 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection1/sub1/subfile3 -> ./collection1/sub1/subfile3  [44 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/2018/file-20181005 -> ./collection2/2018/file-20181005  [38 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/2018/file-20181105 -> ./collection2/2018/file-20181105  [45 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/2018/file-20181205 -> ./collection2/2018/file-20181205  [44 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/2019/file-20190203 -> ./collection2/2019/file-20190203  [38 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/2019/file-20190403 -> ./collection2/2019/file-20190403  [45 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/2019/file-20190803 -> ./collection2/2019/file-20190803  [44 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/file1 -> ./collection2/file1  [19 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/file2 -> ./collection2/file2  [26 bytes]
(status 200)  https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/collection2/file3 -> ./collection2/file3  [19 bytes]

See "Useful Commands" for an overview of helpful commands.
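
Before starting a large download, you may want to see what the pipeline in Step 3 would run. Putting `echo` in front of `curl` makes xargs print the command line instead of executing it. The sketch below assumes a hypothetical helper named `preview_curl`. The exact arguments emitted by `-f xargs-curl` are not shown on this page, so it treats them as opaque.

```shell
# preview_curl
# Hypothetical helper: read curl arguments on stdin and print the command
# line(s) that `xargs curl` would run, without downloading anything.
preview_curl() {
    xargs echo curl
}

# Example usage:
#   bashdatacatalog-list -am -f xargs-curl catalog1.csv | preview_curl
```

Once the preview looks right, replace `preview_curl` with `xargs curl` to perform the download.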

(Optional) Step 3.1: Confirm that all the files exist locally

You can confirm that all files were downloaded by running the command that lists all missing files:

$ bashdatacatalog-list -am catalog1.csv

You shouldn't see any missing files.
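
In a script, the same check can be turned into a pass/fail result. This sketch assumes that the list command prints nothing when no files are missing. `check_complete` is a hypothetical helper name.

```shell
# check_complete
# Hypothetical helper: read a list of missing files on stdin and succeed
# only if that list is empty.
# ASSUMPTION: the list command prints nothing when no files are missing.
check_complete() {
    missing=$(cat)
    if [ -n "$missing" ]; then
        printf 'Still missing:\n%s\n' "$missing" >&2
        return 1
    fi
    echo "All files are present."
}

# Example usage:
#   bashdatacatalog-list -am catalog1.csv | check_complete
```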