Skip to content

Latest commit

 

History

History
319 lines (246 loc) · 8.68 KB

05-csvpack.md

File metadata and controls

319 lines (246 loc) · 8.68 KB

Day 5 - csvpack Gem - Work with Tabular Data Packages (.CSV Files w/ datapackage.json) using SQLite In-Memory (w/ ActiveRecord)

Written by {% avatar geraldb %} Gerald Bauer

A code monkey and enthusiastic collector (and publisher) of open football and beer data. Skills include Ruby, SQLite and CSV. Spec lead of CSV <3 JSON.

Let's say you want to share your data with the world or use the data that others share with the world. How to get started? A pragmatic way is to use tabular data packages.

What's a tabular data package?

Tabular Data Package is a simple structure for publishing and sharing tabular data with the following key features:

  • Data is stored in CSV (comma separated values) files
  • Metadata about the dataset both general (e.g. title, author) and the specific data files (e.g. schema) is stored in a single JSON file named datapackage.json which follows the Data Package format

(Source: Tabular Data Packages, Data Hub)

Here's a minimal example of a tabular data package holding two files, that is, data.csv and datapackage.json:

data.csv:

Brewery,City,Name,Abv
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
Augustiner Bräu München,München,Edelstoff,5.6%
Bayerische Staatsbrauerei Weihenstephan,Freising,Hefe Weissbier,5.4%
Brauerei Spezial,Bamberg,Rauchbier Märzen,5.1%
Hacker-Pschorr Bräu,München,Münchner Dunkel,5.0%
Staatliches Hofbräuhaus München,München,Hofbräu Oktoberfestbier,6.3%
...

datapackage.json:

{
  "name": "beer",
  "resources": [
    {
      "path": "data.csv",
      "schema": {
        "fields": [ { "name": "Brewery",   "type": "string" },
                    { "name": "City",      "type": "string" },
                    { "name": "Name",      "type": "string" },
                    { "name": "Abv",       "type": "number" } ]
      }
    }
  ]
}

Where to find data packages?

For some "real world" examples see the Data Packages Listing at Data Hub's Core Datasets site for a start. Tabular data packages include:

Name Comments
country-codes Comprehensive country codes: ISO 3166, ITU, ISO 4217 currency codes and many more
language-codes ISO Language Codes (639-1 and 693-2)
currency-codes ISO 4217 Currency Codes
gdb Country, Regional and World GDP (Gross Domestic Product)
s-and-p-500-companies S&P 500 Companies with Financial Information
un-locode UN-LOCODE Codelist

and many more.

What's the csvpack gem?

Now the questions is how to work with tabular data packages in Ruby. Let's try the csvpack gem.

require 'csvpack'

CsvPack.import(
  's-and-p-500-companies',
  'gdb'
)

Using CsvPack.import will:

  1. download all data packages to the ./pack folder

  2. (auto-)add all tables to an in-memory SQLite database using SQL create_table statements via ActiveRecord migrations e.g.

create_table :constituents_financials do |t|
  t.string :symbol            # Symbol         (string)
  t.string :name              # Name           (string)
  t.string :sector            # Sector         (string)
  t.float  :price             # Price          (number)
  t.float  :dividend_yield    # Dividend Yield (number)
  t.float  :price_earnings    # Price/Earnings (number)
  t.float  :earnings_share    # Earnings/Share (number)
  t.float  :book_value        # Book Value     (number)
  t.float  :_52_week_low      # 52 week low    (number)
  t.float  :_52_week_high     # 52 week high   (number)
  t.float  :market_cap        # Market Cap     (number)
  t.float  :ebitda            # EBITDA         (number)
  t.float  :price_sales       # Price/Sales    (number)
  t.float  :price_book        # Price/Book     (number)
  t.string :sec_filings       # SEC Filings    (string)
end
  1. (auto-)import all records using SQL inserts e.g.
INSERT INTO constituents_financials
  (symbol,
   name,
   sector,
   price,
   dividend_yield,
   price_earnings,
   earnings_share,
   book_value,
   _52_week_low,
   _52_week_high,
   market_cap,
   ebitda,
   price_sales,
   price_book,
   sec_filings)
VALUES
  ('MMM',
   '3M Co',
   'Industrials',
   162.27,
   2.11,
   22.28,
   7.284,
   25.238,
   123.61,
   162.92,
   104.0,
   8.467,
   3.28,
   6.43,
   'http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM')
  1. (auto-)add ActiveRecord models for all tables.

So what? Now you can use all the "magic" of ActiveRecord to query the datasets. Example:

puts "Constituent.count: #{Constituent.count}"

# SELECT COUNT(*) FROM "constituents"
# => 496

pp Constituent.first

# SELECT  "constituents".* FROM "constituents" ORDER BY "constituents"."id" ASC LIMIT 1
# => #<Constituent:0x9f8cb78
#         id:     1,
#         symbol: "MMM",
#         name:   "3M Co",
#         sector: "Industrials">

pp Constituent.find_by!( symbol: 'MMM' )

# SELECT  "constituents".*
#         FROM "constituents"
#         WHERE "constituents"."symbol" = "MMM"
#         LIMIT 1
# => #<Constituent:0x9f8cb78
#         id:     1,
#         symbol: "MMM",
#         name:   "3M Co",
#         sector: "Industrials">

pp Constituent.find_by!( name: '3M Co' )

# SELECT  "constituents".*
#          FROM "constituents"
#          WHERE "constituents"."name" = "3M Co"
#          LIMIT 1
# => #<Constituent:0x9f8cb78
#         id:     1,
#         symbol: "MMM",
#         name:   "3M Co",
#         sector: "Industrials">

pp Constituent.where( sector: 'Industrials' ).count

# SELECT COUNT(*) FROM "constituents"
#         WHERE "constituents"."sector" = "Industrials"
# => 63

pp Constituent.where( sector: 'Industrials' ).all

# SELECT "constituents".*
#         FROM "constituents"
#         WHERE "constituents"."sector" = "Industrials"
# => [#<Constituent:0x9f8cb78
#          id:     1,
#          symbol: "MMM",
#          name:   "3M Co",
#          sector: "Industrials">,
#      #<Constituent:0xa2a4180
#          id:     8,
#          symbol: "ADT",
#          name:   "ADT Corp (The)",
#          sector: "Industrials">,...]

and so on.

How to dowload a data package ("by hand")?

Use the CsvPack::Downloader class to download a data package to your disk (by default data packages get stored in ./pack).

dl = CsvPack::Downloader.new
dl.fetch( 'language-codes' )
dl.fetch( 's-and-p-500-companies' )
dl.fetch( 'un-locode' )

Will result in:

-- pack
   |-- language-codes
   |   |-- data
   |   |   |-- language-codes-3b2.csv
   |   |   |-- language-codes.csv
   |   |   `-- language-codes-full.csv
   |   `-- datapackage.json
   |-- s-and-p-500-companies
   |   |-- data
   |   |   |-- constituents.csv
   |   |   `-- constituents-financials.csv
   |   `-- datapackage.json
   `-- un-locode
       |-- data
       |   |-- code-list.csv
       |   |-- country-codes.csv
       |   |-- function-classifiers.csv
       |   |-- status-indicators.csv
       |   `-- subdivision-codes.csv
       `-- datapackage.json

How to add and import a data package ("by hand")?

Use the CsvPack::Pack class to read-in a data package and add and import into an SQL database.

pack = CsvPack::Pack.new( './pack/un-locode/datapackage.json' )
pack.tables.each do |table|
  table.up!      # (auto-) add table  using SQL create_table via ActiveRecord migration
  table.import!  # import all records using SQL inserts
end

That's it.

Bonus: How to connect to a different SQL database?

You can connect to any database supported by ActiveRecord. If you do NOT establish a connection in your script - the standard (default fallback) is using an in-memory SQLite3 database.

SQLite

For example, to create an SQLite3 database on disk - lets say data.db - use in your script (before the CsvPack.import statement):

ActiveRecord::Base.establish_connection( adapter:  'sqlite3',
                                         database: './data.db' )

PostgreSQL

For example, to connect to a PostgreSQL database use in your script (before the CsvPack.import statement):

require 'pg'       ##  pull-in PostgreSQL (pg) machinery

ActiveRecord::Base.establish_connection( adapter:  'postgresql',
                                         username: 'ruby',
                                         password: 'topsecret',
                                         database: 'database' )

Find Out More

References