Scripts and schemas that aim to make data from the inventory easier to analyse

Samsomyajit/data-pipeline

 
 

Campaign Lab Data Pipeline

What?

  • We want to structure our dataset (see "Campaign Lab Data Inventory").
  • To do this, we first need to define the structure (schema) of each data source.
  • This will later help us build modules that transform the raw data into our target data, ready for export to a database, an R package, or any other tool that uses the data in a highly structured and annotated format.

How can I contribute?

  • We need to go through each of the data sources defined in "Campaign Lab Data Inventory" and create a transformer (in the transformers folder) and an associated schema for each one.

  • The transformer should run locally on a contributor's machine, downloading the data and transforming it into a CSV file (which is later imported into a local database).

  • To contribute:

    1. Open an Issue titled description-rowIdentifier, where description and rowIdentifier are the values from the Excel spreadsheet "Campaign Lab Data Inventory".
    2. Write a short description of the dataset you are transforming and creating a schema for.
    3. Open a Pull Request (from a branch with an appropriate name) when you're finished.
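As a rough illustration of what a transformer might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (download_text, transform, to_csv), the raw column names (County, Votes, Party), and the target fields, which are borrowed from the example schema later in this README. It is not the project's actual transformer layout.

```python
import csv
import io
import urllib.request

# Target fields, borrowed from the example schema in this README.
TARGET_FIELDS = ["county", "number_of_votes", "party"]


def download_text(url: str) -> str:
    """Fetch the raw dataset as text (assumes a UTF-8 encoded CSV file)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def transform(raw_csv: str) -> list:
    """Map raw rows onto the target fields.

    The raw column names used here are hypothetical.
    """
    rows = []
    for raw in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({
            "county": raw["County"].strip(),
            # Strip thousands separators before converting to an integer.
            "number_of_votes": int(raw["Votes"].replace(",", "")),
            "party": raw["Party"].strip(),
        })
    return rows


def to_csv(rows: list) -> str:
    """Serialise the transformed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=TARGET_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # A tiny inline sample stands in for download_text(<dataset url>).
    sample = 'County,Votes,Party\n Kent ,"12,345",Example Party\n'
    print(to_csv(transform(sample)))
```

Keeping the download step separate from the transform step means the transform can be tried out on a small inline sample without network access.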

Formatting

  • We need to make sure that similar fields are formatted the same way across data sources.

  • For now, the standardization is as follows:

      • Timestamp fields: ISO 8601 in UTC with millisecond precision, e.g. 2015-06-30T22:30:00.000Z
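One possible way to produce timestamps in this shape from Python is sketched below. The helper name to_standard_timestamp is hypothetical, and treating naive datetimes as UTC is an assumption that each transformer would need to check against its own source data.

```python
from datetime import datetime, timedelta, timezone


def to_standard_timestamp(value: datetime) -> str:
    """Format a datetime as UTC with milliseconds and a trailing Z."""
    if value.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        value = value.replace(tzinfo=timezone.utc)
    utc_value = value.astimezone(timezone.utc)
    return utc_value.isoformat(timespec="milliseconds").replace("+00:00", "Z")


if __name__ == "__main__":
    # A naive datetime, treated as UTC.
    print(to_standard_timestamp(datetime(2015, 6, 30, 22, 30)))
    # The same instant given with a +01:00 offset is converted to UTC.
    one_hour_ahead = timezone(timedelta(hours=1))
    print(to_standard_timestamp(datetime(2015, 6, 30, 23, 30, tzinfo=one_hour_ahead)))
```

Both calls print 2015-06-30T22:30:00.000Z.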

What is a schema?

  • A schema in this case is simply a JSON (JavaScript Object Notation) document that describes the structure and format of the dataset.

  • An example schema would be:

{
  "title": "Election results",
  "source": "https://data.police.uk/docs/method/forces/",
  "description": "A dataset of election results",
  "properties": {
    "county": {
      "type": ["string"],
      "description": "The county in which the result was recorded"
    },
    "number_of_votes": {
      "type": ["integer"],
      "description": "The number of votes that were received"
    },
    "party": {
      "type": ["string"],
      "description": "The party that received the votes"
    }
  }
}
  • title is the name of the dataset (you can make this up).
  • source is a link (if available) to the actual dataset.
  • description is a one-line summary of the dataset.
  • properties describes the fields (data points) that we want to end up with after transforming the raw dataset; each entry gives the field's type and a short description.
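To show how such a schema could be put to use, the following Python sketch checks a transformed row against the properties of the example schema. The validate_row helper and the PYTHON_TYPES mapping are hypothetical, and only the two type names that appear in the example ("string" and "integer") are handled. This is a hand-rolled check for illustration, not a full JSON Schema validator.

```python
# The example schema, trimmed to the parts the check needs.
SCHEMA = {
    "title": "Election results",
    "properties": {
        "county": {"type": ["string"]},
        "number_of_votes": {"type": ["integer"]},
        "party": {"type": ["string"]},
    },
}

# Maps the type names used in the schema to Python types (assumed mapping).
PYTHON_TYPES = {"string": str, "integer": int}


def validate_row(row: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the row matches."""
    problems = []
    for name, spec in schema["properties"].items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        allowed = tuple(PYTHON_TYPES[t] for t in spec["type"])
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    for name in row:
        if name not in schema["properties"]:
            problems.append(f"unexpected field: {name}")
    return problems


if __name__ == "__main__":
    good = {"county": "Kent", "number_of_votes": 12345, "party": "Example Party"}
    bad = {"county": "Kent", "number_of_votes": "12345"}
    print(validate_row(good, SCHEMA))
    print(validate_row(bad, SCHEMA))
```

Running a check like this at the end of a transformer would catch rows where, for example, a vote count was left as a string.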
