Merge pull request #10 from WorldFishCenter/preprocess-landings
Pre-process landings data
efcaguab authored Apr 5, 2021
2 parents 40980f7 + bdbdeb1 commit 7867a2d
Showing 21 changed files with 951 additions and 42 deletions.
28 changes: 21 additions & 7 deletions .github/workflows/data-pipeline.yaml
@@ -14,10 +14,8 @@ jobs:
r-image-name: ${{ steps.build-docker.outputs.FULL_IMAGE_NAME }}
r-config: ${{ steps.setvars.outputs.r-config }}
steps:

- name: Checkout repository
uses: actions/checkout@v2

- name: Set variables
id: setvars
run: |
@@ -26,19 +24,16 @@ jobs:
else
echo "::set-output name=r-config::default"
fi
- name: Get smart tag for docker image
id: get-tag
uses: Surgo/docker-smart-tag-action@v1

# This step is necessary to remove the colon from the beginning of the tag
- name: Remove colon from smart tag
id: remove-tag-colon
env:
IMAGE_TAG: ${{ steps.get-tag.outputs.tag }}
run: |
echo "::set-output name=tag::${IMAGE_TAG:1}"
- name: Build image with cache
id: build-docker
uses: whoan/docker-build-with-cache-action@v5
@@ -50,6 +45,7 @@ jobs:
push_git_tag: true
dockerfile: Dockerfile.prod


ingest-landings:
name: Ingest landings
needs: build-container
@@ -64,9 +60,27 @@ jobs:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
steps:

- name: Get session info
run: Rscript -e 'sessioninfo::session_info()'

- name: Call ingest_timor_landings()
run: Rscript -e 'peskas.timor.data.pipeline::ingest_timor_landings()'


preprocess-landings:
name: Preprocess landings
needs: [ingest-landings, build-container]
runs-on: ubuntu-20.04
container:
image: ${{needs.build-container.outputs.r-image-name}}
env:
R_CONFIG_ACTIVE: ${{ needs.build-container.outputs.r-config }}
KOBO_TOKEN: ${{ secrets.PESKAS_KOBO_TOKEN }}
GCP_SA_KEY: ${{ secrets.PESKAS_DATAINGESTION_GCS_KEY }}
credentials:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
steps:
- name: Get session info
run: Rscript -e 'sessioninfo::session_info()'
- name: Call preprocess_landings()
run: Rscript -e 'peskas.timor.data.pipeline::preprocess_landings()'
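
The colon-removal step in the workflow above relies on Bash substring expansion, `${var:offset}`. A standalone sketch (the tag value `:pr-10` is an invented example, not one produced by this repository):

```shell
# A smart tag can arrive with a leading colon (e.g. ":pr-10"), which is not
# a valid Docker tag. ${IMAGE_TAG:1} expands to the string minus its first
# character; the workflow then exposes that value as a step output.
IMAGE_TAG=":pr-10"   # example value; in CI this comes from the smart-tag action
echo "::set-output name=tag::${IMAGE_TAG:1}"
# The stripped tag alone:
echo "${IMAGE_TAG:1}"   # prints: pr-10
```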
13 changes: 10 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: peskas.timor.data.pipeline
Title: Functions to Implement the Timor Small Scale Fisheries
Data Pipeline
- Version: 0.2.0
+ Version: 0.3.0
Authors@R:
c(person(given = "Fernando",
family = "Cagua",
@@ -15,19 +15,26 @@ Description: This package implements the data and modelling
License: GPL-3
Imports:
config,
dplyr,
git2r,
httr,
logger,
magrittr,
- purrr
+ purrr,
readr,
stringr,
tidyr,
rlang (>= 0.1.2)
Suggests:
covr,
googleAuthR,
googleCloudStorageR,
jsonlite,
RCurl,
remotes,
sessioninfo,
- testthat
+ testthat,
roxygen2
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
2 changes: 1 addition & 1 deletion Dockerfile
@@ -2,7 +2,7 @@ FROM rocker/geospatial:4.0.3

# Extra R packages
RUN install2.r --error --skipinstalled \
- config git2r httr jsonlite logger magrittr purrr covr googleCloudStorageR RCurl remotes sessioninfo testthat
+ config git2r httr jsonlite logger magrittr purrr covr googleCloudStorageR RCurl readr remotes sessioninfo stringr testthat

# Rstudio interface preferences
COPY rstudio-prefs.json /home/rstudio/.config/rstudio/rstudio-prefs.json
4 changes: 2 additions & 2 deletions Dockerfile.prod
@@ -15,11 +15,11 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \

# Install imports
RUN install2.r --error --skipinstalled \
- config git2r httr logger magrittr purrr
+ config dplyr git2r httr logger magrittr purrr readr stringr tidyr rlang

# Install suggests
RUN install2.r --error --skipinstalled \
- covr googleCloudStorageR jsonlite RCurl remotes sessioninfo testthat
+ covr googleCloudStorageR jsonlite RCurl remotes roxygen2 sessioninfo testthat

# Install local package
COPY . /home
24 changes: 24 additions & 0 deletions NAMESPACE
@@ -1,11 +1,35 @@
# Generated by roxygen2: do not edit by hand

export("%>%")
export(":=")
export(.data)
export(add_version)
export(as_label)
export(as_name)
export(cloud_object_name)
export(cloud_storage_authenticate)
export(download_cloud_file)
export(enquo)
export(enquos)
export(expr)
export(get_host_url)
export(ingest_timor_landings)
export(preprocess_landings)
export(pt_nest_attachments)
export(pt_nest_species)
export(retrieve_survey)
export(retrieve_survey_data)
export(retrieve_survey_metadata)
export(sym)
export(syms)
export(upload_cloud_file)
importFrom(magrittr,"%>%")
importFrom(rlang,":=")
importFrom(rlang,.data)
importFrom(rlang,as_label)
importFrom(rlang,as_name)
importFrom(rlang,enquo)
importFrom(rlang,enquos)
importFrom(rlang,expr)
importFrom(rlang,sym)
importFrom(rlang,syms)
40 changes: 27 additions & 13 deletions NEWS.md
@@ -1,27 +1,41 @@
# peskas.timor.data.pipeline 0.3.0

### New features

- The preprocessing of East Timor landings is implemented in `preprocess_landings()`
- Added `pt_nest_attachments()` to group all attachment columns into a nested column containing data frames.
- Added `pt_nest_species()` to group all species columns into a nested column containing data frames.
- Added `cloud_object_name()` as a complement to `add_version()` to return the latest or a specified version of an object in a storage location.
- Added `download_cloud_file()` to download files from cloud storage providers.

### Improvements

- Now using `cloud_storage_authenticate()` to internally authenticate to cloud storage instead of authenticating separately in each cloud function. This simplifies authentication and ensures authentication is not attempted when credentials have already been validated.

# peskas.timor.data.pipeline 0.2.0

- ## Breaking changes
+ ### Breaking changes

- * `download_survey_data()`, `download_survey_metadata()`, and `download_survey()` have been renamed to `retrieve_survey_data()`, `retrieve_survey_metadata()`, and `retrieve_survey()`. This is to avoid confusion with planned functions that download data from cloud locations.
- * The suffix *raw* and *metadata* that is appended to the prefix when retrieving survey information is now separated using "_" rather than "-". This is to more easily distinguish between information encoded in the file name.
+ - `download_survey_data()`, `download_survey_metadata()`, and `download_survey()` have been renamed to `retrieve_survey_data()`, `retrieve_survey_metadata()`, and `retrieve_survey()`. This is to avoid confusion with planned functions that download data from cloud locations.
+ - The suffixes *raw* and *metadata* that are appended to the prefix when retrieving survey information are now separated using "_" rather than "-". This is to more easily distinguish between information encoded in the file name.

- ## New features
+ ### New features

- * The prefix name of surveys is not hard-coded and can be specified in the config file (`file_prefix` field).
+ - The prefix name of surveys is no longer hard-coded and can be specified in the config file (`file_prefix` field).

# peskas.timor.data.pipeline 0.1.0

Adds infrastructure to download survey data and upload it to cloud storage providers and implements the ingestion of East Timor landings.

- ## New features
+ ### New features

- * The ingestion of East Timor Landings is implemented in `ingest_timor_landings()`.
- * The functions `download_survey_data()` and `download_survey_metadata()` which download data and metadata for an electronic survey hosted by *kobo*, *kobohr*, or *ona*.
- * `download_survey()` can be used as a wrapper to download data and metadata in a single call.
- * `upload_cloud_file()` can be used to upload a set of files to a cloud storage bucket. Currently only Google Cloud Services (GCS) is supported.
- * `add_version()` is an utility function that appends date-time and sha information to a string and is used to version file names.
- * `get_host_url()` is an utility function that gets the host url of an electronic survey provider API.
+ - The ingestion of East Timor Landings is implemented in `ingest_timor_landings()`.
+ - The functions `download_survey_data()` and `download_survey_metadata()` download data and metadata for an electronic survey hosted by *kobo*, *kobohr*, or *ona*.
+ - `download_survey()` can be used as a wrapper to download data and metadata in a single call.
+ - `upload_cloud_file()` can be used to upload a set of files to a cloud storage bucket. Currently only Google Cloud Services (GCS) is supported.
+ - `add_version()` is a utility function that appends date-time and sha information to a string and is used to version file names.
+ - `get_host_url()` is a utility function that gets the host URL of an electronic survey provider API.
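
The versioning that `add_version()` provides, and that `cloud_object_name()` later resolves, can be sketched in shell. The prefix, separators, and exact name format below are assumptions for illustration only, not the package's actual scheme:

```shell
# Hypothetical sketch: build a versioned object name by appending a
# date-time stamp and a short commit sha to a file prefix, so that the
# latest (or any specific) version can later be picked out by name.
prefix="timor-landings_raw"    # assumed prefix; cf. the config's file_prefix field
timestamp="20210405120000"     # in practice: $(date +%Y%m%d%H%M%S)
sha="40980f7"                  # in practice: $(git rev-parse --short HEAD)
versioned_name="${prefix}__${timestamp}__${sha}.csv"
echo "${versioned_name}"       # prints: timor-landings_raw__20210405120000__40980f7.csv
```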

- ## Pipeline
+ ### Pipeline

The data pipeline is implemented and run in GitHub Actions on a schedule.
