This markdown file describes the components and setup of the daily analysis CRON job that updates the ShellCast web application at https://go.ncsu.edu/shellcast.
The main purpose of the scripts in the analysis folder is to: (1) pull rainfall data from a remote server at the North Carolina State Climate Office, (2) do some calculations with that rainfall data, and (3) update the ShellCast MySQL database based on those calculations. For a schematic representation of this workflow, including how these scripts relate to the other major components of the ShellCast web application, see the ShellCast architecture overview flowchart.
The daily analysis CRON job ensures that the three steps described above run every day at 6am ET on the virtual computing lab server (or on a personal machine).
NOTE: As of 2020-12-18, the analysis is running on a personal computer, and the weekly analysis CRON job is not functional because the North Carolina Division of Marine Fisheries lease app REST API is not finalized/functional. Once it is, ShellCast will require a second (weekly) CRON job that updates shellfish lease information, which is required when running the rainfall calculations script (see step 2 above).
- North Carolina State University (NCSU)
- North Carolina Division of Marine Fisheries (NCDMF)
- National Digital Forecast Dataset (NDFD)
- North Carolina State Climate Office (SCO)
- Virtual Computing Lab (VCL)
- Google Cloud Platform (GCP)
1. `ndfd_get_forecast_data_script.py` - This script gets the NDFD .bin file from the SCO server and converts it to a pandas dataframe. This script is run daily.
2. `ndfd_convert_df_to_raster_script.R` - This script converts the NDFD pandas dataframe to a raster object that is used for downstream R analysis. This script is run daily.
3. `ndfd_analyze_forecast_data_script.R` - This script takes the raster object as well as other spatial information about the NC coast (shellfish growing area boundaries, conditional management boundaries, lease boundaries, etc.) and does calculations for each scale so they can be used to update the ShellCast MySQL database. This script is run daily.
4. `gcp_update_mysqldb_script.py` - This script takes the data outputs from the analysis script and pushes them to the ShellCast MySQL database. This script is run daily.
5. `ncdmf_tidy_sga_data_script.R` - This script takes the NCDMF shellfish growing area boundaries spatial dataset and cleans it up for use with the analysis script listed above (number 3). This script is run annually, when shellfish growing areas change.
6. `ncdmf_tidy_cmu_bounds_script.R` - This script takes the NCDMF conditional management unit boundaries spatial dataset and cleans it up for use with the analysis script listed above (number 3). This script is run annually, when conditional management units change.
7. `ncdmf_get_lease_data_script.R` - This script is not yet included but will be created when the NCDMF finalizes the REST API for its publicly available lease dataset. This dataset is available in a viewer tool here. This script will run weekly to incorporate changes to leases made by NCDMF.
8. `ncdmf_tidy_lease_data_script.R` - This script takes the NCDMF shellfish lease boundaries spatial dataset and cleans it up for use with the analysis script listed above (number 3). This script will run weekly, as leases change.
NOTE: Scripts 1 through 4 are run daily while scripts 5 through 8 are run periodically (weekly or annually, depending on the script). See full script descriptions for specific timing details.
THIS DOCUMENTATION SECTION IS STILL IN PROGRESS.
Setting up the VCL using NCSU computing resources frees up a work machine and also ensures more consistent run conditions, because a work machine doesn't have to stay constantly powered on. There are two major steps to setting up the analysis CRON job on a VCL machine: (1) create the image and (2) launch the image as a server.
Go to VCL at NCSU, click on "Reservations", and log in using your Unity ID and password. After logging in, click on "Reservations" again and then "New Reservation". A window will pop up; select "Imaging Reservation" with Ubuntu 18.04 LTS Base, choose "Now", and pick a duration that's appropriate for setup--at least 1 to 3 hours is recommended (Figure 1). Then click "Create Reservation". You will need to wait a few minutes while the image is created. Click on "Connect!" and you will see a pop-up window with more information on how to connect (Figure 2). Copy the IP address of the image for use in the next step.
Figure 1. New VCL image reservation options.
Figure 2. VCL image IP information.
Once you have created the image and have an IP address, open a new terminal window and follow these steps.
# 1. secure connect to the VCL image using <your unity ID>@<the IP address>
ssh unityid@ip
# 2. download Miniconda 3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# 3. run the Miniconda 3 installer
bash Miniconda3-latest-Linux-x86_64.sh
# 4. exit out of environment and secure connect back in using step 1
# 5. clone the shellcast repo into the image
git clone https://github.ncsu.edu/biosystemsanalyticslab/shellcast.git
# 6. go into the shellcast directory
cd shellcast
# 7. track and checkout the vcl branch
git checkout --track origin/vcl
# 8. if step 7 doesn't work then run this step and try step 7 again
git fetch --all
# 9. return to the home directory
cd
# 10. copy the shellcast environment yaml set up file into the home directory
cp shellcast/analysis/shellcast-env.yml shellcast-env.yml
# 11. use conda to create an environment (i.e., install packages and versions that are compatible) based on the requirements in the shellcast environment yaml file
conda env create --prefix /home/ssaia/env_shellcast -f shellcast-env.yml
# The user will have to replace "ssaia" with their Unity ID. The `--prefix` option means that the environment will only be activated at this particular location.
# 12. activate the environment you created
conda activate /home/ssaia/env_shellcast
# The user will have to replace "ssaia" with their Unity ID.
# 13. see that the packages are loaded
conda list --explicit
# 14. navigate into the shellcast directory
cd shellcast
# 15. copy the Python config file template into the analysis directory
cp config-template.py ./analysis/config.py
# 16. navigate into the analysis directory and fill in the missing parts of the config.py file using nano
cd analysis
nano config.py
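The scheduling step for the VCL image is not documented here yet. As a rough sketch (not the finalized procedure), the daily 6am job might eventually be registered with cron on the Ubuntu image like the example below; the path is a placeholder, and note that the server's timezone may differ from ET.

```bash
# a sketch of registering the daily job with cron on the VCL image;
# replace <unityid> with your Unity ID and adjust the path to match your setup
(crontab -l 2>/dev/null; echo "0 6 * * * /home/<unityid>/shellcast/analysis/shellcast_daily_analysis.sh") | crontab -

# confirm the entry was added
crontab -l
```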
THIS DOCUMENTATION SECTION IS STILL IN PROGRESS.
If VCL options are not available, the CRON job can be set up on a work computer. These instructions apply to a Mac machine running macOS Mojave version 10.14.6 with a 2.3 GHz Intel Core i5 processor, 16 GB of 2133 MHz DDR4 memory, and an Intel Iris Plus Graphics 640 1536 MB graphics card.
To set up the CRON job on the work Mac, there are three main steps: (1) setting up your local machine to run ShellCast analysis scripts, (2) setting up GCP credentials so the ShellCast MySQL database can be updated daily via the Python script, and (3) scheduling the CRON job.
For a full explanation of how to set up your local machine to run ShellCast analysis scripts see DEVELOPER.md.
Run the code below in the command line. Your web browser will pop open and you'll need to give permission to sign into the email account associated with your account on the ShellCast GCP project. You will need administrator privileges on the ShellCast web application to update the MySQL database.
gcloud auth application-default login
Following authentication, the json credential file will be downloaded to /Users/username/.config/gcloud/application_default_credentials.json (e.g., "/Users/sheila/.config/gcloud/application_default_credentials.json"). Copy this location and keep it somewhere safe in case you need it later.
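As an optional sanity check (not a required step), you can ask gcloud to print an access token derived from those credentials; if this succeeds, the credentials were saved correctly.

```bash
# prints an access token if application-default credentials are set up correctly
gcloud auth application-default print-access-token
```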
The daily CRON job uses Mac's launchd program, which should already be installed, and will run each day at 6am as long as the work/host computer is powered on and the CRON job script is still loaded. Text and email notifications are sent out at 7:00am ET by the GCP CRON job. There are several steps to scheduling the CRON job on a Mac.
First, you need to give the terminal permission to run the script. On the Mac, go to Settings > Security & Privacy. Click on Full Disk Access in the left list and go to the Privacy tab (Figure 3). Add Terminal (in Applications > Utilities) to this list. To save this you will have to sign in as an administrator on the machine you're working on. Be sure to lock the administrator privileges again before you close the Settings window.
Figure 3. Full Disk Access Settings window for a Mac.
Next, running a CRON job with the launchd program requires a correctly formatted plist file (here, com.shellcast.dailyanalysis.cronjob.plist). This blog post by Cecina Babich Morrow was especially helpful, and the official documentation is here. If you need help debugging the plist script, LaunchControl is a helpful app for finding errors using the trial version.
Next, you need to copy the config-template.sh file into the analysis directory, save it as config.sh, and edit the paths so they reflect those on your local machine. That process will look something like the following (in the terminal window).
# 1. copy template into the analysis folder as config.sh
cp .../shellcast/config-template.sh .../shellcast/analysis/config.sh
# make sure you replace ".../" with the full path to the shellcast repo
# NOTE: The config.sh file will be ignored by git, but it's best not to put sensitive information in it since it will be printed in the analysis output files.
# 2. use your favorite text editor to change "ADD_VALUE_HERE" to whatever your local machine path is for that field
# 3. save the config.sh file
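For illustration only, an edited config.sh might end up looking something like the sketch below. The variable names here are made up; use whatever fields config-template.sh actually defines.

```bash
# hypothetical config.sh contents -- these variable names are illustrative,
# not the template's actual fields; the values are local paths on your machine
ANALYSIS_PATH="/Users/username/Documents/github_ncsu/shellcast/analysis"
CONDA_ENV_PATH="/Users/username/env_shellcast"
```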
Next, the bash (.sh) script you're running in the CRON job and all the other Python and R scripts that run within the bash script have to be executable. Check that they are executable from the terminal window using ls -l. You should see "x"s in the far left column for each file (e.g., "-rwxr-xr-x"). If a file is not executable (e.g., "-rw-r--r--"), then use chmod to make it executable.
# make a script executable
chmod +x shellcast_daily_analysis.sh
If needed, repeat this use of chmod for each of the Python and R scripts listed below in "CRON Job Script Run Order"; all of them need to be executable.
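For example, you can make the bash script and all four daily scripts (named in "CRON Job Script Run Order" below) executable in one pass from the analysis directory:

```bash
# make the bash script and the four daily analysis scripts executable
chmod +x shellcast_daily_analysis.sh \
         ndfd_get_forecast_data_script.py \
         ndfd_convert_df_to_raster_script.R \
         ndfd_analyze_forecast_data_script.R \
         gcp_update_mysqldb_script.py
```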
Note: I've (Sheila) successfully run the CRON job without the plist file being executable.
Next, when you're ready to run the CRON job, do the following:
In the terminal, navigate to the LaunchAgents directory.
cd ~/Library/LaunchAgents
Then if the plist file is not there, copy it to this location.
# make sure to change the "..." to the full path to the shellcast repo directory
# cp .../analysis/com.shellcast.dailyanalysis.cronjob.plist com.shellcast.dailyanalysis.cronjob.plist
# it will look something like this:
# cp /Users/sheila/Documents/github_ncsu/shellcast/analysis/com.shellcast.dailyanalysis.cronjob.plist com.shellcast.dailyanalysis.cronjob.plist
Then check that you're working with the right plist file using nano.
nano com.shellcast.dailyanalysis.cronjob.plist
Or with atom.
atom com.shellcast.dailyanalysis.cronjob.plist
Next, change the paths in the plist file so they are appropriate for your Mac machine. This includes (1) the ProgramArguments section, which is the path to the shellcast_daily_analysis.sh file, (2) the WorkingDirectory section, which is the path to the ShellCast analysis directory, and (3) the StandardErrorPath and StandardOutPath sections, which are paths to the error (.err) and output (.out) files in the analysis data directory. Make sure to save changes to the plist file. See "Description of CRON Job Outputs" below for recommendations on where to direct outputs.
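For reference, a minimal plist covering those sections might look like the sketch below (written as a heredoc for easy pasting). The Label key is required by launchd, all /Users/username/... paths are placeholders, and the repo's plist file is the authoritative version.

```bash
# a minimal sketch of the plist -- replace the placeholder paths with your own;
# the repo's com.shellcast.dailyanalysis.cronjob.plist is the authoritative version
cat > ~/Library/LaunchAgents/com.shellcast.dailyanalysis.cronjob.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.shellcast.dailyanalysis.cronjob</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/username/shellcast/analysis/shellcast_daily_analysis.sh</string>
  </array>
  <key>WorkingDirectory</key>
  <string>/Users/username/shellcast/analysis</string>
  <key>StandardErrorPath</key>
  <string>/Users/username/shellcast/analysis/data/tabular/outputs/cronjob_data/cronjob.err</string>
  <key>StandardOutPath</key>
  <string>/Users/username/shellcast/analysis/data/tabular/outputs/cronjob_data/cronjob.out</string>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key>
    <integer>6</integer>
    <key>Minute</key>
    <integer>0</integer>
  </dict>
</dict>
</plist>
EOF
```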
Then, to load the CRON job, run the following in the LaunchAgents directory.
launchctl load com.shellcast.dailyanalysis.cronjob.plist
To stop the CRON job, run the following in the LaunchAgents directory.
launchctl unload com.shellcast.dailyanalysis.cronjob.plist
To see if a LaunchAgent is loaded you can use the following.
launchctl list
Also, you can go to Applications > Utilities > Console and look at the system log to see currently loaded and active programs.
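Since launchctl list prints every loaded agent, it can help to filter for the ShellCast job; the search term below assumes the job's label echoes the plist file name.

```bash
# filter the loaded agents for the ShellCast job
launchctl list | grep shellcast
```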
Last, if you want to check that the plist script loaded ok or need help debugging it, LaunchControl is a helpful app for finding errors using the trial version. You can also check that the status of the CRON job is "Ok", as in Figure 4.
Figure 4. LaunchControl screenshot.
If debugging (see Section 4.3 below), you can open up LaunchControl to check that the plist file is unloaded. Change the time in the plist file, load it, wait, and then check LaunchControl for the status. Sometimes the errors in LaunchControl are not helpful (e.g., "Error 1"), but other times it will tell you if you need to make the bash script executable. When in doubt, you might have a process running from a previous attempt that you have to kill. To do this use htop: search within htop for "sql" and kill the process. Then start again by checking that the script is unloaded, reload it, wait, etc. It's a little tedious...typical debugging.
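If you don't have htop installed, a plain ps/kill sequence works the same way (the "sql" search term follows the htop suggestion above):

```bash
# find the stuck process; the PID is in the second column
ps aux | grep -i sql

# then kill it, replacing <PID> with the number you found
kill <PID>
```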
Depending on the number of plist files you have on your local machine, you may need to increase the priority of the ShellCast plist file so that the code runs faster (because it has higher priority). To do this, go into the plist file, increase the priority level (from 10 to 15, for example), save the plist file, unload it, reload it, and check the amount of time it takes to run the next morning.
If you don't want to wait until 6am to test whether the plist script works (very likely!), you can navigate to the plist script in your LaunchAgents directory, open it, and edit the time to something like 5 minutes in the future. I recommend saving the file, then unloading and re-loading it with the bash commands given above to make sure the changes are present. You can also open up the LaunchControl app and check that the time changed.
Change the hour and minute integers in the StartCalendarInterval section of the plist file. For example, if you wanted to run the script at 4:05pm your local time, that section would look like the following.
<!-- now trying to run it at 4:05pm my local time -->
<key>StartCalendarInterval</key>
<dict>
<key>Hour</key>
<integer>16</integer>
<key>Minute</key>
<integer>5</integer>
</dict>
Note that hour ranges from 0-23 and minute ranges from 0-59. The timezone used will be whatever timezone your computer uses. Also, <!-- text here --> is the syntax for comments in plist files.
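Putting the unload/load commands from above together, a quick reload after editing the time looks like this (run from the LaunchAgents directory):

```bash
# reload the job so launchd picks up the edited time
launchctl unload com.shellcast.dailyanalysis.cronjob.plist
launchctl load com.shellcast.dailyanalysis.cronjob.plist
```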
If you're having issues with the CRON job running, you can also try running the bash script on its own and checking whether a particular script is giving issues.
To run the bash script (.sh) outside of a CRON job (for debugging), use the code below. This must be run from the analysis directory. Outputs from each R and Python script will be saved into the terminal_data directory.
The bash (.sh) script, as well as all the other Python and R scripts that run within it, have to be executable. Check that they are executable from the terminal window using ls -l. You should see "x"s in the far left column for the file (e.g., "-rwxr-xr-x"). If it's not executable (e.g., "-rw-r--r--"), then use chmod to make it executable.
chmod +x shellcast_daily_analysis.sh
If needed, repeat this use of chmod for each of the Python and R scripts listed below in "CRON Job Script Run Order".
To run the bash script, open the terminal in the analysis directory and type the following:
sh shellcast_daily_analysis.sh
# for debugging
# sh shellcast_daily_analysis_debug.sh
If you want to check whether the SQL database updated correctly, follow these steps.
1. Open up a new terminal window, navigate to your home directory (type `cd`), and enter the code below. This assumes that you've already set up the Google Cloud TCP connection described in DEVELOPER.md.
   ./cloud_sql_proxy -instances=ncsu-shellcast:us-east1:ncsu-shellcast-database=tcp:3306
2. Wait for it to connect. It should say something like "Ready for new connections" when you can move to the next step.
3. Open up Sequel Pro and enter the database name and password for the Google Cloud SQL database. The port number is 3306.
4. On the top left corner of Sequel Pro, click on "Choose Database..." and navigate to `shellcast`. You can click on the `cmu_probabilites` table, select the "Content" tab, and sort the "created" column to see if your script ran and updated the database (Figure 5; for a command-line alternative, see the sketch after this list). You can also look at Section 6 below for more info on files with helpful debugging information.
5. Close the connection by going back to the terminal from step 1 and typing control+C. You will get a message that says something like "Received TERM signal. Waiting up to 0s before terminating."
Figure 5. Sequel Pro screenshot.
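As a command-line alternative to Sequel Pro for step 4, you can query the table directly through the proxy connection. The user name below is a placeholder; the database, table, and column names are taken from the steps above.

```bash
# check the five most recent rows of the cmu_probabilites table
# (requires the Cloud SQL proxy from step 1 to still be running)
mysql -h 127.0.0.1 -P 3306 -u <your-db-user> -p shellcast \
  -e "SELECT * FROM cmu_probabilites ORDER BY created DESC LIMIT 5;"
```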
Each day the shellcast_daily_analysis.sh script, which is called in the launchd plist file, will run the following R and Python scripts in the order noted below (a simplified sketch follows the list). For a description of each script see the script description section above.
1. `ndfd_get_forecast_data_script.py`
2. `ndfd_convert_df_to_raster_script.R`
3. `ndfd_analyze_forecast_data_script.R`
4. `gcp_update_mysqldb_script.py`
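For orientation, a stripped-down sketch of what shellcast_daily_analysis.sh does is shown below. The real script lives in the repo; the exact paths and output file names here are assumptions based on the run order and output descriptions in this document.

```bash
#!/bin/bash
# simplified sketch of shellcast_daily_analysis.sh -- see the repo for the real script
source ./config.sh                      # local machine paths (see config.sh setup above)

DATE=$(date +%Y%m%d)
OUT=data/tabular/outputs/terminal_data  # per-script output logs (see Description of CRON Job Outputs)

# run the four daily scripts in order, capturing each script's terminal output
python ndfd_get_forecast_data_script.py     > "$OUT/01_get_forecast_output_${DATE}.text" 2>&1
Rscript ndfd_convert_df_to_raster_script.R  > "$OUT/02_convert_df_out_${DATE}.text" 2>&1
Rscript ndfd_analyze_forecast_data_script.R > "$OUT/03_analyze_out_${DATE}.text" 2>&1
python gcp_update_mysqldb_script.py         > "$OUT/04_update_db_out_${DATE}.text" 2>&1
```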
The CRON job .err and .out files are exported to the folder specified in the .plist file. A second set of outputs (one for each of the four analysis scripts) is saved to the output path location noted in the config.sh file. We recommend the .err and .out files be exported to shellcast > analysis > data > tabular > outputs > cronjob_data. We recommend the config.sh outputs be exported to shellcast > analysis > data > tabular > outputs > terminal_data.
The .err file includes appended error messages for each run of the analysis. The .out file includes all messages that would have been printed in the terminal. Like the .err file, the .out file outputs are appended after each run of the analysis, which can be hard to sort through.
To make it easier to see issues with the different scripts, each run of the analysis will generate four files in the terminal_data directory: `01_get_forecast_output_DATE.text`, `02_convert_df_out_DATE.text`, `03_analyze_out_DATE.text`, and `04_update_db_out_DATE.text`. Each corresponds to the outputs that would have been printed in the terminal for each of the four scripts described in "CRON Job Script Run Order" above.
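A quick way to inspect the newest of these files from the terminal (the repo path is a placeholder):

```bash
# from the terminal_data directory, list the most recent per-script outputs...
cd <path-to>/shellcast/analysis/data/tabular/outputs/terminal_data
ls -t *.text | head -4

# ...and peek at the end of the latest analysis log, for example
tail -n 20 "$(ls -t 03_analyze_out_*.text | head -1)"
```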
When appropriate, changes need to be pushed to the NCSU Enterprise GitHub repository as well as the (public) GitHub repository, as described in the DEVELOPER.md documentation.
Until we're able to get REST API access from the North Carolina Division of Marine Fisheries (NCDMF), we'll have to manually update the leases. We've chatted with Teri Dane and Mike Griffin of NCDMF about this and they've agreed to give us updates quarterly.
To manually update the leases in the ShellCast SQL database, follow these steps.
1. Download the lease .shp file from NCDMF to your local machine and save it (and all the associated .shp files) in the analysis > data > spatial > outputs > ncdmf_data > lease_bounds_raw directory as `lease_bounds_raw.shp`.
2. Run the `ncdmf_tidy_lease_data_script.R` script either in the command line or in RStudio (see the sketch after this list). This script will generate `lease_centroids_albers.shp` and `lease_bounds_albers.shp` in the shellcast repository. That is, these files will be exported to the `lease_centroids` and `lease_bounds` directories, respectively, within the analysis > data > spatial > outputs > ncdmf_data directory.
3. The next day, the `gcp_update_mysqldb_script.py` script will check to see if there are new leases to be added to the SQL database. It will update them if there are, and all other downstream analyses will run normally with the newly updated leases.
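For step 2, running the tidy script from the command line looks like this (the repo path is a placeholder):

```bash
# run the lease tidying script from the analysis directory
cd <path-to>/shellcast/analysis
Rscript ncdmf_tidy_lease_data_script.R
```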
If you have any questions, feedback, or suggestions please submit issues through the NCSU Enterprise GitHub or through GitHub (public). You can also reach out to Sheila Saia (ssaia at ncsu dot edu) or Natalie Nelson (nnelson4 at ncsu dot edu).