REPOSITORY CITATION
Mirzaee, Ashkan. A Python API for accessing Forest Inventory and Analysis (FIA) database in parallel (2021). https://doi.org/10.6084/m9.figshare.14687547
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Source code is available under a GNU General Public License.
- Ashkan Mirzaee: amirzaee@mail.missouri.edu
The workflow in this repository is designed to run in both parallel and serial modes. To run the application in parallel, you need a Linux-based cluster with the Slurm job scheduling system. As a student, faculty member, or researcher, you may have access to your institution's cluster by using

```
ssh username@server.domain
```

on a Unix shell (the default terminal on Linux and macOS computers) or an SSH client. If you do not have access to a cluster, the workflow can run in serial on a personal computer's Unix shell. Basic knowledge of the Unix shell and HPC will make the workflow easier to follow.
The Forest Inventory and Analysis (FIA) Program of the US Forest Service provides the information needed to assess America's forests. Many researchers rely on forest attribute estimates from the FIA program to evaluate forest conditions. The US Forest Service provides multiple ways to use the FIA database (FIADB), including EVALIDator and the FIA Data Mart. The FIA Data Mart allows users to download raw data files and SQLite databases. Among the available formats, only the SQLite database is a practical option for a large number of queries. To use it, users must download the entire database (about 10 GB) and set up a local SQLite server. Besides the complexity of the query commands, the local database grows quickly as more queries are run. Moreover, the Forest Service updates FIADB regularly, so users must update the local server periodically to access new releases.
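For a sense of what the SQLite route involves, a local query looks roughly like the sketch below; the database file name and the PLOT table schema are assumptions, since both vary between FIADB releases:

```python
# A minimal sketch of a local query against the downloaded FIADB SQLite
# database. The file name and the PLOT/STATECD schema are assumptions
# based on typical FIADB releases.
import sqlite3

conn = sqlite3.connect("SQLite_FIADB_ENTIRE.db")   # the ~10 GB download
cursor = conn.execute(
    "SELECT STATECD, COUNT(*) FROM PLOT GROUP BY STATECD"
)
for state_code, n_plots in cursor:
    print(state_code, n_plots)                     # plot counts per state
conn.close()
```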
EVALIDator, on the other hand, relies on HTML queries and lets users produce a large variety of population estimates through a web application with minimal difficulty. However, EVALIDator runs a single query at a time, which makes the web application unsuitable for collecting large amounts of FIA data. The Python API uses the FIADB API to bypass EVALIDator's GUI and collect JSON queries from FIADB. It uses Slurm jobs to communicate with FIADB in parallel and collect a large number of queries at a time.
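Conceptually, each worker issues one HTTP request and parses the JSON response. The sketch below is illustrative only; the endpoint URL and parameter names are assumptions, not the repository's exact request format:

```python
# Illustrative single JSON query against the FIADB/EVALIDator REST
# service. BASE_URL and the parameter names are assumptions, not the
# repository's actual request format.
import requests

BASE_URL = "https://apps.fs.usda.gov/Evalidator/rest/Evalidator/fullreport"

params = {
    "snum": 7,               # forest attribute code (see Appendix O)
    "wc": 12017,             # evaluation code (assumed: state FIPS + year)
    "outputFormat": "JSON",  # ask for JSON instead of an HTML page
}
response = requests.get(BASE_URL, params=params, timeout=60)
response.raise_for_status()
estimate = response.json()   # parsed JSON estimate, ready to store
```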
In this project we used Python and the Slurm workload manager to generate numerous parallel workers and distribute them across the cluster. The API is designed to scale the query process such that, by increasing the number of processing elements (PEs), the process is expected to speed up linearly (see the sketch after the feature list below). The API can be configured to run on a single-core computer or on a cluster for any given year, state, coordinate, and forest attribute. It can also search for the closest available FIA survey year for each query, providing real-time access to the most recent FIADB releases. After collecting the requested information, outputs are stored in CSV and JSON files that can be used by other scientific software. The Python API's features include:
- Scalable. It can run on a single-core computer or on a cluster.
- Configurable. It can run for any given year, state, coordinate, and forest attribute.
- Real-time access. Access to the most recent releases of a large variety of population estimates from FIA. The API can search for the closest available FIA survey year.
- Explanatory reports. Generates reports that show failed queries and warnings.
- Job completion. Failed jobs can be resubmitted to collect only the missing information.
- JSON queries. These significantly reduce the size of the information collected from the FIADB.
- CSV and JSON outputs. These can be used by other scientific software.
- Easy to work with. The API is designed to collect queries in parallel with minimal difficulty.
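The repository fans queries out with Slurm jobs; as a rough single-machine analogue of the same design, a process pool can map a query function over the (year, state, attribute) combinations. `fetch_query` below is a hypothetical stand-in for one FIADB request:

```python
# Single-machine analogue of the API's fan-out: the repository itself
# distributes workers with Slurm, but the same pattern can be sketched
# with a process pool.
from multiprocessing import Pool

def fetch_query(task):
    year, state, attribute_cd = task
    # ... issue one JSON query to FIADB here ...
    return task, "ok"

# One task per (year, state, attribute) combination from the config.
tasks = [(y, s, a)
         for y in (2017, 2019)
         for s in ("AL", "LA")
         for a in (7, 8)]

if __name__ == "__main__":
    # More processing elements -> near-linear speedup, up to server limits.
    with Pool(processes=12) as pool:
        results = pool.map(fetch_query, tasks)
```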
The following shows the configurable items and their template values:
```json
{
  "year": [2017, 2019],
  "state": ["AL", "LA"],
  "attribute_cd": [7, 8],
  "tolerance": 1,
  "job_number_max": 12,
  "job_time_hr": 8,
  "partition": "General",
  "query_type": ["state", "county", "coordinate"]
}
```
The metadata file `config.json` includes:

- year: list of years. For a single year use a singleton list, e.g. `[2017]`
- state: list of states. Use `["ALL"]` to include all states in the data. For a single state use a singleton list, e.g. `["MO"]`
- attribute_cd: list of FIA forest attribute codes (find them in Appendix O of the FIA User Guide). For a single code use a singleton list, e.g. `[7]`
- tolerance: a binary variable, 0 or 1. Set it to 1 to use the closest available FIA survey year when a listed year is not available
- job_number_max: maximum number of jobs that can be submitted at the same time. Use 1 for running in serial
- job_time_hr: estimated hours that each job might take (equivalent to the `--time` Slurm option). No change is required for running in serial
- partition: name of the selected partition on the cluster (equivalent to the `--partition` Slurm option). No change is required for running in serial
- query_type: type of the FIA query, i.e. state, county, and/or coordinate
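As a concrete illustration, the sketch below expands a `config.json` like the template above into individual query tasks and shows the kind of closest-year fallback that `tolerance` implies; the function and variable names are illustrative, not the repository's actual code:

```python
# Illustrative expansion of config.json into query tasks, plus a
# closest-year fallback of the kind tolerance = 1 implies.
import json
from itertools import product

with open("config.json") as f:
    cfg = json.load(f)

# One task per (year, state, attribute) combination.
tasks = list(product(cfg["year"], cfg["state"], cfg["attribute_cd"]))

def closest_year(requested, available):
    """Fall back to the nearest available FIA survey year."""
    return min(available, key=lambda y: abs(y - requested))

print(len(tasks))                               # 2 years x 2 states x 2 codes = 8
print(closest_year(2017, [2015, 2018, 2020]))   # -> 2018
```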
The workflow includes the following steps:
- Set up the environment
- Download the required data with Bash
- Download the FIA dataset with Bash, Python, and Slurm
- Generate outputs and reports with Python
You can find the workflow in `batch_file.sh`. After updating `config.json`, run the following to generate outputs:
```bash
## With Slurm
sbatch batch_file.sh

## In serial (without Slurm)
. ./batch_file.sh
```
When the jobs are done, see the report of the submitted jobs in `report-<query-type>-*.out` and use `python test_db.py` to extract the collected information for a coordinate, state, or county. The application generates the report when the jobs finish, but if the report is not created because of job failures, use the following to generate it:
```bash
## With Slurm
. ./environment.sh
sbatch report-<query-type>.sh `cat time_<query-type>`

## In serial (without Slurm)
. ./environment.sh
. ./report-<query-type>.sh > ./report-<query-type>-serial.out
```
You can find warnings and failed jobs in:

```
./job-out-<query-type>/warning.txt
./job-out-<query-type>/failed.txt
```
The collected databases are available under the `./outputs` directory in JSON and CSV formats:

```
./outputs/<query-type>-panel-<date-time>.csv
./outputs/<query-type>-<date-time>.csv
./outputs/<query-type>-<date-time>.json
```
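For example, the CSV outputs can be post-processed with pandas; this is a hypothetical snippet assuming a state query type:

```python
# Load the most recent state-level CSV output with pandas; a hypothetical
# post-processing step, assuming query_type is "state".
import glob
import pandas as pd

csv_files = sorted(p for p in glob.glob("./outputs/state-*.csv")
                   if "panel" not in p)
df = pd.read_csv(csv_files[-1])   # most recent run
print(df.head())
```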
Note that job failures can be related to:
- Unavailability of FIADB (check FIA alerts at https://www.fia.fs.fed.us/tools-data/)
- Download failure
- Slurm job failure
- Invalid configs (review `config.json`)
- Invalid coordinates (review `coordinate.csv`)
If the failures are related to FIA servers, downloading JSON files, or Slurm jobs, consider resubmitting the failed jobs by running:

```bash
. ./rebatch_file.sh
```

Otherwise, modify the config file and/or the input files and resubmit `batch_file.sh`.