The Creation of US Power Plants NAIP/LANDSAT8 Dataset

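This repo demonstrates how the dataset was created. The US Power Plants NAIP/LANDSAT8 Dataset covers power plants on the US mainland and provides both high-resolution (1m) and medium-resolution (15m) imagery for detection and segmentation tasks. Data sources: NAIP and Landsat 8 imagery exported from Google Earth Engine, with plant locations taken from the eGRID 2014 document.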

1   Data Explanation

Items in italics are the primary ingredients for dataset construction; items in bold are, respectively, the images, labels, metadata, and sample code for image segmentation (pixel-wise classification). There are also some useful scripts for making and testing this dataset. Starred* items are outputs to expect from dataset construction.

  1. /uspp_naip: high-resolution power plant images (~1115x1115 pix, 5M/ea), used for gathering annotations;

  2. /uspp_landsat: medium-resolution power plant images (~75x75 pix, 70K/ea), to be used for classification

  3. /annotations*: confidence and binary masks denoting the outline of power plants. Meaning of pixel values in sub-folders:
    a) accepted_ann_json.txt: accepted annotations collected from Amazon Mechanical Turk, in JSON text;
    b) /confidence: confidence maps, at each pixel the value equals the number of annotators labeling it as part of a power plant
    c) /binary: binary mask with each pixel denoting whether or not more than half of all annotators agree that it is part of a power plant

  4. /exceptions*: instances with no valid annotations (most likely there is no visible power plant; in rare cases, all three annotations were rejected);

  5. uspp_metadata.geojson*: geographic location, unique eGRID ID, plant name, state and county name, primary fuel, fossil fuel category, capacity factor, nameplate capacity, and CO2 emission data;
    Visualization available here: https://github.com/bl166/USPowerPlantDataset/blob/master/uspp_metadata.geojson;

  6. egrid2014_data_v2_PLNT14.xlsx: A subset of the Egrid document which contains US power plant locations and other information;

  7. cropPowerPlants(.py): exports satellite imagery from Google Earth Engine;

  8. fixLs(.m): preprocesses the Landsat imagery, including intensity stretch and gamma correction;

  9. getAllAcceptedCondensed(.py): generates a condensed annotation file from all accepted annotations, with one line per image (NOTE: this script is NOT runnable unless you have all the accepted annotations, but its output is provided as accepted_ann_json.txt);

  10. make(.py): constructs the dataset;

  11. report(.py): generates a pie chart showing the data categorized by fuel type;

  12. classify_sample(.py): tests a simple segmentation task (pixel-wise classification) on this dataset.
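
Item 3 above describes how the two kinds of masks relate: a confidence map counts the annotators who marked each pixel, and the binary mask keeps the pixels that more than half of the annotators marked. The relationship can be sketched as follows. This is an illustration only, not the code in make.py; the function names and the `n_annotators` argument are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np


def binarize_confidence(confidence, n_annotators):
    """Majority vote: 255 where more than half of the annotators agree, else 0."""
    confidence = np.asarray(confidence)
    majority = confidence * 2 > n_annotators  # strictly more than half
    return np.where(majority, 255, 0).astype(np.uint8)


def to_unit_labels(binary_mask):
    """Map a 0/255 mask to 0/1 labels for training a classifier."""
    return (np.asarray(binary_mask) > 0).astype(np.uint8)


# Toy 2x2 confidence map from three annotators (made-up values).
toy_confidence = [[0, 1], [2, 3]]
toy_binary = binarize_confidence(toy_confidence, n_annotators=3)
toy_labels = to_unit_labels(toy_binary)
```

The second helper reflects the fact that the stored binary masks use the values 0 and 255, so they need to be mapped to 0 and 1 before being used as labels.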

2   Dataset Construction

2.0 Overview

This dataset was constructed in three phases, plus an optional fourth for testing:

  • P1DATAPREP (data preparation) - Download satellite imagery;
  • P2ANNOGEN (annotations generation) - Gather annotations of power plants;
  • P3DATAPROC (dataset processing) - Merge accepted annotations, create binary labels, and compile metadata.
  • P4TESTCLSFR (optional, test classifier) - Image segmentation by pixel-based classification.

2.1 Satellite Imagery Download

Dependencies

  • Python 2.X (for exporting data)
  • Python API for Google Earth Engine
  • Packages: ee, numpy, xlrd

Code & documentation

https://github.com/bl166/USPowerPlantDataset/blob/master/P1DATAPREP_cropPowerPlants.py

Steps

  1. Sign up for Google Earth Engine. To export data you must sign up as a developer.
  2. Install the Python API. Follow instructions in the link.
  3. In cropPowerPlants.py, on lines 100 and 101, define the range of plant indices to export, the collection to export from, and the order in which the exports run.
if __name__ == '__main__':
	id_start,id_end = (300,500) # include id_start, exclude id_end
	download_ppt_pic(id_start,id_end,order='descend',collection='naip')
  4. Run the script. You can monitor the export tasks in the Tasks tab, in the right-hand panel of the Google Earth Engine Code Editor.
# activate whatever environment you may have installed for running the earthengine
$ source activate YOUR_ENVS
$ python P1DATAPREP_cropPowerPlants.py

After the export finishes, the cropped images should be in your Google Drive. Keep an eye on your storage, and download and clear it regularly: tasks will fail if there is not enough space in your Drive.

  5. Download the images into folders named EXACTLY /uspp_naip and /uspp_landsat.

Summary

  • Input (in the dataset root dir - same directory as the script): egrid2014_data_v2_PLNT14.xlsx
  • Output (to the dataset root dir): /uspp_naip and/or /uspp_landsat
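
Because the index range in step 3 includes id_start and excludes id_end, a long export can be split into consecutive batches that neither overlap nor skip a plant. Small batches also make it easier to download and clear Google Drive between runs. A minimal sketch, assuming a hypothetical helper `batch_ranges` that is not part of this repo:

```python
def batch_ranges(start, stop, size):
    """Split [start, stop) into consecutive half-open (id_start, id_end) pairs."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]


# Each pair could be passed as (id_start, id_end) to one export run.
example_batches = batch_ranges(300, 500, 80)
```

The batch size of 80 is an arbitrary example; choose one that fits your available Drive storage.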

2.2 Gather Annotations

See MTurkAnnotationTool: https://github.com/tn74/MTurkAnnotationTool.

2.3 Binary Labels Creation

NOTE: To try this section yourself, first remove all four folders and the geojson file. Download the raw data here and extract all items into this repo. Then follow the steps below:

Dependencies

  • Python 3.X
  • Packages: os, sys, json, numpy, PIL, xlrd

Code & documentation

https://github.com/bl166/USPowerPlantDataset/blob/master/P3DATAPROC_make.py

Steps

  1. Items that you should already have before running the script:
    • /uspp_naip: NAIP data, with unprocessed images named ID.tif;
    • egrid2014_data_v2_PLNT14.xlsx: the original metadata, from which we read the locations and cropped the power plants out;
    • accepted_ann_json.txt: annotations from the MTurkers;
    • (Optional, but strongly recommended!) /uspp_landsat: Landsat8 data, with unprocessed images named ID.tif.

NOTE: The required items above should be directly under the root directory and named exactly as quoted; otherwise the construction will fail. If you do not have /uspp_landsat, annotations will still be generated.

  2. Preprocess the Landsat8 data.
    You can do this after the dataset is constructed, but we recommend that you do it beforehand.
$ matlab -nodisplay -r fixLs
  3. Run the script.
    While the program is running, it prints messages showing the current status.
$ python P3DATAPROC_make.py
  4. Outputs:
    After the program finishes, you should find the following items added or changed in the root directory:
    • A new folder called /annotations, in which
      • /confidence has all annotated polygons converted into binary polygon masks and summed;
      • /binary has the confidence masks binarized by majority voting.

NOTE: The binary values are 0 and 255, so you should normalize them to 0 and 1 in practice.

    • Images in uspp_naip (and uspp_landsat, if applicable) that match an entry in accepted_ann_json.txt are renamed (if the annotation marks a valid power plant) or moved to /exceptions (if the annotations are empty).

NOTE (new naming convention): DataType_egridUniqueID_State_Type.tif

    • Finally, a new file named uspp_metadata.geojson is generated. It contains the metadata of all annotated power plants.

  5. If the process is interrupted, you can re-run it and it will pick up where it stopped.
    Images that have already been processed will NOT be revisited; new power plants will be appended to the end of the metadata.
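
Once images have been renamed to the DataType_egridUniqueID_State_Type.tif convention, the fields can be recovered from a file name. The sketch below is a hedged illustration: `parse_image_name` and the `PlantImageName` fields are hypothetical names, the example file name is made up, and the code assumes that only the last field (Type) may itself contain underscores.

```python
import os
from collections import namedtuple

# Field names are illustrative; they mirror the four parts of the convention.
PlantImageName = namedtuple("PlantImageName", "data_type egrid_id state plant_type")


def parse_image_name(path):
    """Split 'DataType_egridUniqueID_State_Type.tif' into its four fields."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() != ".tif":
        raise ValueError("expected a .tif file: %r" % path)
    parts = stem.split("_", 3)  # at most four fields; Type keeps any remainder
    if len(parts) != 4:
        raise ValueError("name does not follow the convention: %r" % path)
    return PlantImageName(*parts)


# Made-up example name, for illustration only.
example = parse_image_name("uspp_naip/naip_123_NC_GAS.tif")
```

Files that still carry the unprocessed ID.tif name raise a ValueError, which is a simple way to tell processed images from unprocessed ones.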

Summary

  • Input: /uspp_naip, egrid2014_data_v2_PLNT14.xlsx, accepted_ann_json.txt, /uspp_landsat (optional), uspp_metadata.geojson (optional)
  • Output: /annotations, /exceptions, uspp_metadata.geojson
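
Since construction fails when a required input is missing or misnamed, it can help to check the root directory before running the script. A minimal sketch, using the item names listed in step 1; the helper `missing_inputs` is hypothetical and not part of this repo:

```python
import os

# Required items, as named in step 1 of this section.
REQUIRED_INPUTS = (
    "uspp_naip",
    "egrid2014_data_v2_PLNT14.xlsx",
    "accepted_ann_json.txt",
)


def missing_inputs(root="."):
    """Return the required items that are not directly under `root`."""
    return [name for name in REQUIRED_INPUTS
            if not os.path.exists(os.path.join(root, name))]
```

Calling `missing_inputs()` from the dataset root returns an empty list when everything is in place. The optional /uspp_landsat folder is deliberately left out of the check.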

3   Test the Dataset

Dependencies

  • Python 3.X
  • Packages: sklearn, matplotlib, scipy, PIL, json, re, os, sys

This code is designed for pixel-based image segmentation. It looks at the window centered at each pixel and decides whether or not this pixel belongs to the object of interest.
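
The windowing idea can be sketched with NumPy as follows. This is an illustration of the general technique, not the code in classify_sample.py: `window_features` and `half_width` are hypothetical names, and reflecting the image at its borders is an assumption made so that edge pixels also get a full window.

```python
import numpy as np


def window_features(image, half_width=1):
    """Return one flattened square window per pixel of a 2-D image.

    Row r * cols + c of the result holds the window centered at pixel (r, c).
    """
    image = np.asarray(image)
    padded = np.pad(image, half_width, mode="reflect")  # assumed border handling
    size = 2 * half_width + 1
    rows, cols = image.shape
    features = np.empty((rows * cols, size * size), dtype=image.dtype)
    for r in range(rows):
        for c in range(cols):
            features[r * cols + c] = padded[r:r + size, c:c + size].ravel()
    return features


# Toy 3x3 single-band image (made-up values).
toy_image = np.arange(9).reshape(3, 3)
toy_features = window_features(toy_image, half_width=1)
```

Each row of the result could then be paired with the 0/1 label of its center pixel, taken from the corresponding mask in /annotations/binary, and passed to a classifier such as one from sklearn.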

Code & documentation

https://github.com/bl166/USPowerPlantDataset/blob/master/P4TESTCLSFR_classify_sample.py

$ python P4TESTCLSFR_classify_sample.py

3.1 Cross-validation Results

3.2 Test on Specific Cases

Developers

  • Ben Brigman
  • Gouttham Chandrasekar
  • Shamikh Hossain
  • Boning Li
  • Trishul Nagenalli

Project: Detecting Electricity Access from Aerial Imagery, Duke Data+ 2017