This repo demonstrates dataset creation. The US Power Plants NAIP/Landsat8 Dataset covers power plants on the US mainland, providing both high-resolution (1 m) and medium-resolution (15 m) imagery for detection/segmentation tasks. Data sources:
- NAIP for high-resolution imagery
- Landsat8 for medium-resolution imagery
- EPA eGRID documents for latitude and longitude locations
Items in italics are the primary ingredients for dataset construction; items in bold are, respectively, images, labels, metadata, and sample code for image segmentation (pixel-wise classification). There are also some useful scripts for making and testing this dataset. Starred* items are the outputs to expect as the results of dataset construction.
- /uspp_naip: high-resolution power plant images (~1115x1115 px, ~5 MB each), used for gathering annotations;
- /uspp_landsat: medium-resolution power plant images (~75x75 px, ~70 KB each), to be used for classification;
- /annotations*: confidence and binary masks denoting the outlines of power plants. Meaning of pixel values in sub-folders:
a) accepted_ann_json.txt: accepted annotations collected from Amazon Mechanical Turk, in JSON text;
  b) /confidence: confidence maps; at each pixel, the value equals the number of annotators who labeled it as part of a power plant;
  c) /binary: binary masks; each pixel denotes whether more than half of all annotators agree that it is part of a power plant;
- /exceptions*: instances with no valid annotations (most likely no visible power plant, or, in rare cases, all three annotations were rejected);
- uspp_metadata.geojson*: geographic location, unique eGRID ID, plant name, state and county name, primary fuel, fossil-fuel category, capacity factor, nameplate capacity, and CO2 emission data. Visualization available here: https://github.com/bl166/USPowerPlantDataset/blob/master/uspp_metadata.geojson
- egrid2014_data_v2_PLNT14.xlsx: a subset of the eGRID document, containing US power plant locations and other information;
- cropPowerPlants.py: exports satellite imagery from Google Earth Engine;
- fixLs.m: preprocesses the Landsat imagery, including intensity stretching and gamma correction;
- getAllAcceptedCondensed.py: generates a condensed annotation file from all accepted annotations, one line per image (NOTE: this script is NOT runnable unless you have all accepted annotations, but not to worry: we have provided its output as accepted_ann_json.txt);
- make.py: constructs the dataset;
- report.py: generates a pie chart showing the data categorized by fuel type;
- classify_sample.py: tests a simple segmentation task (pixel-wise classification) on this dataset.
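As a minimal sketch of how the /binary masks relate to the /confidence maps (each image has three annotators, per the note on rejected annotations below; `binarize_confidence` is a hypothetical helper, not part of the repo):

```python
import numpy as np

def binarize_confidence(conf, n_annotators=3):
    """Majority-vote binarization of a confidence map.

    `conf` holds, at each pixel, the number of annotators who marked it as
    part of a power plant. A pixel becomes foreground (255) when more than
    half of all annotators agree, matching the /binary masks described above.
    """
    return np.where(conf > n_annotators / 2.0, 255, 0).astype(np.uint8)

# Toy 3x3 confidence map from 3 annotators:
conf = np.array([[0, 1, 2],
                 [1, 2, 3],
                 [3, 3, 0]], dtype=np.uint8)
binary = binarize_confidence(conf)
# pixels with more than 1.5 votes (i.e. 2 or 3 annotators) become 255
```
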
This dataset was constructed in three phases (plus an optional fourth):
- P1DATAPREP (data preparation): download satellite imagery;
- P2ANNOGEN (annotation generation): gather annotations of power plants;
- P3DATAPROC (dataset processing): merge accepted annotations, create binary labels, and compile metadata;
- P4TESTCLSFR (optional, test classifier): image segmentation by pixel-based classification.
Python 2.X (for exporting data)
- Python API for Google Earth Engine
- Packages: ee, numpy, xlrd
https://github.com/bl166/USPowerPlantDataset/blob/master/P1DATAPREP_cropPowerPlants.py
- 1. Sign up for Google Earth Engine. To export data you must sign up as a developer.
- 2. Install the Python API, following the instructions in the link.
- 3. In cropPowerPlants.py, on lines 100 and 101, define your index range, which collection to export from, and the order in which exporting should take place:
```python
if __name__ == '__main__':
    id_start, id_end = (300, 500)  # include id_start, exclude id_end
    download_ppt_pic(id_start, id_end, order='descend', collection='naip')
```
- 4. Run the script. In the Google Earth Engine code editor, right column -> Tasks: you can monitor the export tasks there.
```shell
# activate whatever environment you may have installed for running the earthengine
$ source activate YOUR_ENVS
$ python P1DATAPREP_cropPowerPlants.py
```
After the exporting finishes, the cropped images should be in your Google Drive. (Keep an eye on your storage; download and clear it up regularly. Tasks will fail if there is not enough space in your Drive.)
- 5. Download the images into /uspp_naip and /uspp_landsat, named EXACTLY as such.
- Input (in the dataset root dir - same directory as the script): egrid2014_data_v2_PLNT14.xlsx
- Output (to the dataset root dir): /uspp_naip and/or /uspp_landsat
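The index-range convention used by cropPowerPlants.py (id_start included, id_end excluded, selectable export order) can be sketched as follows; `export_order` is a hypothetical helper for illustration, not a function in the repo:

```python
def export_order(id_start, id_end, order='descend'):
    """Indices of the plants to export.

    Mirrors the convention in cropPowerPlants.py: id_start is included,
    id_end is excluded, and the exporting order is selectable
    ('ascend' or 'descend').
    """
    ids = list(range(id_start, id_end))
    return ids[::-1] if order == 'descend' else ids

# The (300, 500) range from the script's __main__ block, descending:
batch = export_order(300, 500, order='descend')
```
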
See MTurkAnnotationTool: https://github.com/tn74/MTurkAnnotationTool.
NOTE: To try this section yourself, please remove all four folders and the geojson file. Download the raw data here. Extract all items into this repo, then follow the steps below:
Python 3.X
- Packages: os, sys, json, numpy, PIL, xlrd
https://github.com/bl166/USPowerPlantDataset/blob/master/P3DATAPROC_make.py
- 1. Items you should already have before running the script:
  - /uspp_naip: NAIP data, with unprocessed images named ID.tif;
  - egrid2014_data_v2_PLNT14.xlsx: the original metadata from which we read locations and cropped the power plants out;
  - accepted_ann_json.txt: annotations from the MTurkers.
  Note: The items above should be directly under the root directory and named exactly as quoted; otherwise the construction will fail.
- (Optional, but strongly recommended!) /uspp_landsat: Landsat8 data, with unprocessed images named ID.tif.
  NOTE: If you do not have this folder, annotations will still be generated.
- 2. Preprocess the Landsat8 data. You can do this after the dataset is constructed, but we recommend doing it beforehand.

```shell
$ matlab -nodisplay -r fixLs
```
- 3. Run the script. While the program is running, you can expect messages showing the current status.

```shell
$ python P3DATAPROC_make.py
```
- 4. Outputs. After the program finishes, you should find the following items created or changed in the root directory:
  - A new folder /annotations, in which /confidence has all annotated polygons converted into binary polygon masks and summed up, and /binary has the confidence masks binarized by majority voting.
    NOTE: The binary values are 0 and 255, so you should normalize them to 0 and 1 in actual use.
  - Images in /uspp_naip (and /uspp_landsat, if applicable) that can be matched to accepted_ann_json.txt are renamed (if the annotation marks a valid power plant) or moved to /exceptions (if the annotations contain only empty content).
    NOTE (new naming convention): DataType_egridUniqueID_State_Type.tif
  - Finally, a new file named uspp_metadata.geojson is generated, containing the metadata of all annotated power plants.
- 5. If the process is interrupted, you can simply re-run it: it resumes where it stopped. Images that have already been processed will NOT be revisited, and new power plants will be appended to the end of the metadata.
- Input: /uspp_naip, accepted_ann_json.txt, /uspp_landsat (optional), uspp_metadata.geojson (optional)
- Output: /annotations, /exceptions, uspp_metadata.geojson
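A minimal sketch of how a consumer of the dataset might handle the two conventions above: the DataType_egridUniqueID_State_Type.tif naming, and normalizing the 0/255 binary masks to 0/1. The helper names and the example filename are hypothetical:

```python
import numpy as np

def parse_name(filename):
    """Split a renamed image file (DataType_egridUniqueID_State_Type.tif,
    the naming convention noted above) into its four fields."""
    stem = filename.rsplit('.', 1)[0]
    data_type, egrid_id, state, fuel_type = stem.split('_')
    return {'data': data_type, 'egrid_id': egrid_id,
            'state': state, 'type': fuel_type}

def normalize_mask(mask):
    """Map a 0/255 mask from /annotations/binary to 0/1 labels."""
    return (np.asarray(mask) > 127).astype(np.uint8)

info = parse_name('NAIP_10866_NC_GAS.tif')   # hypothetical example name
labels = normalize_mask(np.array([[0, 255], [255, 0]]))
```
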
Python 3.X
- Packages: sklearn, matplotlib, scipy, PIL, json, re, os, sys
This code is designed for pixel-based image segmentation. It looks at the window centered at each pixel and decides whether or not this pixel belongs to the object of interest.
https://github.com/bl166/USPowerPlantDataset/blob/master/P4TESTCLSFR_classify_sample.py
```shell
$ python P4TESTCLSFR_classify_sample.py
```
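The window-centered setup can be sketched as below. This is a toy illustration on synthetic data with a logistic-regression classifier, not the actual P4TESTCLSFR_classify_sample.py:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_windows(image, labels, win=5):
    """Collect one flattened (win x win) patch per interior pixel, paired
    with that pixel's 0/1 label -- the window-centered setup described
    above. `win` must be odd."""
    r = win // 2
    X, y = [], []
    for i in range(r, image.shape[0] - r):
        for j in range(r, image.shape[1] - r):
            X.append(image[i - r:i + r + 1, j - r:j + r + 1].ravel())
            y.append(labels[i, j])
    return np.array(X), np.array(y)

# Synthetic single-band image: a bright square on a dark background.
rng = np.random.default_rng(0)
img = rng.normal(0.1, 0.05, (32, 32))
img[10:22, 10:22] += 0.8
mask = np.zeros((32, 32), dtype=np.uint8)
mask[10:22, 10:22] = 1

X, y = extract_windows(img, mask)
clf = LogisticRegression(max_iter=1000).fit(X, y)
acc = (clf.predict(X) == y).mean()
```

In practice you would train on windows from the NAIP/Landsat8 images with the /annotations/binary masks (normalized to 0/1) as labels, and evaluate on held-out plants rather than the training pixels.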
- Ben Brigman
- Gouttham Chandrasekar
- Shamikh Hossain
- Boning Li
- Trishul Nagenalli
Project: Detecting Electricity Access from Aerial Imagery, Duke Data+ 2017