Repo for DKSF Project with Safecast
- Data Ambassadors: Edwin Zhang (edwin.james.zhang@gmail.com) & others
- Data Exploration: We first analyzed the Solarcast data. For information about the data source and known issues with the data, see `exploration_notebooks/Issues_with_SolarcastDevice_data.md`. We've accommodated these known data discrepancies through our cleaning protocol (next bullet).
- Data Cleansing Protocol: See `exploration_notebooks/Solarcast_data_cleansing.md`
- Data Cleansing Protocol Base Code: `data_cleaner.py`
- Automated Anomaly Detection: See `anomaly_detector.py`

Here is a step-by-step guide to running the `data_cleaner.py` and `anomaly_detector.py` scripts:
- (Recommended, optional) Create a new Python virtual environment for maximum reproducibility. I recommend `conda create -n anomaly_detection python=3.8`; virtualenv is another option if you don't have Anaconda installed. Then activate it: `conda activate anomaly_detection`
- Clone the most recent version of the repo with `git clone https://github.com/sakshamg94/safecast-unsupervised-anomaly-detection.git` and navigate to the directory.
- Install the requirements using the new requirements.txt file: `pip install -r requirements.txt`. At this point we should be set up to run the `data_cleaner.py` and `anomaly_detector.py` scripts in the next two steps.
- `data_cleaner` takes two inputs: `start_yyyymm`, a string in YYYY-MM format giving the earliest month for which data is available (e.g., 2017-09), and `end_yyyymm`, the latest month for which data is available (e.g., 2020-07). It assumes that the raw data files are in the folder `raw_data` and are named like this example: `output-2017-07-01T00_00_00+00_00`. To run `data_cleaner`, use this command: `python3 -m data_cleaner 2017-09 2020-07` (feel free to substitute your desired YYYY-MM dates). You should now see individual cleaned files in the `processed_data` folder, as well as a file named `Solarcast-01_Main_Cleaned_Sorted.csv`. This last file is the one used by `anomaly_detector`.
- `anomaly_detector` takes two optional inputs: `devices`, a list of device numbers separated by spaces, and `anomaly_types`, a list of anomaly types to check for. One example command you might run: `python anomaly_detector.py --devices 1660294046 3373827677 --anomaly_types negativeFields rollingMedianDev`. By default, the script checks all devices for the anomaly types `negativeFields`, `rollingMedianDev`, and `rollingStdevDev`; `dataContinuityCheck`, `PMorderingCheck`, and `nightDayDisparity` are also available. To use these defaults, just run `python anomaly_detector.py` without specifying devices or anomaly_types. Running this script should create the output `Anomalies.csv` in the `anomaly_data` directory, which contains information on the anomaly type, affected data field, time of capture, device, and a normalized severity score from 0 to 1. You'll need to create the folder `final_anomaly_data` to write the final .csv.
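
To make the two optional inputs of `anomaly_detector` concrete, here is a minimal sketch of how such flags could be parsed with `argparse`. It is not the repo's actual parser: the function name `build_parser`, the use of `None` to mean "all devices", and parsing device numbers as integers are all assumptions made for illustration. Only the flag names and anomaly type names come from the description above.

```python
import argparse

# Anomaly type names as listed in this README.
DEFAULT_ANOMALY_TYPES = ["negativeFields", "rollingMedianDev", "rollingStdevDev"]
OPTIONAL_ANOMALY_TYPES = ["dataContinuityCheck", "PMorderingCheck", "nightDayDisparity"]


def build_parser():
    """Hypothetical parser mirroring the flags described above (not the repo's code)."""
    parser = argparse.ArgumentParser(description="Sketch of the anomaly_detector CLI")
    # Assumption: leaving --devices out (None) means "check all devices",
    # and device numbers are parsed as integers.
    parser.add_argument("--devices", nargs="*", type=int, default=None,
                        help="device numbers separated by spaces")
    # Each value passed to --anomaly_types must be one of the known names.
    parser.add_argument("--anomaly_types", nargs="*",
                        choices=DEFAULT_ANOMALY_TYPES + OPTIONAL_ANOMALY_TYPES,
                        default=DEFAULT_ANOMALY_TYPES,
                        help="anomaly types to check for")
    return parser


# Equivalent to the example command shown above.
args = build_parser().parse_args(
    ["--devices", "1660294046", "3373827677",
     "--anomaly_types", "negativeFields", "rollingMedianDev"]
)
print(args.devices, args.anomaly_types)

# Equivalent to running the script with no arguments.
defaults = build_parser().parse_args([])
print(defaults.devices, defaults.anomaly_types)
```

Restricting `--anomaly_types` with `choices` means a misspelled anomaly type fails at parse time rather than being silently ignored; whether the real script behaves this way is not confirmed here.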
Summarizing the commands from above:
conda create -n anomaly_detection python=3.8
conda activate anomaly_detection
git clone https://github.com/sakshamg94/safecast-unsupervised-anomaly-detection.git
cd safecast-unsupervised-anomaly-detection
pip install -r requirements.txt
python -m data_cleaner 2017-09 2020-07
mkdir final_anomaly_data
python anomaly_detector.py
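
Before running the commands above, it can help to confirm that `raw_data` holds a file for every month in the requested range. The sketch below is a hypothetical pre-flight check, not part of the repo. It assumes one raw file per month, stamped at midnight UTC on the first of the month as in the example name `output-2017-07-01T00_00_00+00_00`; the helper names `month_range`, `expected_raw_name`, and `missing_raw_files` are invented for illustration.

```python
from pathlib import Path
import tempfile


def month_range(start_yyyymm, end_yyyymm):
    """Return every YYYY-MM string from start to end, inclusive."""
    start_year, start_month = map(int, start_yyyymm.split("-"))
    end_year, end_month = map(int, end_yyyymm.split("-"))
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def expected_raw_name(yyyymm):
    # Assumption: one raw file per month, named after the README example.
    return f"output-{yyyymm}-01T00_00_00+00_00"


def missing_raw_files(raw_dir, start_yyyymm, end_yyyymm):
    """List expected raw file names that have no match in raw_dir."""
    raw_dir = Path(raw_dir)
    missing = []
    for yyyymm in month_range(start_yyyymm, end_yyyymm):
        name = expected_raw_name(yyyymm)
        # Match on the name prefix so any file extension is accepted.
        if not any(raw_dir.glob(name + "*")):
            missing.append(name)
    return missing


# Demo in a throwaway directory so nothing in the real repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_data"
    raw.mkdir()
    (raw / "output-2017-09-01T00_00_00+00_00").touch()
    print(missing_raw_files(raw, "2017-09", "2017-11"))
    # Python equivalent of `mkdir final_anomaly_data`; safe to repeat.
    (Path(tmp) / "final_anomaly_data").mkdir(exist_ok=True)
```

In a real checkout you would call `missing_raw_files("raw_data", "2017-09", "2020-07")` from the repo root and expect an empty list before starting the cleaner.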
Example outputs from running the above commands have been included in the `cleaned_data` directory. However, two files you'd expect from the commands, the main file containing all the cleaned data (`cleaned_data/Solarcast-01_Main_Cleaned_Sorted.csv`) and the final anomaly data (`final_anomaly_data/anomalies.csv`), have been omitted due to repo size constraints.
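
Once you have regenerated the anomalies file locally, a short script can rank rows by the normalized severity score. The sketch below uses only the standard library and runs on made-up sample rows: the column names (`anomaly_type`, `field`, `captured_at`, `device`, `severity`), the field values, and the timestamps are all hypothetical, so check the header of your own file and adjust `score_column` to match.

```python
import csv
import io

# Made-up rows for illustration only; the header names are hypothetical
# and may not match the real anomalies file.
SAMPLE = """anomaly_type,field,captured_at,device,severity
negativeFields,field_a,2019-03-01T12:00:00Z,1660294046,0.20
rollingMedianDev,field_b,2019-03-02T08:30:00Z,3373827677,0.85
rollingStdevDev,field_a,2019-03-03T09:15:00Z,1660294046,0.60
"""


def most_severe(csv_file, threshold=0.5, score_column="severity"):
    """Return rows with a score at or above threshold, highest score first."""
    reader = csv.DictReader(csv_file)
    rows = [row for row in reader if float(row[score_column]) >= threshold]
    return sorted(rows, key=lambda row: float(row[score_column]), reverse=True)


# io.StringIO stands in for an open file handle on the real CSV.
for row in most_severe(io.StringIO(SAMPLE)):
    print(row["device"], row["anomaly_type"], row["severity"])
```

For a real file, replace `io.StringIO(SAMPLE)` with `open(path, newline="")` inside a `with` block.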
For ease of access, we've provided the data in `raw_data`, which comes from Safecast and has already undergone some basic processing. To download the data and do this basic processing from scratch in the future (if one were updating this analysis for post-July 2020 data, for example), please refer to this Safecast open-source Ruby script. The setup for this Ruby script can be found here, with additional details linked for Windows and Linux/OS-X users. Edwin (contact email address at the top of this file) can also forward any questions to Safecast's development team.