Merge pull request #342 from singularity-energy/development
v0.3.1
grgmiller authored Feb 10, 2024
2 parents 64f9a03 + 31e90bf commit d21db26
Showing 19 changed files with 1,892 additions and 1,712 deletions.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -22,6 +22,6 @@ authors:
identifiers:
- type: doi
value: 'https://zenodo.org/doi/10.5281/zenodo.7062459'
version: 0.3.0
version: 0.3.1
license: MIT
date-released: '2023-12-29'
date-released: '2024-02-13'
1 change: 1 addition & 0 deletions Pipfile
@@ -20,6 +20,7 @@ seaborn = "*"
sqlalchemy = "*"
statsmodels = "*"
coloredlogs = "*"
s3fs = {extras=["boto3"], version="==2023.12.2"}
"catalystcoop.pudl" = {git = "git+https://github.com/singularity-energy/pudl.git@oge_release"}
gridemissions = {git = "git+https://github.com/singularity-energy/gridemissions"}

3,012 changes: 1,541 additions & 1,471 deletions Pipfile.lock

Large diffs are not rendered by default.

21 changes: 18 additions & 3 deletions README.md
@@ -30,19 +30,22 @@ git clone https://github.com/singularity-energy/open-grid-emissions.git
cd open-grid-emissions
pipenv sync
pipenv shell
pip install build
python -m build
pip install .
```

The pipeline can be run as follows:
```bash
cd src
cd src/oge
python data_pipeline.py --year 2022
```
independently of the installation method you chose.

A more detailed walkthrough of these steps can be found below in the "Development Setup" section.

## Data Availability and Release Schedule
The latest release includes data for years 2019-2021 covering the contiguous United States, Alaska, and Hawaii. In future releases, we plan to expand the geographic coverage to additional U.S. territories (dependent on data availability), and to expand the historical coverage of the data.
The latest release includes data for years 2019-2022 covering the contiguous United States, Alaska, and Hawaii. In future releases, we plan to expand the geographic coverage to additional U.S. territories (dependent on data availability), and to expand the historical coverage of the data.

Part of the input data used for the Open Grid Emissions dataset is released by the U.S. Energy Information Administration in the autumn following the end of each year (2022 data was published in September 2023). Each release will include the most recent year of available data as well as updates to all previously available years based on any changes to the OGE methodology. All previous versions of the data will be archived on Zenodo.

@@ -61,6 +64,7 @@ There are many ways that you can contribute!
## Repository Structure
### Modules
- `column_checks`: functions that check that all data outputs have the correct column names
- `constants`: specifies conversion factors and constants used across all modules
- `data_pipeline`: main script for running the data pipeline from start to finish
- `download_data`: functions that download data from the internet
- `data_cleaning`: functions that clean loaded data
@@ -84,13 +88,24 @@ Notebooks are organized into five directories based on their purpose
- `work_in_progress`: temporary notebooks being used for development purposes on specific branches

### Data Structure
All manual reference tables are stored in `src/oge/reference_tables`.
All manual reference tables are stored in `src/oge/reference_tables`.

All files downloaded/created as part of the pipeline are stored in your HOME directory (e.g. users/user.name/):
- `HOME/open_grid_emissions_data/downloads` contains all files that are downloaded by functions in `load_data`
- `HOME/open_grid_emissions_data/outputs` contains intermediate outputs from the data pipeline, i.e. any files created by our code that are not final results
- `HOME/open_grid_emissions_data/results` contains all final output files that will be published

## Importing OGE as a Package in your Project
OGE is not yet available on PyPI but can be installed from GitHub. For example, if you use `pipenv` for your project, add `oge = {git="https://github.com/singularity-energy/open-grid-emissions.git"}` to your Pipfile.

Note that you don't need to run the pipeline to generate the output data, as these are available on Amazon Simple Storage Service (S3). Simply set the `OGE_DATA_STORE` environment variable to `s3` in the **\_\_init\_\_.py** file of your project to fetch OGE data from Amazon S3.
To summarize, your **\_\_init\_\_.py** file would then look like this:
```python
import os

os.environ["OGE_DATA_STORE"] = "s3"
```
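
The sketch below shows one way the variable could be interpreted, using only the four values that appear in this commit (`s3`/`2` and `local`/`1`). The helper name `resolve_data_store` and the fallback to `local` when the variable is unset are assumptions made for illustration; they are not part of the OGE API.

```python
import os


def resolve_data_store(default: str = "local") -> str:
    """Return "s3" or "local" based on OGE_DATA_STORE (hypothetical helper)."""
    # Assumption: an unset variable falls back to local storage.
    value = os.environ.get("OGE_DATA_STORE", default).strip().lower()
    if value in ("s3", "2"):
        return "s3"
    if value in ("local", "1"):
        return "local"
    raise ValueError(f"Unrecognized OGE_DATA_STORE value: {value!r}")


# Set the variable before OGE is imported, as described above.
os.environ["OGE_DATA_STORE"] = "s3"
print(resolve_data_store())
```

Setting the variable in your project's `__init__.py` means it is in place before any module that reads it is imported.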

## Development Setup
If you would like to run the code on your own computer and/or contribute updates to the code, the following steps can help get you started.

23 changes: 11 additions & 12 deletions docs/docs/Data Validation/Comparing Data to eGRID.md
@@ -9,7 +9,7 @@ Although the OGE methodology is based on the EPA's eGRID methodology, there are
<td>
Method/approach
</td>
<td>eGRID2020
<td>eGRID2022
</td>
<td>Open Grid Emissions
</td>
@@ -56,16 +56,6 @@ Method/approach
<td>CO2 emissions for plants with a fuel cell prime mover are more than 800% higher than eGRID values.
</td>
</tr>
<tr>
<td>NOx emission factor for flared landfill gas
</td>
<td>0.02 lb NOx per MMbtu
</td>
<td>0.078 lb NOx per MMbtu
</td>
<td>Adjusted emissions from LFG will be lower than the eGRID values.
</td>
</tr>
<tr>
<td>CHP electric allocation factor
</td>
@@ -98,5 +88,14 @@ Uses MSW rather than MSN or MSB.
<td>Improves coverage of emissions data for these plants (there is no emission factor for OTH fuel, so these emissions would otherwise be zero)
</td>
</tr>
<tr>
<td>Global Warming Potential (GWP)
</td>
<td>Has used AR4 GWPs since eGRID2018 (still using AR4 as of eGRID2022)
</td>
<td>AR5 GWPs are used starting in 2019 (the earliest year of OGE data available, although they apply as far back as 2014). AR6 GWPs have been used since data year 2021.
</td>
<td>CO2-eq factors from eGRID will underestimate the GWP of CH4, and overestimate the GWP of N2O relative to the currently-recognized GWPs of these gases.
</td>
</tr>
</table>
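
To make the effect of the GWP choice in the last row concrete, here is a minimal sketch that computes CO2-equivalent mass under two sets of 100-year GWPs. The AR4 and AR5 values are the commonly cited ones (AR5 without climate-carbon feedbacks); whether OGE uses exactly these AR5 numbers is an assumption, and the function name and input masses are hypothetical.

```python
# Commonly cited 100-year GWPs (assumption: AR5 values exclude climate-carbon feedbacks).
GWP_100 = {
    "AR4": {"co2": 1, "ch4": 25, "n2o": 298},
    "AR5": {"co2": 1, "ch4": 28, "n2o": 265},
}


def co2_equivalent(co2_mass: float, ch4_mass: float, n2o_mass: float, report: str) -> float:
    """Weight each gas by its GWP and sum to a CO2-equivalent mass."""
    gwp = GWP_100[report]
    return co2_mass * gwp["co2"] + ch4_mass * gwp["ch4"] + n2o_mass * gwp["n2o"]


# Hypothetical emissions, all in the same mass unit.
for report in GWP_100:
    print(report, co2_equivalent(1000.0, 1.0, 1.0, report))
```

With these values the CH4 term is smaller and the N2O term is larger under AR4 than under AR5, which is the direction of the difference described in the table.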
3 changes: 1 addition & 2 deletions environment.yml
@@ -3,11 +3,9 @@ channels:
- defaults
- conda-forge
dependencies:
- black # development: code formatting
- blas=*=openblas # prevent mkl implementation of blas
- cvxopt
- cvxpy=1.2.1 # used by gridemissions, newer version not working as of 12/12/2022
- flake8 # development: linter
- ipykernel
- nomkl # prevent mkl implementation of blas
- notebook
@@ -23,6 +21,7 @@ dependencies:
- qdldl-python==0.1.5,!=0.1.5.post2 # used for gridemissions, newer version not working as of 12/12/2022
- requests>=2.28.1
- ruff
- s3fs
- seaborn # used by gridemissions
- setuptools # used for pudl
- sqlalchemy
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "hatchling.build"

[project]
name = "oge"
version = "0.3.0"
version = "0.3.1"
requires-python = ">3.11"
readme = "README.md"
authors = [
@@ -32,6 +32,7 @@ dependencies = [
"sqlalchemy",
"statsmodels",
"coloredlogs",
"s3fs[boto3] == 2023.12.2",
"catalystcoop-pudl@git+https://github.com/singularity-energy/pudl.git@oge_release",
"gridemissions@git+https://github.com/singularity-energy/gridemissions.git",
]
4 changes: 1 addition & 3 deletions src/oge/__init__.py
@@ -1,6 +1,4 @@
# Set up the OGE logging configuration once.
import logging
from oge.logging_util import configure_root_logger
from oge.filepaths import outputs_folder

configure_root_logger(outputs_folder("logfile.txt"), logging.INFO)
configure_root_logger(logfile=None)
36 changes: 36 additions & 0 deletions src/oge/constants.py
@@ -0,0 +1,36 @@
# specify the energy_source_codes that are considered clean/carbon-free
CLEAN_FUELS = ["SUN", "MWH", "WND", "WAT", "WH", "PUR", "NUC"]

# specify the energy_source_codes that are considered to be biomass
BIOMASS_FUELS = [
"AB",
"BG",
"BLQ",
"DG",
"LFG",
"MSB",
"OBG",
"OBL",
"OBS",
"SLW",
"WDL",
"WDS",
]

TIME_RESOLUTIONS = {"hourly": "H", "monthly": "M", "annual": "A"}

# derived from table 2.4-4 of the EPA's AP-42 document
nox_lb_per_mmbtu_flared_landfill_gas = 0.078

# values assumed by eGRID for CHP efficiency
chp_gross_thermal_output_efficiency = 0.8
chp_useful_thermal_output_efficiency = 0.75


class ConversionFactors(float):
"""Defines conversion factors between common units."""

lb_to_kg = 0.453592
mmbtu_to_GJ = 1.055056
mwh_to_mmbtu = 3.412142
short_ton_to_lbs = 2000
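
A short sketch of how factors like these might be applied, for example when producing metric-unit outputs from US-unit values. The two constants are copied from the `ConversionFactors` class above; the function names and sample inputs are hypothetical and not taken from the OGE code.

```python
# Values copied from ConversionFactors above.
LB_TO_KG = 0.453592
MMBTU_TO_GJ = 1.055056


def lb_per_mwh_to_kg_per_mwh(rate_lb_per_mwh: float) -> float:
    """Convert an emission rate from lb/MWh to kg/MWh (hypothetical helper)."""
    return rate_lb_per_mwh * LB_TO_KG


def mmbtu_to_gj(fuel_mmbtu: float) -> float:
    """Convert fuel energy content from MMBtu to GJ (hypothetical helper)."""
    return fuel_mmbtu * MMBTU_TO_GJ


print(lb_per_mwh_to_kg_per_mwh(1000.0))
print(mmbtu_to_gj(2.0))
```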
3 changes: 1 addition & 2 deletions src/oge/consumed.py
@@ -7,12 +7,11 @@
from gridemissions.eia_api import KEYS, SRC
from oge.filepaths import outputs_folder, reference_table_folder, results_folder
from oge.logging_util import get_logger

from oge.constants import TIME_RESOLUTIONS
from oge.output_data import (
GENERATED_EMISSION_RATE_COLS,
CONSUMED_EMISSION_RATE_COLS,
output_to_results,
TIME_RESOLUTIONS,
)

logger = get_logger(__name__)
2 changes: 1 addition & 1 deletion src/oge/data_cleaning.py
@@ -9,7 +9,7 @@
import oge.load_data as load_data
import oge.validation as validation
import oge.emissions as emissions
from oge.emissions import CLEAN_FUELS
from oge.constants import CLEAN_FUELS
from oge.column_checks import get_dtypes, apply_dtypes
from oge.filepaths import reference_table_folder, outputs_folder
from oge.logging_util import get_logger
17 changes: 8 additions & 9 deletions src/oge/data_pipeline.py
@@ -22,6 +22,7 @@
import oge.consumed as consumed
from oge.filepaths import downloads_folder, outputs_folder, results_folder
from oge.logging_util import get_logger, configure_root_logger
from oge.constants import TIME_RESOLUTIONS


def get_args() -> argparse.Namespace:
@@ -69,6 +70,11 @@ def print_args(args: argparse.Namespace, logger):

def main(args):
"""Runs the OGE data pipeline."""
if os.getenv("OGE_DATA_STORE") in ["s3", "2"]:
raise OSError(
"Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'"
)

args = get_args()
year = args.year

@@ -92,7 +98,7 @@ def main(args):
)
# Make results subfolders
for unit in ["us_units", "metric_units"]:
for time_resolution in output_data.TIME_RESOLUTIONS.keys():
for time_resolution in TIME_RESOLUTIONS.keys():
for subfolder in ["plant_data", "carbon_accounting", "power_sector_data"]:
os.makedirs(
results_folder(
@@ -118,14 +124,7 @@ def main(args):
# PUDL
download_data.download_pudl_data(source="aws")
# eGRID
# the 2019 and 2020 data appear to be hosted on different urls
egrid_files_to_download = [
"https://www.epa.gov/sites/default/files/2020-03/egrid2018_data_v2.xlsx",
"https://www.epa.gov/sites/default/files/2021-02/egrid2019_data.xlsx",
"https://www.epa.gov/system/files/documents/2022-09/eGRID2020_Data_v2.xlsx",
"https://www.epa.gov/system/files/documents/2023-01/eGRID2021_data.xlsx",
]
download_data.download_egrid_files(egrid_files_to_download)
download_data.download_egrid_files()
# EIA-930
# for `small` run, we'll only clean 1 week, so need chalendar file for making profiles
if args.small or args.flat:
21 changes: 13 additions & 8 deletions src/oge/download_data.py
@@ -232,18 +232,23 @@ def download_chalendar_files():
)


def download_egrid_files(urls_to_download: list[str]):
def download_egrid_files():
"""
Downloads the egrid excel files.
Inputs:
`urls_to_download`: a list of urls for the excel files that you want to download
Downloads the egrid excel files from 2018-2022.
"""
os.makedirs(downloads_folder("egrid"), exist_ok=True)

for url in urls_to_download:
filename = url.split("/")[-1]
filepath = downloads_folder(f"egrid/{filename}")
# the 2018 and 2019 data are in a different directory than the newer files.
egrid_urls = {
2018: "https://www.epa.gov/sites/default/files/2020-03/egrid2018_data_v2.xlsx",
2019: "https://www.epa.gov/sites/default/files/2021-02/egrid2019_data.xlsx",
2020: "https://www.epa.gov/system/files/documents/2022-09/eGRID2020_Data_v2.xlsx",
2021: "https://www.epa.gov/system/files/documents/2023-01/eGRID2021_data.xlsx",
2022: "https://www.epa.gov/system/files/documents/2024-01/egrid2022_data.xlsx",
}

for year, url in egrid_urls.items():
filepath = downloads_folder(f"egrid/egrid{year}_data.xlsx")
download_helper(url, filepath)


5 changes: 1 addition & 4 deletions src/oge/eia930.py
@@ -2,7 +2,6 @@
import re
from datetime import timedelta
import os
from os.path import join

import oge.load_data as load_data
from oge.column_checks import get_dtypes
@@ -152,9 +151,7 @@ def clean_930(year: int, small: bool = False, path_prefix: str = ""):
# Adjust
logger.info("Adjusting EIA-930 time stamps")
df = manual_930_adjust(df)
df.to_csv(
join(data_folder, "eia930_raw.csv")
) # Will be read by gridemissions workflow
df.to_csv(data_folder + "eia930_raw.csv") # Will be read by gridemissions workflow

# Run cleaning
logger.info("Running physics-based data cleaning")