Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial script for issue #23 health & education electricity cluster #208

Draft
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

Sunishchal
Copy link
Collaborator

Not intending to merge yet, just making my initial attempt visible to field Denton's feedback on the file structure & approach

@Sunishchal
Copy link
Collaborator Author

Questions on how to structure this solution:

  • Should "cluster" sheet have their own subdirectory within healthandeducation? (ex: healthandeducation/electricity_cluster/init.py)
  • Unit tests for each cluster can exist within each subdirectory (ex: healthandeducation/electricity_cluster/test_electricity.py)
  • Should each class still be called Scenario()? Or should they be named after their cluster?
  • Can the population tables or emissions factors be reused from elsewhere in the repo?

Please excuse poor code hygiene & long lines. I plan to format this before the final commit. Which autoformatter do you recommend, if any?

Copy link
Contributor

@DentonGentry DentonGentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we can find a way for there to be just one solution/health_and_education/__init__.py file and one Scenario class, because for the overall system we'd like One Drawdown Solution == One Python Class.

We can have Python modules within solution/health_and_education/electricity and so on, and have solution/health_and_education/__init__.py import those modules, but ideally those submodules would not be visible outside of solution/health_and_education/__init__.py


Can the population tables or emissions factors be reused from elsewhere in the repo?

The other solutions use data/unitadoption_pds_population.csv and data/unitadoption_ref_population.csv. From what Chad mentioned in email, Health and Education has different data than the rest of the solutions currently use. I'm not clear on whether that will be the case forever, or is just a temporary condition that Health and Education has updated data which the rest of the solutions have not incorporated yet.

Depending on that answer we'll either want the population data to live in solution/health_and_education/data, or put it in the toplevel data directory with a name that makes it clear it is currently only used in some of the solutions.

THISDIR = pathlib.Path(__file__).parents[0]

name = 'Health and Education - Electricity Cluster'
# solution_category = ac.SOLUTION_CATEGORY.REDUCTION #TODO: Confirm this is a reduction solution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Health and Education, like the other special solutions for Food Waste and Plant Rich Diet and so on, is neither a Reduction, Replacement, or Land Use solution. It is its own thing. We might want to add a SOLUTION_CATEGORY of SPECIAL to cover it.

If we find that each should be its own SOLUTION_CATEGORY we can add more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a reduction solution. It has a direct impact on the functional demand of... everything.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I were to uncomment this line, are there any technical reasons REDUCTION would be incompatible with health & education?
I'm unfamiliar with the advanced_controls.py module, but I don't see SOLUTION_CATEGORY mentioned much in that script, so I think it would be safe to leave as REDUCTION.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be safe to leave as REDUCTION. I'd expected to have these be their own special category, but it doesn't have to.


# Population scenarios
# Population_Tables!C2:L49
self.ref1_population = pd.DataFrame(ref1_population_list[1:],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For unit tests in solution/health_and_education/tests, it would absolutely make sense to have things like ref1_population_list embedded in the solution/health_and_education/tests/something_test.py file. The list would most likely hold data which originated in the Excel file and is being used for a test.

For this actual solution file, I don't think embedding the data in the __init__.py file in the form of a ref1_population_list is a good idea. Our UI and workflows for researchers to add or adjust data will expect to be able to write out data files, not Python files. Recommend that these be added as CSV files in solution/health_and_education/data, and read them in using pd.read_csv().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. So any data that researchers interact with should be in CSV files (Population_Tables). But for intermediate tables, those should exist as lists in the unit test files (tables 3-14 in Electricity_cluster)?

What about Table 2 (REF 2 TAM)? That one should be coming from other electricity solutions, so I assume it's okay to leave in python lists for now until we identify how to source the electricity solution data from somewhere like solarpvutil.Scenario().tm.ref_tam_per_region()? This relates to question #2 from my email last week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine for unit tests to use inline data.

I'd prefer that the actual solutions not do so, unless we have no other alternative for something. I don't expect the UI to be able to modify *.py files, anything inline in Python source will require someone with an editor to modify the Python source.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to depart from the in line data and just copy paste the entire electricity_cluster Excel sheet into a CSV for unit testing: https://github.com/ProjectDrawdown/solutions/blob/395ed46819ea6f90bcb29b04c8f3bd03bd1a46a9/solution/health_and_education/tests/expected_elec_cluster.csv

Let me know if this approach is kosher. I saw test_excel_integration.py uses similar methods except the CSV lives inside a zip file. I can follow that approach here too if you prefer.

@@ -0,0 +1,447 @@
"""Health & Education solution model for Electricity Cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healthandeducation is difficult to understand if you don't already know what it is. I keep seeing the word "hand" in it. Recommend naming the subdirectory health_and_education instead.

Copy link
Collaborator Author

@Sunishchal Sunishchal Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I was initially following the same "no snake case" convention as the other solutions, but I think it makes sense to add underscores here. I doubt that "hand education" would be an effective climate solution ✋😂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol... that is pretty good, Dev

DATADIR = pathlib.Path(__file__).parents[2].joinpath('data')
THISDIR = pathlib.Path(__file__).parents[0]

name = 'Health and Education - Electricity Cluster'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hoping that solution/health_and_education can handle all of the clusters, such that this __init__.py would handle all of them. Its name would be 'Health and Education' without a specific cluster.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the bulk of this code out to a new electricity_cluster.py file and simply instantiate the object in __init__py. Let me know if this is the structure you had in mind. We'll probably need slightly different abstractions later to handle different complexities as I implement more clusters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that will be fine. If necessary it can be put into electricity_cluster/__init__.py and imported as a module, if we find that there need to be more files encapsulated within electricity_cluster/*

# % impact of educational attainment on uptake of Family Planning:
fixed_weighting_factor = None
pct_impact = 0.50
use_fixed_weight = 'N'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these need to vary by scenario they'll need to be in advanced_controls.py instead of here. It is fine to have them here for now until we figure out whether they vary by scenario.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted this down as a comment so I remember to revisit it later. For now, will keep them local to each cluster's .py file.

'Asia (Sans Japan)': 'Y',
'Middle East and Africa': 'Y',
'Latin America': 'N'
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we'll need to revisit: we're trying very hard to minimize the number of places in the codebase which know the names of the regions, like 'Middle East and Africa'. Right now we've gotten it down to just model/dd.py and model/emissions_factors.py, which has a notion of how clean the electric grid is in each region.

The intent is to let this model implementation function at the level of the whole world, and also be able to support regionalization to model the United States or India or the EU as the top level and have regions within it which are sensible for that purpose.

It may be that lldc_high_nrr_config only makes sense at the level of the entire world, and that if modelling the US or India or the EU that all of the regions within would either be 'Y' or 'N'

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I plan to add this lldc_high_nrr dictionary as a variable in dd.py (similar to dd.REGION)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that would be fine. It will need a good description of what it is, since it won't be as widely used as the other things in that file.

# Population_Tables!O2:X49
self.ref2_population = pd.DataFrame(ref2_population_list[1:],
columns=ref2_population_list[0],
index=list(range(2014, 2061)), dtype=np.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the data come from a CSV file means one of the columns should be the Year, and use .set_index() on the DataFrame.

The danger of using range(2014, 2061) for the index is that the data is decoupled from the year. Early on we had a bunch of bugs dealing with shifting years out from under the data, and adopted a policy of keeping the year indexes together with the data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's safer to have an explicit Year column. I saw places in the codebase like test_tam.py which use the range(2014, 0261) method, so I mimicked that.
I've now abstracted this into the CSV file. I didn't use the .set_index() method you mentioned, but achieved the same effect by passing the index_col='Year' argument into pd.read_csv().

if lldc_high_nrr_config['Asia (Sans Japan)'] == 'N':
ref1_elec_gen.loc[:, 'MDC + EE + LAC with higher educational attainment'] = ref1_elec_gen.loc[:, 'MDC + EE + LAC with higher educational attainment'] - self.ref1_tam_high_edu.loc[:, 'China']

ref1_elec_gen.loc[:, 'Total Electricity Demand in Countries with Low Educational Attainment (TWh), exluding China'] = ref1_elec_gen.loc[:, 'LLDC with low educational attainment, excluding China'] + ref1_elec_gen.loc[:, 'MDC + EE + LAC with low educational attainment, excluding China']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A big reason for wanting these to come from data files is that embedded within the Python code will make these more difficult to change as the researchers add data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean I should remove the Asia (Sans Japan) reference in line 136? How might I abstract that into a data file?

# Population_Tables!C2:L49
ref1_population_list = [
['World', 'OECD90', 'Eastern Europe', 'Asia (Sans Japan)', 'Middle East and Africa', 'Latin America', 'China', 'India', 'EU', 'USA'],
[7349.47210, 929.27447, 407.26543, 3957.23614, 1421.29910, 634.39696, 1376.04894, 1311.05053, 738.44207, 321.77363],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend these go into a data subdirectory within solution/health_and_education.

Note that there is a data/health directory at the toplevel which we should rename. That holds information about the health of the models, like how many use Custom PDS Adoption and whether they have regional data. At the time, there wasn't a Health and Education sector there was Women and Girls, so we didn't see the name conflict.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved them to CSVs inside solution/health_and_education/data.

Would you like me to rename the data/health directory, or were you just mentioning this for my understanding?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just mentioning, I wouldn't rename data/health in this same set of commits.

@Sunishchal
Copy link
Collaborator Author

I used many country/region references in the electricity_cluster implementation because I wanted it to be as explicit and similar to the Excel as possible for easy debugging. Shall I plan on changing the column naming convention or using something like .iloc[] which doesn't rely on strings of column names?

@Sunishchal
Copy link
Collaborator Author

Sunishchal commented Sep 14, 2020

I've added an initial stab at a unit testing file. I went the route of adding a huge CSV file with all of the intermediate tables from Chad's Excel file to use as the "expected df" (can turn this into a zip file later). Only added a single unit test so far. Let me know if this setup will work and I'll flesh out the rest of the tests.

I already see that the test_electricity.py file has failed the build. This is because I had to resort to a sys.path.append() to get my electricity_cluster module to import, since I was unable to get it to import naturally like the rest of the modules in this repo. Could you help me work around this?

@DentonGentry
Copy link
Contributor

Shall I plan on changing the column naming convention or using something like .iloc[] which doesn't rely on strings of column names?

iloc tends to be fragile, especially when using it to address rows. For example, if two researchers both add a line at the end of the file and one of them gets a merge conflict, the resolution will move the new row and any iloc[] trying to reference it will be off-by-one.

We've tried to mostly only use iloc[0], meaning "the first data point no matter what the year is."

@DentonGentry
Copy link
Contributor

This is because I had to resort to a sys.path.append() to get my electricity_cluster module to import, since I was unable to get it to import naturally like the rest of the modules in this repo. Could you help me work around this?

This may end up being a reason to add it as electricity_cluster/__init__.py instead of electricity_cluster.py, so it will import as a regular module.

@DentonGentry
Copy link
Contributor

Oops. Didn't intend to close it, reopened.

@Sunishchal
Copy link
Collaborator Author

Sunishchal commented Sep 26, 2020

This may end up being a reason to add it as electricity_cluster/init.py instead of electricity_cluster.py, so it will import as a regular module.

I made the requested change in this commit. However, I opted to name the directory "clusters" rather than "electricity_cluster" so it can house .py files from all the clusters and share the same init. Let me know if this structure is suitable.

However, I am still unable to import upper level modules like model.dd without having a sys.path.append, as you'll see in this commit. Is this okay to keep or is there some other convention we prefer to follow?

@Sunishchal
Copy link
Collaborator Author

iloc tends to be fragile, especially when using it to address rows. For example, if two researchers both add a line at the end of the file and one of them gets a merge conflict, the resolution will move the new row and any iloc[] trying to reference it will be off-by-one.

Good to know. In that case I will keep the verbose strings for column indexing. Point noted about using iloc in case I need to access the first row regardless of year.

@brodavi
Copy link
Collaborator

brodavi commented Jun 15, 2021

@Sunishchal is this PR still valid, or is it obsolete? I see some tests failed, is that something we should try to address?

@Sunishchal
Copy link
Collaborator Author

@Sunishchal is this PR still valid, or is it obsolete? I see some tests failed, is that something we should try to address?

@brodavi It's still valid but not ready to merge for a while until we figure out how to integrate this model with the rest of the codebase. I just had a chat with @denised and it sounds like we'll need to spend some time understanding how to do that since this model is quite different from other solutions.

@denised denised marked this pull request as draft June 16, 2021 05:22
@denised denised linked an issue Jun 26, 2021 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Health & Education solutions
4 participants