Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Health & Education solutions #23

Open
DentonGentry opened this issue Feb 8, 2020 · 11 comments · May be fixed by #208
Open

Health & Education solutions #23

DentonGentry opened this issue Feb 8, 2020 · 11 comments · May be fixed by #208
Assignees
Labels
Excel Import large-project This is a very large meta-issue, and will turn into a number of individual issues as it proceeds. Python Model

Comments

@DentonGentry
Copy link
Contributor

The models for the Family Planning and Educating Girls solutions are not separate Excel files but instead are constructed as an analysis built in to all of the other solutions. The difference in results from these two factors are computed for each of the other ~78 solutions, and summed.

This will need to be implemented in Python, likely in the same way by adding the handling to the model and computed for each of the other solutions.

@DentonGentry
Copy link
Contributor Author

The Health and Education part of the model is one of the integration steps: each individual solution produces its results, then the overall set of solutions is brought into integration to do a number of things:

  • some solutions are tradeoffs, like all of the energy solutions cannot exceed the total global demand for energy
  • some of the solutions provide feedstock for other solutions, like agricultural waste providing compost for regenerative agriculture and biomass power generation
  • a few of the solutions are modeled as part of the integration, like the Educating Girls and Family Planning solutions which make up the Health and Education sector. These two adjust the Total Addressable Market of most of the other solutions.

@DentonGentry DentonGentry changed the title Women & Girls solutions Health & Education solutions Jun 6, 2020
@Sunishchal
Copy link
Collaborator

Hi Denton, here are my notes from our call with Chad on Tuesday:

  • Each solution is a py module, there's a class named scanerio - upon instantiation it expects a string, but pass nothing and it takes firstscenario
  • Each solution has a member object named TAM: loop over each solution, get TAM for electricity
  • Use a different module to get emissions
  • Write the result out to files, since it will be expensive to run every time
  • Need to get UN population data - get a dump for now and worry about auto-refresh data later on
  • Multiply out everything above, then come up with a way to test the solution against Excel - will construct something similar to test_excel_integration.py
  • Will be about 6 weeks before we can test with all the 2020 solutions

https://population.un.org/wpp/Download/Standard/Population/
I believe this would be the UN data source?
Which file should be used? The total population of both sexes or the population growth rate data?

I found and studied the TAM module. I will work on generating and saving TAM for the following solutions (let me know if I'm missing any):
[airplanes, bikeinfrastructure, biochar, geothermal, solarpvroof, solarpvutil, offshorewind, onshorewind]

Is there any particular format these TAM data should be stored in?
Shoulld be capturing the 2014 TAM as is hardcoded into the solutions modules?
self.r2s = rrs.RRS(total_energy_demand=ref_tam_per_region.loc[2014, 'World']

Regarding emissions, should I dump it into the same CSV file as TAM or keep it separately?
Should I only be capturing CO2, or also CH4 and other gasses?
Would the self.c2 object contain all the data I need to capture?

Also curious, what does VMA stand for?

@DentonGentry
Copy link
Contributor Author

https://population.un.org/wpp/Download/Standard/Population/
I believe this would be the UN data source?
Which file should be used? The total population of both sexes or the population growth rate data?

For the current phase of the effort, where the Excel files are still being used and we need the Python implementation to match Excel, I think we don't want to download new data from the UN. Once we've retired the Excel files and are looking to the future, we'd work on downloading the freshest data. For now, I think we want to use the population data used in Excel. I think that is from the Unit Adoption Data tab, the tables starting at cells P17 and P69.

I will work on generating and saving TAM for the following solutions (let me know if I'm missing any)

To be honest I'd start with one. SolarPVUtility tends to work well, as it is a relatively straightforward solution.

Is there any particular format these TAM data should be stored in?

For two dimensional data we typically use CSV, as it is dramatically faster to load than Excel files.

If the file is one which a human researcher is likely to open and modify, we tend to prefer xlsx as it will preserve formatting, columns widths, and other things which make it more pleasant to work with.

For these files, generated automatically and used internally, I'd say CSV.

Should be capturing the 2014 TAM as is hardcoded into the solutions modules?

I think it will be the two dimensional data, with rows from 2012-2060 and columns for each Drawdown region.

Regarding emissions, should I dump it into the same CSV file as TAM or keep it separately?

I'd expect each would be a unique CSV file.

Should I only be capturing CO2, or also CH4 and other gasses?

Drawdown tends to use CO2eq for everything. CH4 and F-gasses are converted into the equivalent concentration of CO2.

Would the self.c2 object contain all the data I need to capture?

Yes, I think so.

Also curious, what does VMA stand for?

Variable Meta-Analysis. Multiple sources for a given data point, like the fuel efficiency of hybrid cars or the amount of CO2 which an acre of degraded land could sequester, are entered into the VMA row by row. It will then calculate a single value, typically the mean though the human researcher can apply judgement to adjust it.

@Sunishchal
Copy link
Collaborator

Hi Denton,

Sorry for the radio silence, had a big release at work. (Any tips on how to balance your time between drawdown & a tech job?)

I've tinkered with the solution modules and saved the tm.pds_tam_per_region() dataframe as a CSV from solarpvutil. I have some code that can do a similar task for any other solution. I also saved the population data as CSV from the Unit Adoption Data tab you mentioned.

I have also done the same for the c2.soln_pds_direct_co2_emissions_saved dataframe, but noticed the values are always zero. Am I looking at the right object? I couldn't find anything else within c2 that had more meaningful data. Also checked bikeinfrastructure and airplanes (which I imagine would have a fairly direct impact on CO2 savings?) and they are also zero.

While looking at this dataframe for airplanes, I noticed a strange pattern where only the World column has non-null values, except for the year 2018. Is this intentional? Figured it's worth bringing to your attention.
image

Now that I've got these CSVs saved, could you direct me on how to "Multiply out everything above, then come up with a way to test the solution against Excel - will construct something similar to test_excel_integration.py"?

I took a look at that .py script and it seems like the verify_tam_data() and verify_tam_data_eleven_sources() methods would be useful for my purposes? The part I'm a unclear on is how these TAM values will be constrained according to prospective population reduction scenarios.

@DentonGentry
Copy link
Contributor Author

DentonGentry commented Jun 30, 2020

I have also done the same for the c2.soln_pds_direct_co2_emissions_saved dataframe,
but noticed the values are always zero.

c2.co2_mmt_reduced() and c2.co2eq_mmt_reduced() are the best routines to call to get emissions reductions for a solution. The result is in millions of metric tons. co2_mmt_reduced() returns CO2 results, co2eq_mmt_reduced() includes methane and other GHGs by converting them to CO2-equivalents.

While looking at this dataframe for airplanes, I noticed a strange pattern where only the
World column has non-null values, except for the year 2018. Is this intentional?

Most of the energy solutions have good data at the regional level, because IEA and IANA publish extensive energy generation information. Many of the other solutions lack regional data, they only have data for the World.

In Excel, adoption data is often implemented via interpolation between two years, typically datapoints in 2018 and 2050. In the spreadsheet even when there is no regional adoption data, the 2018 year will be populated with zero. For tests/test_excel_integration.py to pass, we also populate 2018 with zero in Python but we make the rest of the years NaN to ensure that calculations involving the nonexistent regional data will not produce a result.

airplanes is one where the solutions doesn't have any data at the regional level, so the regional results are mostly NaN.

Now that I've got these CSVs saved, could you direct me on how to "Multiply out everything above, then come
up with a way to test the solution against Excel - will construct something similar to test_excel_integration.py"?

This is where we're in uncharted territory: the process where the researchers work out constraints between the solutions (i.e. don't add up to more than the total demand for energy) has not been implemented in Python and I'm not very familiar with how it is done. I know there is a separate spreadsheet where the researchers paste their results. I imagine we'd need to import that spreadsheet, and write a new test which compares results from the Python equivalent.

I took a look at that .py script and it seems like the verify_tam_data() and verify_tam_data_eleven_sources()
methods would be useful for my purposes? The part I'm a unclear on is how these TAM values will be
constrained according to prospective population reduction scenarios.

I imagine the tests we need to write would look similar, but probably not exactly verify_tam_data() and verify_tam_data_eleven_sources(). Those two routines test that the Python combining all of the sources for TAM data matches the result that Excel gets. There are two versions of it because some of the Excel files shifted their columns around to make room for more sources.

The test we need to write here can wouldn't test the TAM Data tab in the spreadsheets, as those are now inputs. It would test that the result, the new TAM with limits imposed by the other solutions, matches.

@Sunishchal
Copy link
Collaborator

Sunishchal commented Jul 4, 2020

I've written out my best understanding below, let me know if this process sounds correct:

  1. Capture the ratio between the ref and pds population scenarios (I believe I can divide data/unitadoption_pds_population.csv & data/unitadoption_ref_population.csv)
  2. Use the above ratio to scale down the TAM from each energy solution
  3. Verify that this new TAM we scaled down in python matches the values in the TAM Data sheet of each energy solution's Excel file (for world & all regions: test the linear, 2nd poly, 3rd poly, and exponential trends)
  4. Save the outputs of these TAM forecasts as files because they will be expensive operations

I know there is a separate spreadsheet where the researchers paste their results. I imagine we'd need to import that spreadsheet, and write a new test which compares results from the Python equivalent.

Is the separate spreadsheet you're referring to the "FamPlanning" tab in Excel? If so, how should I be consuming it?
image

In our call with Chad, you mentioned that I'll be building some infrastructure for what the research team considers the integration step. This will step through solutions, importing them, running them, etc. Could you elaborate on the scope & requirements of this?

I'll try to build it such that it doesn't require a lot of rework when we want to generalize it to other solution integration tasks. If I understand correctly, it should currently handle the 8 energy related solutions, but later will need to work for all the PDS solutions?

Anything else I should know before I start writing the code? Any details on how to work with the folder structure? Any particular places I should store certain files?

@DentonGentry
Copy link
Contributor Author

For step 3:

Verify that this new TAM we scaled down in python matches the values in the TAM Data sheet of each energy solution's Excel file (for world & all regions: test the linear, 2nd poly, 3rd poly, and exponential trends)

I think scaling down the TAM according to population sounds right, but I don't think it can be directly compared to the TAM Data tab in a test case. The TAM Data tab contains the unscaled, original TAM. tests/test_excel_integration.py currently compares the Python output to TAM Data, and passes, so we know that after scaling it won't match.

I think the test would instead read the "FamPlanning" tab from the spreadsheet, call equivalent routines in the Python code you're providing, and compare that they match. It may be that this can all be done as part of tests/test_excel_integration.py by adding another list of cells in the "FamPlanning" tab to verify[], but I'm not quite sure of that.

Is the separate spreadsheet you're referring to the "FamPlanning" tab in Excel? If so, how should I be consuming it?

I was referring to (at least one) entirely separate spreadsheet used during integration. The researchers paste CO2 emissions reductions and costs into a separate spreadsheet. I think the equivalent in Python would be tools to iterate through all solutions gathering the c2.co2eq_mmt_reduced() and other outputs. I think this CO2 emissions spreadsheet is probably not needed for the Health and Education solutions you're working on here, I think the FamPlanning tab in the individual solutions is likely to be the thing to use to write a test.

The "FamPlanning" tab appears to implement the handling of data/unitadoption_pds_population.csv vs data/unitadoption_ref_population.csv. If so, I think we'd want Python code which can produce a Pandas dataframe in the same format, allowing it to be compared in tests/test_excel_integration.py

In our call with Chad, you mentioned that I'll be building some infrastructure for what the research team considers the integration step. This will step through solutions, importing them, running them, etc. Could you elaborate on the scope & requirements of this?

I imagine that walking through the energy solutions to compute a new TAM and generate results would look like:

  1. read in data/overview/solutions.csv, iterate over all solutions where Sector == 'Electricity Generation'
  2. instantiate a Scenario object for each solution in Electricity Generation. You can omit the scenario name argument, as it will default to the first scenario which should be fine for this initial implementation. importlib is useful for this:
    importname = 'solution.' + solution  
    m = importlib.import_module(importname)  
    obj = m.Scenario()  
  1. This is where things get fuzzy for me: work out the modified TAMs based on population, and instantiate another Scenario object using the alternate TAM. We'll need to add a way to do this, currently the Scenario objects know where their TAMs are and don't have a way to specify an alternate. The energy solutions mostly use the same TAM data and we've consolidated it in data/energy with solution/rrs.py providing the list of sources to use. We'll need to find a way to instead use an alternate set of TAM files, ideally one which will work for more than just the energy solutions.

I'd recommend that at least for now, to add a new Python file in tools/.py for this integration code. We'll eventually add it as a GitHub Action to run every night, and have it send us a CL if the results have changed by someone checking in a change to one of these solutions during the previous day.

@Sunishchal
Copy link
Collaborator

Sunishchal commented Jul 14, 2020

I took a closer look at the FamPlanning tab and have a few follow up questions:

  1. Cell B8 says "Additional Functional Units in REF2 vs REF2". Is this a typo? Or is it comparing REF2s from two different sources?
  2. Cell C3 allows us to choose from various conventional energy sources to compare against. Will the python implementation need to test against values form each of these conventional sources?
  3. The table is blank from column H onward. What is the purpose of this part of the sheet? Is it a placeholder to be utilized later on?
  4. It hasn't quite "clicked" for me yet how the researchers use this portion of the Excel and its interplay with the rest of the sheets. I see the Advanced Controls and Unit Adoption Calculations sheets are referenced in the formulas, but I lack context around what these parts of the models do. If there is some documentation around all this, I'd love to study it. Perhaps @chadfrisch can add some color here from an end user perspective?

I think the test would instead read the "FamPlanning" tab from the spreadsheet, call equivalent routines in the Python code you're providing, and compare that they match. It may be that this can all be done as part of tests/test_excel_integration.py by adding another list of cells in the "FamPlanning" tab to verify[], but I'm not quite sure of that.

Would the list of cells I add to the test be B10:G56? I suppose E10:G56 will depend on the answer to #2 from above, since it's dynamic based on the conventional selection.
Could you point me to the script which instantiates this verify object? I imagine I would be able to find that in the test runner, but I've never used tox/make before so I'm unfamiliar with how to navigate this.

I was referring to (at least one) entirely separate spreadsheet used during integration.

Are you referring to the "Summary Family Planning Model File" that's referred to in cells R3:U5 in the FamPlanning tab? If so, would giving me access to those files help me complete this issue?

I'd recommend that at least for now, to add a new Python file in tools/.py for this integration code.

Great, thanks for jotting down those requirements. I will keep the design in mind as I build this out for SolarPVUtility.

@DentonGentry
Copy link
Contributor Author

Cell B8 says "Additional Functional Units in REF2 vs REF2". Is this a typo? Or is it comparing REF2s from two different sources?

The formulae refer to Unit Adoption Calculations cells in the Q19 range, and subtract the Q71 range. Those are labeled REF1 vs PDS population, so yes I'd assume the text in cell B8 is outdated or incorrect.

Cell C3 allows us to choose from various conventional energy sources to compare against. Will the python implementation need to test against values form each of these conventional sources?

I think we can start with just Coal, and get something working. I don't understand how the Family Planning outputs from each solution get combined into an overall Health and Education solution in the results. Non-energy solutions won't have 'Coal', they presumably will have something else.

The table is blank from column H onward. What is the purpose of this part of the sheet? Is it a placeholder to be utilized later on?

I don't know. I guess we ignore it for now and implement it if it becomes clear that it is important.

It hasn't quite "clicked" for me yet how the researchers use this portion of the Excel and its interplay with the
rest of the sheets. I see the Advanced Controls and Unit Adoption Calculations sheets are referenced in the
formulas, but I lack context around what these parts of the models do. If there is some documentation around
all this, I'd love to study it. Perhaps @chadfrisch can add some color here from an end user perspective?

There is an overview of the RRS solutions, like energy, in the Documentation directory.

I think the test would instead read the "FamPlanning" tab from the spreadsheet, call equivalent routines in
the Python code you're providing, and compare that they match. It may be that this can all be done as part of
tests/test_excel_integration.py by adding another list of cells in the "FamPlanning" tab to verify[], but I'm not
quite sure of that.

Yes, I think these should be verified in tests/test_excel_integration.py. That would be a good first step. Combining the FamPlanning results from all of the solutions to come up with a total can come later.

Would the list of cells I add to the test be B10:G56? I suppose E10:G56 will depend on the answer to #2 from above, since it's dynamic based on the conventional selection.

Yes, I'd say B10:G56.

If we do need to support more than just 'Coal' then we'll have to figure out how to do that. Right now, test_excel_integration uses an expected.zip file which contains the values from every sheet of the Excel file, for every scenario. We'd need to additionally store multiple FamPlanning sheets for the different conventional energy sources.

Could you point me to the script which instantiates this verify object? I imagine I would be able to find that in
the test runner, but I've never used tox/make before so I'm unfamiliar with how to navigate this.

tox -e ci will run the continuous integration test, including tests/test_excel_integration.
This takes a very long time, roughly two hours.

I was referring to (at least one) entirely separate spreadsheet used during integration.
Are you referring to the "Summary Family Planning Model File" that's referred to in cells R3:U5 in the
FamPlanning tab? If so, would giving me access to those files help me complete this issue?

I don't have that file, Chad probably does.

@Sunishchal
Copy link
Collaborator

I took a look through the CORE_PopulationChange_29Jan2020.xslx file that Chad emailed us a couple weeks ago, specifically the Electricity_cluster_lldc & Electricity_cluster_mdc tabs. It seems somewhat similar to the (now defunct) FamPlanning tab we were looking at earlier. The key differences I am spotting are:

  1. The electricity TAM is split between Least and Less Developed Countries (lldc) and More Developed Countries (mdc).
  2. The lldc TAM is split between two segments, found in row 22 of lldc. I interpret these to separate the least & less developed countries with higher (A) and lower (B) education attainment, though the names are hard to parse so please that validate my interpretation is correct.
  • A) Least and Less Developed Countries (sans LAC, EE, China, and countries with low educational attainment)
  • B) Least and Less Developed Countries (sans LAC, EE, China) with low educational attainment ONLY
  1. The "SOLUTION" cells (B26:F72) are all blank in the lldc sheet, even after I set the "Include in SOL?" cells (F3:14) to "Y" values. I presume this is by design because the less developed countries won't have these energy solutions implemented, so they only apply to the more developed countries?

The above begs the question: what would be the equivalent of cells to B10:G56 from FamPlanning which we earlier agreed would be what this new test reads in as validation data?
My guess would be B26:G72 from the Electricity_cluster_mdc tab, since there is no solution available for lldc.

Here is an exchange I had with @chadfrisch regarding some clarifications on these new Excel files:

Dev: At first glance, it seems like I'll want to use the "Electricity_cluster_lldc" and "Electricity_cluster_mdc" in lieu of the FamPlanning sheets from each of the individual energy solution workbooks. If I understand correctly, these sheets consolidate the TAM from all 8 energy solutions? If so, that should simplify the implementation from a python perspective.

Chad: Hi, yes, the electricity_cluster sheets use the common elec TAM for all grid solutions

Dev: I also don't think I'll need to use the WPP2015 file at this stage, because the population scenarios are already accounted for in the TAM from the CORE_PopulationChange file, correct? I think WPP2015 will be necessary to consider once we get to the point where we start refreshing the UN population data in an automated fashion (which I believe is still a ways out).

Chad: Correct re WPP2015. Not necessary now, but something on the horizon

Since the Electricity_cluster sheets consolidate all energy solutions, I don't believe I'll be able to start by writing a test for single solution, correct? It's seeming like the test will need to take the sum of TAM & CO2eq from the 8 energy solutions and validate them against the cells in this sheet.

Thanks for your patience with all the requirements clarifications this issue has called for. I'm certainly antsy to start writing some code soon... Assuming an affirmative response to all my questions above, I have a rough conceptual idea of what the implementation will look like. Would you like to schedule a 30 min sync this coming week to crystallize the approach?

@DentonGentry
Copy link
Contributor Author

I didn't initially have CORE_PopulationChange_29Jan2020.xslx (can't find any forwarding of it) but was able to obtain it today.

Since the Electricity_cluster sheets consolidate all energy solutions, I don't believe I'll be able to start
by writing a test for single solution, correct? It's seeming like the test will need to take the sum of
TAM & CO2eq from the 8 energy solutions and validate them against the cells in this sheet.

It seems so, yes. It will need to sum the TAM and emissions from all of the energy solutions.

I presume this is by design because the less developed countries won't have these energy solutions
implemented, so they only apply to the more developed countries?

I don't really know, however one guiding principal is that at this stage of the project is that the goal is to match the results of the model as it is, even in cases where we make a note to followup later of how things should be fixed or done differently. So I'd say that implementing LLDC as 0.0 for those fields would be best.

The above begs the question: what would be the equivalent of cells to B10:G56 from FamPlanning
which we earlier agreed would be what this new test reads in as validation data?

If B26:G72 from the Electricity_cluster_mdc seems reasonable, then by all means we can pursue that. I suspect that it will become more clear as we work on it.

Sunishchal added a commit to Sunishchal/solutions that referenced this issue Sep 6, 2020
@denised denised added the large-project This is a very large meta-issue, and will turn into a number of individual issues as it proceeds. label Jun 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Excel Import large-project This is a very large meta-issue, and will turn into a number of individual issues as it proceeds. Python Model
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants