
Avoid df/str copies and unneeded iterables #55

Open · alexandermorgan wants to merge 9 commits into main
Conversation


@alexandermorgan commented Mar 13, 2024

The most significant change here is in batch_unpack_from_supply_curve. I originally tried doing this:
sc_developable_df = sc_df[sc_df['capacity'] > 0].reset_index(drop=True)
That works, but the runtime is unchanged. A significant improvement comes from modifying sc_df in place. This looks safe because the function is only called from cli.py, where the table is read in from a CSV file and used solely to feed this batch_unpack_from_supply_curve function, so modifying it in place won't affect anything else. The share of runtime saved on this filter & reset_index work grows as the DataFrame gets larger. A further considerable improvement could be gained by setting pd.options.mode.copy_on_write = True, which will be the default in pandas 3 anyway, but that is too large a change to make without being more familiar with this repository, so this PR does not change that option.
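For illustration, here is a minimal sketch of the two approaches, assuming only that sc_df has a 'capacity' column; the in-place variant shown (drop + reset_index with inplace=True) is one way to do what's described and may not match the PR's exact code:

```python
import pandas as pd

sc_df = pd.DataFrame({"sc_gid": [1, 2, 3, 4], "capacity": [0.0, 12.5, 0.0, 8.0]})

# Copying approach: builds a second, filtered DataFrame and leaves sc_df untouched.
sc_developable_df = sc_df[sc_df["capacity"] > 0].reset_index(drop=True)

# In-place approach: drop the zero-capacity rows from sc_df itself and reset the
# index without allocating another full-size DataFrame.
sc_df.drop(sc_df.index[sc_df["capacity"] <= 0], inplace=True)
sc_df.reset_index(drop=True, inplace=True)
```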

In to_wgs, the del calls are not needed since the frame is already being filtered down to the self.df.columns names anyway.

This change, n_workers = min(cpu_count(), n_workers), was recommended by your linter.

And the geometry-building call all_turbines_df['geometry'] = all_turbines_df.apply(...) is also optionally routed through multiprocessing.
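As a rough illustration of that pattern (not the repository's actual code; the helper names _make_geometry, _apply_chunk, and build_geometries are made up here):

```python
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd


def _make_geometry(row):
    # Stand-in for the real per-row geometry construction.
    return (row["x"], row["y"])


def _apply_chunk(chunk):
    return chunk.apply(_make_geometry, axis=1)


def build_geometries(all_turbines_df, n_workers=1):
    # Cap the worker count at the number of available CPUs, as the linter suggested.
    n_workers = min(cpu_count(), n_workers)

    if n_workers <= 1:
        # Serial path: a plain DataFrame.apply.
        return all_turbines_df.apply(_make_geometry, axis=1)

    # Parallel path: split the rows into chunks and apply in worker processes.
    splits = np.array_split(np.arange(len(all_turbines_df)), n_workers)
    chunks = [all_turbines_df.iloc[idx] for idx in splits]
    with Pool(n_workers) as pool:
        results = pool.map(_apply_chunk, chunks)
    return pd.concat(results)
```

Usage would then look like all_turbines_df['geometry'] = build_geometries(all_turbines_df, n_workers=4).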

In characterizations.py, the .keys() call in columns=[c for c in col_df.columns if c not in lkup.keys()] is removed; membership tests and iteration on a dictionary already use its keys, so dict.keys() is almost never needed in Python.

In col_df.rename(...) there is no need for the dictionary comprehension, since we can operate directly on the columns with col_df.columns += "_area_sq_km".
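A small sketch of that change (the column names here are made up):

```python
import pandas as pd

col_df = pd.DataFrame({"developed": [1.2, 3.4], "excluded": [0.5, 0.1]})

# Rename via a dictionary built from the columns (the original approach):
renamed = col_df.rename(columns={c: f"{c}_area_sq_km" for c in col_df.columns})

# Operate directly on the columns Index instead; pandas broadcasts the string
# concatenation across every column name.
col_df.columns += "_area_sq_km"
```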

This creates 3 lists, 3 sets, and 1 dict_keys object, but none of these are necessary:

characterization_cols = list(characterization_remapper.keys())
df_cols = supply_curve_df.columns.tolist()
cols_not_in_df = list(set(characterization_cols).difference(set(df_cols)))

Instead, iterate over the dictionary: if the check succeeds, no list is built at all, and if it fails, it fails as soon as the first bad value is hit:

if any(key not in df.columns for key in characterization_remapper):

Though in the failure case you do still have to build the list for the error message.
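Putting the two versions side by side (supply_curve_df, the sample remapper, and the KeyError message below are illustrative stand-ins, not the actual code of characterizations.validate_characterization_remapper):

```python
import pandas as pd

supply_curve_df = pd.DataFrame({"wind": [1.0], "solar": [2.0]})
characterization_remapper = {"wind": {}, "solar": {}}  # illustrative

# Before: several intermediate containers are built just to find missing keys.
characterization_cols = list(characterization_remapper.keys())
df_cols = supply_curve_df.columns.tolist()
cols_not_in_df = list(set(characterization_cols).difference(set(df_cols)))
if cols_not_in_df:
    raise KeyError(f"Columns missing from supply curve df: {cols_not_in_df}")

# After: short-circuit on the first missing key and only build the list
# when an error message is actually needed.
if any(key not in supply_curve_df.columns for key in characterization_remapper):
    missing = [k for k in characterization_remapper
               if k not in supply_curve_df.columns]
    raise KeyError(f"Columns missing from supply curve df: {missing}")
```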

In functions.py, using df.rename without inplace=True copies the entire DataFrame, and doing this inside a loop meant several copies of the whole df could be made if multiple columns got renamed. So change df = df.rename({col: ncol}, axis=1) to building a list of column names outside the loop and only updating the column names once, if needed, with df.columns = new_columns.
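A sketch of that pattern (the rename_columns function and the mapping are illustrative, not the code in functions.py):

```python
import pandas as pd


def rename_columns(df, rename_map):
    # Build the full list of new names in one pass instead of calling
    # df.rename once per column (which copies the whole DataFrame each time).
    new_columns = [rename_map.get(col, col) for col in df.columns]
    if new_columns != list(df.columns):
        # Assigning to df.columns relabels in place; the data is not copied.
        df.columns = new_columns
    return df


df = pd.DataFrame({"area_developable_sq_km": [1.0], "capacity_mw": [100.0]})
df = rename_columns(df, {"area_developable_sq_km": "Developable Area (sq km)"})
```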

Finally, inserting an element at the zeroth position of a list forces all of the existing elements to shift. So replace values.insert(0, wkb) with an iterable-unpacking construction of the values list: values = [wkb, *row.values]
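For instance (the values here are stand-ins for row.values and the WKB geometry bytes):

```python
row_values = [451, 38.5, -105.2]   # stand-in for row.values
wkb = b"\x01\x01\x00\x00\x00"      # stand-in for the WKB geometry bytes

# Before: insert(0, ...) shifts every existing element one slot to the right.
values = list(row_values)
values.insert(0, wkb)

# After: build the list with the geometry already in front.
values = [wkb, *row_values]
```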

@alexandermorgan (Author)

Related #9

@codecov-commenter commented Mar 16, 2024

Codecov Report

Attention: Patch coverage is 52.94118%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 49.63%. Comparing base (cdd9594) to head (b922a5d).
Report is 134 commits behind head on main.

Files Patch % Lines
reView/utils/functions.py 0.00% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #55      +/-   ##
==========================================
+ Coverage   39.65%   49.63%   +9.97%     
==========================================
  Files          45       48       +3     
  Lines        3929     4745     +816     
  Branches      665      740      +75     
==========================================
+ Hits         1558     2355     +797     
+ Misses       2344     2340       -4     
- Partials       27       50      +23     
Flag Coverage Δ
unittests 49.63% <52.94%> (+9.97%) ⬆️

Flags with carried forward coverage won't be shown.


@alexandermorgan (Author)

The latest commit is a small change to read_timeseries that addresses the "too many statements" linter issue by moving a lot of the day, month, hour, etc. calculations out of the if/else statement.

@alexandermorgan (Author)

My apologies for this issue. With the most recent commit, the linter now returns 10/10 for all three files changed in this PR.

@mjgleason self-requested a review April 9, 2024 18:39
@mjgleason (Collaborator) left a comment

@alexandermorgan Thanks for submitting this PR, and apologies for not reviewing it sooner - I didn't see it until late last week.

Do you have benchmarks on the runtime before and after these changes?

I did some quick benchmarks using pytest-benchmark and one of the existing tests, test_cli.test_cli.test_unpack_turbines_happy. I used the existing test dataset (tests/data/bespoke-supply-curve.csv), which is pretty small (10 rows), and only saw a very small speedup (25-30 ms). I also created a larger CSV (1000 rows), and in that case the updated code is a bit slower (but the differences are in the noise). Regardless, these were both pretty small datasets and I didn't test the added parallelization, so I was curious whether you'd done more thorough benchmarking.
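(For reference, pytest-benchmark is used roughly like this; a generic sketch, not the project's actual test, and unpack_turbines_to_file below is a hypothetical stand-in for the command under test:)

```python
import pandas as pd


def unpack_turbines_to_file(csv_path):
    # Hypothetical stand-in for the function being benchmarked.
    return pd.read_csv(csv_path)


def test_unpack_turbines_happy(benchmark):
    # The `benchmark` fixture (from the pytest-benchmark plugin) calls the
    # function repeatedly and records the Min/Max/Mean/StdDev statistics
    # shown in the tables later in this thread.
    result = benchmark(unpack_turbines_to_file, "tests/data/bespoke-supply-curve.csv")
    assert not result.empty
```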

Overall, the actual code changes look great. I did have one problem (see in-line comments on bespoke.py) that needs your attention before we can merge.

Thanks again!

rdf["latitude"] = lats
del rdf["x"]
del rdf["y"]
rdf[['longitude', 'latitude']] = transformer.transform(xs, ys)
@mjgleason (Collaborator)

This line raises the following error for me: ValueError: Columns must be same length as key.

I'm not quite sure why this didn't cause the tests to fail. I've ensured that my versions of pandas and pyproj match those used in the GitHub Actions workflows, and this still fails.

Maybe we can just revert to:

lons, lats = transformer.transform(xs, ys)
rdf["longitude"] = lons
rdf["latitude"] = lats

I'd be surprised if this has any substantial effect on run time.

@alexandermorgan (Author)

It looks like there were two issues here. Referencing the columns as you did resolves one, and the other was the ValueError that you found. That was because I was counting on the following line, rdf = rdf[self.df.columns], to remove the 'x' and 'y' columns. But that was a mistake and was causing the "Columns must be same length as key" issue. In the end, I also replaced the two del rdf['x'] (and 'y') calls with assignment of the values in your way, but to the 'x' and 'y' columns, then renamed those columns (in place) to "longitude" and "latitude". It's a small difference, but I saw it amounted to about a 1.5% speedup.
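A minimal sketch of that end state, assuming transformer is a pyproj Transformer (the CRS codes and coordinates here are illustrative, not the repository's actual values):

```python
import pandas as pd
from pyproj import Transformer

rdf = pd.DataFrame({"x": [-1.8e6, -1.5e6], "y": [1.5e6, 1.7e6]})
transformer = Transformer.from_crs("EPSG:5070", "EPSG:4326", always_xy=True)

# Assign the transformed values back onto the existing x/y columns...
rdf["x"], rdf["y"] = transformer.transform(rdf["x"].values, rdf["y"].values)
# ...then rename those columns in place instead of deleting and re-adding them.
rdf.rename(columns={"x": "longitude", "y": "latitude"}, inplace=True)
```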

@alexandermorgan (Author)

Thank you for the extra tests you did! I'm not sure why those passed earlier, but coming back to it now I was able to reproduce the same error you mentioned. Stepping back into the bespoke.py file, I noticed a way to refactor unpack_turbines to do a lot less individual copying and concatenating, and to assign values column-wise instead of element-wise. That was more significant resulting in a ~10-12% decrease in runtime.
I also went ahead and set the reset_index calls in model.py and element_builders.py to happen in place, since they were overwriting the original dataframes anyway; without inplace=True, pandas copies the entire DataFrame. There's also a reset_index call in test_plots.py that doesn't have inplace=True, but that one can't be done in place since the index isn't dropped and the call changes the object from a Series to a DataFrame.
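For example (illustrative DataFrame, not the code in model.py):

```python
import pandas as pd

df = pd.DataFrame({"value": [3, 1, 2]}).sort_values("value")

# df = df.reset_index(drop=True)        # returns (and copies into) a new DataFrame
df.reset_index(drop=True, inplace=True)  # resets the index on df itself, no copy
```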
Regarding the earlier changes to characterizations.validate_characterization_remapper, those simplifications were more for style than for actual performance improvement. The changes would cause less work to be done, but only a tiny amount, since it's just dealing with a few keys and things like that. This time around, however, I noticed that characterizations.unpack_characterizations was overwriting its return dataframe with a copy just before returning it. Throughout the function, though, it changes that in_df in place, which makes the final copy look unnecessary, so I removed it.
As far as my benchmarking goes, I didn't do anything too elaborate: just calling the changed functions with the sample data you mentioned in a script and using timeit to measure before-and-after timings.
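Something along these lines (changed_function and the number of runs are illustrative stand-ins; the CSV path is the sample data mentioned above):

```python
import timeit

import pandas as pd


def changed_function(df):
    # Hypothetical stand-in for whichever updated reView function is being timed.
    return df[df["capacity"] > 0].reset_index(drop=True)


sc_df = pd.read_csv("tests/data/bespoke-supply-curve.csv")
runs = 5
elapsed = timeit.timeit(lambda: changed_function(sc_df.copy()), number=runs)
print(f"mean runtime over {runs} runs: {elapsed / runs:.4f} s")
```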

@mjgleason (Collaborator)

Thanks @alexandermorgan. I'll take another look, hopefully later this week. Will post my benchmarks on the latest code here as well, for posterity.

@mjgleason (Collaborator) commented May 10, 2024

Benchmark: test_unpack_turbines_happy

Summary: ~20% speedup from main

Benchmark settings:

benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)

Dataset size: 10 records

Branch: main

----------------------------------------------------- benchmark: 1 tests ----------------------------------------------------
Name (time in ms)                   Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------
test_unpack_turbines_happy     740.4823  808.7749  781.6769  31.2783  795.6267  55.1303       1;0  1.2793       5           1
-----------------------------------------------------------------------------------------------------------------------------

Branch: pr55

------------------------------------------------------ benchmark: 1 tests -----------------------------------------------------
Name (time in ms)                   Min       Max      Mean    StdDev    Median       IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------
test_unpack_turbines_happy     583.2502  829.1872  672.8237  102.7993  622.8418  151.2235       1;0  1.4863       5           1
-------------------------------------------------------------------------------------------------------------------------------

Dataset Size: 1000 records

Branch: main

-------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in s)                   Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------
test_unpack_turbines_happy     61.2281  64.9779  62.6157  1.6454  61.6155  2.5542       1;0  0.0160       5           1
-----------------------------------------------------------------------------------------------------------------------

Branch: pr55

-------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in s)                   Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------
test_unpack_turbines_happy     48.2061  49.9964  49.0719  0.6842  48.9526  0.9799       2;0  0.0204       5           1
-----------------------------------------------------------------------------------------------------------------------

@mjgleason (Collaborator)

Benchmark 2: test_unpack_turbines_parallel

Summary: ~10% speedup

Benchmark settings:

benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)

Dataset Size: 1000 records

Branch: main

--------------------------------------------------- benchmark: 1 tests ---------------------------------------------------
Name (time in s)                      Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------
test_unpack_turbines_parallel     32.3119  33.5078  32.8476  0.4892  32.8665  0.8010       2;0  0.0304       5           1
--------------------------------------------------------------------------------------------------------------------------

Branch: pr55

--------------------------------------------------- benchmark: 1 tests ---------------------------------------------------
Name (time in s)                      Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------
test_unpack_turbines_parallel     29.5965  30.8086  29.9385  0.5039  29.6873  0.5150       1;0  0.0334       5           1
--------------------------------------------------------------------------------------------------------------------------

@mjgleason (Collaborator) left a comment

Thanks for the contributions @alexandermorgan . Based on the benchmarks I ran (posted in the comments above), the overall speed improvements are on the order of 10-20% for the unpack_turbines command.

@WilliamsTravis I've approved this PR. All tests are now passing, except for the one that was already failing on main.

I did not test the app manually, so you may want to do that before merging, but up to you.

Note that I do not know if this actually closes issue #9, which seems more focused on render performance than actual unpacking performance. Again, maybe that's something you could test manually in the app.
