Performance improvement for new Datapack changes: Reduce excessive calls related to DataPack and SortedList (based on profiling analysis) #904

J007X · 2022-11-15T07:27:28Z

Describe the bug
In an effort to measure speed of the newly changed DataPack related code as described in issue 805( #805) and find out possible places to improve, a profiling test was created and helped to identify some places in the current code base (related to DataPack , SortedList) to seemingly make excessive calls, (in some cases 20x more than in 0.2.0 code base according to cProfile report,) which rendering significant slowing down of the current running result on certain (typical) scenarios. This should be able to be fixed by using related generator/list differently / more efficiently to reduce excessive calls.

To Reproduce
Steps to reproduce the behavior:

Download 0.2.0 code base to a new folder
Write a profiling (unit) test using typical forte pipleline to process (such as annotations) , or simply check out "data_pack_profiling_test.py" in the branch for implementing issue 805 profiling task (https://github.com/J007X/forte/tree/implementation_805)
Running the same profiling test on both current code base and 0.2.0 code base. Compare the running speed between running this profile test

Expected behavior
Notice the performance difference (currently significantly slower), and the excessive calls in the current codebase compare to 0.2.0 running result.
Expected desirable result (goal): the current version should be on par or faster than 0.2.0 in those typical scenarios.

Screenshots

Environment (please complete the following information):

OS: MacOS
Version 12.0.1

Additional context
Per discussion prior to creating this issue, SortedList itself is not the problem here as it was also used in 0.2.0 and not changed, so it should be some changes in the new code that accidentally rendering some methods such as sort being called excessively.

J007X · 2022-11-16T12:22:26Z

Analysis of the possible causes: Profiling test points out the slow down seems to related to the "get()" method of DataPack causing annotation (a Sortedlist) to have excessive calls (to sorting and other operations), after some more research and experiment this now can be narrow down to line 1602 of the current code of "DataPack.py" which (using the iterator as a list to) check for empty.
if len(self.annotations) == 0
This line is actually not changed from 0.2.0 but because of the underlying mechanism for self.annotations is completely changed (from list to generator) so this seeming innocent call in new code base actually causing SortedList being constructed for the internal iterator (self.all_annotations), and then get the length. Unfortunately this is not called in a "one-off" manner but repeated many times so large amount of excessive calls were made, including sorting which is totally unnecessarily in this case.

J007X · 2022-11-16T12:30:47Z

Solution options: (multiple)

a very conservative change can be made to this line is like:
if len(list(self.all_annotations)) == 0
basically using the list() to wrap the internal "all_annotations" instead if default wrapping using "SortedList", this is guaranteed to be fully compatible to all previous usages/use cases, and the speed up is over 300%
However after further researching the code of get() for DataPack it seemed this area of old 0.2.0 can have more changes, for example if we simply remove this "if" block for returning empty iterators for annotations and audio_annotations, This should still work at least in most cases and it did passed several related unit tests that I checked.
By this simple change since we did not iterate through elements so the speedup can be further increased to over 760%: (only take 18.8 seconds to finish vs. original 145 seconds)
The only drawback of this fixing seems to be in theory sometimes the get() might need to return "empty iterator" (However this might be some old requirement/use cases when using list to implement and not nested iterator). But since currently there might not be some good way to check generator style iterator for empty without risking losing the first element (when not empty) in a chained/nested generator/iterator scenario like ours, I think we might want to review/re-consider this special "returning empty iterator" use case (vs. some simpler implementation such as to catch exceptions from the caller side or outside of the iterator)

J007X self-assigned this Nov 15, 2022

This was referenced Nov 30, 2022

Performance improvement 904 #907

Merged

Performance improvement (for new Datapack changes): Improve class loading performance (based on profiling analysis on new use of get_class) #908

Closed

hunterhector closed this as completed in #907 Dec 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvement for new Datapack changes: Reduce excessive calls related to DataPack and SortedList (based on profiling analysis) #904

Performance improvement for new Datapack changes: Reduce excessive calls related to DataPack and SortedList (based on profiling analysis) #904

J007X commented Nov 15, 2022

J007X commented Nov 16, 2022

J007X commented Nov 16, 2022

Performance improvement for new Datapack changes: Reduce excessive calls related to DataPack and SortedList (based on profiling analysis) #904

Performance improvement for new Datapack changes: Reduce excessive calls related to DataPack and SortedList (based on profiling analysis) #904

Comments

J007X commented Nov 15, 2022

J007X commented Nov 16, 2022

J007X commented Nov 16, 2022