Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvement for new Datapack changes: Reduce excessive calls related to DataPack and SortedList (based on profiling analysis) #904

Closed
J007X opened this issue Nov 15, 2022 · 2 comments · Fixed by #907
Assignees

Comments

@J007X
Copy link
Collaborator

J007X commented Nov 15, 2022

Describe the bug
In an effort to measure speed of the newly changed DataPack related code as described in issue 805( #805) and find out possible places to improve, a profiling test was created and helped to identify some places in the current code base (related to DataPack , SortedList) to seemingly make excessive calls, (in some cases 20x more than in 0.2.0 code base according to cProfile report,) which rendering significant slowing down of the current running result on certain (typical) scenarios. This should be able to be fixed by using related generator/list differently / more efficiently to reduce excessive calls.

To Reproduce
Steps to reproduce the behavior:

  1. Download 0.2.0 code base to a new folder
  2. Write a profiling (unit) test using typical forte pipleline to process (such as annotations) , or simply check out "data_pack_profiling_test.py" in the branch for implementing issue 805 profiling task (https://github.com/J007X/forte/tree/implementation_805)
  3. Running the same profiling test on both current code base and 0.2.0 code base. Compare the running speed between running this profile test

Expected behavior
Notice the performance difference (currently significantly slower), and the excessive calls in the current codebase compare to 0.2.0 running result.
Expected desirable result (goal): the current version should be on par or faster than 0.2.0 in those typical scenarios.

Screenshots

Environment (please complete the following information):

  • OS: MacOS
  • Version 12.0.1

Additional context
Per discussion prior to creating this issue, SortedList itself is not the problem here as it was also used in 0.2.0 and not changed, so it should be some changes in the new code that accidentally rendering some methods such as sort being called excessively.

@J007X J007X self-assigned this Nov 15, 2022
@J007X
Copy link
Collaborator Author

J007X commented Nov 16, 2022

Analysis of the possible causes: Profiling test points out the slow down seems to related to the "get()" method of DataPack causing annotation (a Sortedlist) to have excessive calls (to sorting and other operations), after some more research and experiment this now can be narrow down to line 1602 of the current code of "DataPack.py" which (using the iterator as a list to) check for empty.
if len(self.annotations) == 0
This line is actually not changed from 0.2.0 but because of the underlying mechanism for self.annotations is completely changed (from list to generator) so this seeming innocent call in new code base actually causing SortedList being constructed for the internal iterator (self.all_annotations), and then get the length. Unfortunately this is not called in a "one-off" manner but repeated many times so large amount of excessive calls were made, including sorting which is totally unnecessarily in this case.

@J007X
Copy link
Collaborator Author

J007X commented Nov 16, 2022

Solution options: (multiple)

  1. a very conservative change can be made to this line is like:
    if len(list(self.all_annotations)) == 0
    basically using the list() to wrap the internal "all_annotations" instead if default wrapping using "SortedList", this is guaranteed to be fully compatible to all previous usages/use cases, and the speed up is over 300%

  2. However after further researching the code of get() for DataPack it seemed this area of old 0.2.0 can have more changes, for example if we simply remove this "if" block for returning empty iterators for annotations and audio_annotations, This should still work at least in most cases and it did passed several related unit tests that I checked.
    By this simple change since we did not iterate through elements so the speedup can be further increased to over 760%: (only take 18.8 seconds to finish vs. original 145 seconds)
    The only drawback of this fixing seems to be in theory sometimes the get() might need to return "empty iterator" (However this might be some old requirement/use cases when using list to implement and not nested iterator). But since currently there might not be some good way to check generator style iterator for empty without risking losing the first element (when not empty) in a chained/nested generator/iterator scenario like ours, I think we might want to review/re-consider this special "returning empty iterator" use case (vs. some simpler implementation such as to catch exceptions from the caller side or outside of the iterator)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant