Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve complex forms performance #740

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

lindsay-stevens
Copy link
Contributor

@lindsay-stevens lindsay-stevens commented Nov 30, 2024

Relevant to #739 and significantly improves performance and memory usage, but does not fully resolve the #739 in particular the recursion problem with relative references in repeats. However the changeset is already broad in scope and a large performance improvement so I would like to merge before other PRs come in.

Why is this the best possible solution? Were any other approaches considered?

The problem form mentioned in #739 now converts in about 8 seconds (excl. 6s for Validate) rather than about 5 minutes. In terms of memory usage, from a pre-conversion baseline of 35-40MB (resident/RSS), usage peaks at about 110MB rather than about 145MB.

Here is an outline of the pyxform conversion pipeline and where notable changes were made:

  1. xls2xform.py: CLI or API request to read from a file or memory buffer
  2. xls2json_backends.py: convert to rows of dict from rectangular data source (XLS/X, CSV, MD)
    1. OrderedDict was being used to support <py3.7 but now dict retains order and uses half the memory per object (64B vs 128B). So if the input XLSForm has 10k rows of questions/choices, that's 640KB difference.
  3. xls2json.py: convert rows of dict to untyped dict tree structure
    1. Move validate_choices to the validators module so the relevant data is in scope for as short time as possible
    2. Move "or_other" choice / InputQuestion insertion from builder.py, to allow a row-specific error and earlier finalisation of the choices lists
  4. builder.py: convert untyped dict tree structure into SurveyElement-based tree structure
    1. Avoid copying the input dict data, by removing modifications (e.g. pop) that would have made copying necessary
    2. For each Question with an itemset, use the Options (choices) instances from Survey instead of making new ones
    3. For SurveyElement, switch from using a dict to a slotted Mapping (dict interface but not a dict) to save memory and make it clearer what attributes are relevant to each class. For example an Option (which represents a choice) would get about 60 attributes (now ~8) - most of which empty strings but many are dicts or lists. Because a lot of code was not tolerant of None as a default value, all the fields were initialised, which wastes time and memory. This refactoring took a long time because many fields required software archaeology to find where is it used/tested/documented, and/or why/when was it added. Some more fields could probably be culled e.g. SMS/compact fields overlap and might be deprecated, etc.
    4. For Questions, merge the input kwargs with the question type dictionary (QTD - a bunch of bind/control/etc defaults for various question types) during __init__. Before, at the SurveyElement level, every attribute access (__getattr__) was being run through a check or merge against the QTD. For example Option()["label"] checked if "label" was in the QTD - lots of those 60 or so attributes are not in the QTD so a lot of wasted time. A worse situation existed for questions, e.g. for Question()["bind"] there would be a check against QTD, and since most question types have a "bind" default in the QTD, a copy would be created of the "bind" from QTD, add merged with the Question["bind"] value - not just once but every time "bind" was accessed, so a lot of wasted time and memory.
    5. On the QTD topic, the Question class still potentially is wasting some time/memory. The tests (mainly internal APIs) require that QTD default values are not output by SurveyElement.to_json_dict. For example if I create a start-geopoint, the "bind" info {"type": "geopoint"} is excluded. I think it's reasonable to include the QTD defaults since it's part of the Question definition, but to avoid additional confusion from more modified tests, for now the original input kwargs are stored separately from the QTD merged data, so that the QTD data can be excluded.
    6. Avoid iterating the SurveyElement tree structure where possible.
      1. Pass the Survey object into methods that need it, instead of iterating upwards from the current object to find it every time. For example Question.xml_instance gets called for every question, and self.get_root would go through 1 to N iterations of element.parent until it found the survey.
      2. Cache each element xpath in _survey_element_xpath after calculating it for the first time - there is no code that involves changing the tree after it is constructed, but just in case the cache is invalided by __setattr__ if "parent" changes.
      3. Add to iter_descendants a condition option to filter the returned elements during iteration. This doesn't reduce the traversal, but on the caller side fewer objects are returned for iteration/processing. For example in Survey.xml_descendent_bindings, Option and Tag don't have bindings so we can skip xml_bindings for them.
    7. For forms with translations, the language codes are validated against IANA subtags. Previous solution loaded all 8138 options (525KB) into a list by default then ran a membership check (O(n)) to search for each language. New solution loads 2-character codes first (190 items / 8KB) as a set (O(1)) then if that fails load and check the full list as a set. Best case: default, invalid, or short language codes get faster lookup with less memory used than before. Worst case: longer language codes get faster lookup with the same memory used as before.
  5. utils.py (node): convert SurveyElement-based tree structure into xml.dom.minidom.Element-based tree structure
    1. Check the type of only as many childNodes as needed instead of all of them
    2. Add combine_lists for coalescing/chaining iterables instead of creating new lists or concatenation
  6. survey.py: minidom.Element-based tree structure into XML string
  7. survey.py: write XML string to a temp file for ODK Validate to execute javarosa read check
  8. xls2xform.py: return converted data or write XML to file

General improvements applied in numerous places:

  1. Avoid using an empty dict as a default value for get and instead handle the None case.
  2. If a reference container (e.g. in constants.py) is used for a membership check (x in y), convert it to a set (O(1)) instead of a list (O(N)). Compared to a list, a set uses more memory (216B vs 56B) but generally these reference objects have one copy only so the speed tradeoff seems worthwhile - particularly SUPPORTED_MEDIA which is in a hot loop (referenced for each media dict key for any Question or Option).
  3. Use f-strings to insert variables instead of join or + to do the same thing
  4. yield / from for iteration instead of appending to a list and returning the list to do the same
  5. Use tuple instead of list where an ordered sequence is needed 1) save 16 bytes per object (40 vs 56), 2) avoid rogue modifications to sequences that probably shouldn't change mid-pipeline.

Further changes not included in this PR that could further help performance:

  1. The input data is copied at pipeline steps 2 (read), 3 (to untyped dict), 4 (to SurveyElement), 5 (to minidom Element), 6 (to string). Similar to my experience with the JVM, a Python process will not reliably release memory (even if after forced garbage collection) so to reduce process memory usage requires a focus on reducing peak usage. Some ideas:
    1. Adapt xls2json_backends and workbook_to_json to emit / accept generators instead of processing and passing the full XLSForm data. Potentially each row could be a generator as well.
    2. The data copying in pipeline step 5 (minidom Element) seems to consume the most memory. The SurveyElement could be adapted to output XML (i.e. avoid copying data to Elements at all), or the Element class could be changed to sax/event-driven incremental string output (Elements created/kept only as long as it takes to produce their XML). The XML probably can't be streamed back to the caller though because the entire document is needed by default for the ODK Validate check.
      1. Update ODK Validate to accept XForm input from stdin instead of reading only from files, to avoid the need for a tempfile during processing.
    3. Adapt workbook_to_json to produce the SurveyElement tree directly, instead of passing the data to the builder module. On that note the SurveyElement classes would need to (and probably should anyway) be able to deal with instantiating themselves without the builder module.
  2. As mentioned above I haven't really touched on the "relative references in repeats" recursion problem. For the relevant functions I did reduce the lru_cache arg maxsize values since 128 is the default and probably a reasonable point of diminishing return of performance vs. memory usage (seemed to be the case for parse_expression). About a year ago I changed these maxsize args from None (unbounded cache) to 65536 (a guess but at least not unbounded).
  3. Review and optimise strings processing/cleaning. It seems like strings are cleaned or examined multiple times during processing e.g. cleaning in xls2json_backends and workbook_to_json; checking for regex/token matches in workbook_to_json and SurveyElement processing. If it's too hard/awkward to consolidate completely then maybe each string could have metadata noted somehow e.g. "this string was cleaned already", and "this string does/not contain references".
  4. Some SurveyElement functions relate only to certain subclasses and could be moved to those classes, or a mixin / intermediate subclass. e.g. a fair bit of awkward "hasattr" or "isinstance" remains, not all classes have children or bind etc.

What are the regression risks?

Minimal tests fixes were required for these changes, so any issues should be more for projects using pyxform internal classes or APIs. It also means that if something is broken then we didn't have a test for it and so it should have a test. Potential compatibility issues:

  • removed some SurveyElement fields that seem to not be used at all, and SurveyElement classes will not have all 60+ fields anymore so any calling code will need to be tolerant of that by a) checking hasattr or isinstance and b) checking for None.
  • any extra data passed in as kwargs that is not a declared attribute is put into a new extra_data dict, and is excluded from to_json_dict output. The extra_data gets used when specifying extra choices columns, to generate choices output for that data.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No, but as mentioned above it would be good to check what the support situation is for SMS/compact and whether that is still needed, and check whether all the Survey attributes correspond to documented and supported features (either document them if they are supported, or remove them if they aren't).

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyxform tests and ruff check pyxform tests to lint code
  • verified that any code or assets from external sources are properly credited in comments

- builder.py
  - replace "copy_json_dict" with copy.deepcopy since that's what it
    seems to be doing (probably less efficiently)
- question.py
  - for each SurveyElement subclass that overrides __init__, ensure
    that it calls the parent / super().__init__ (and calls it last)
  - to allow the above, rearrange the choices/children/tags into just
    "children" so that the base __init__ can add them.
  - generally the goal is to avoid use cases where the dict info from
    the XLSForm spec is modified after __init__ to help with the below
- survey_element.py
  - avoid recursive calls through __getattr__ by using the base class
    dict.get when those keys are known to exist and aren't in FIELDS
  - get the default info from the question_type_dict once at init time,
    instead of potentially looking it up on every call to __getattr__
  - avoid many calls to get_lineage by caching the xpath after it
    is calculated once, and invalidate the cache if the parent changes
    (although this doesn't seem to ever happen during test runs).
  - avoid many calls to iter_descendants by caching the result of
    "any_repeat" (dict can be quite large since it's every element).
    - This does not seem like the optimal solution yet since there are
      many related recursive calls that could be avoided
    - Also this function is only used for the survey class so it would
      make sense to move it there.
- for the form described in central/XLSForm#171, the form conversion total time
  is reduced from about 5 minutes to about 45 seconds. Resident memory
  usage about 43MB baseline and 150MB after conversion.
- instance.py
  - gather the survey xpaths since _xpath now tracks the objects
- survey.py
  - when flattening the objects in _setup_xpath_dict, keep the object
    references instead of just the xpath, for future use
  - skip calling the heavily recursive share_same_repeat_parent if the
    target and context items have no common ancestor that is a repeat
  - incorporate "any_repeat" func body into is_parent_a_repeat since
    it is not re-used elsewhere and to avoid an extra lru_cache.
- survey_element.py
  - add method to iterate ancestors, to find common items for relative
    references, rather than iterating down from the survey/root
  - add method to find nearest common ancestor repeat (if any).
    Currently only used to discern objects with no such ancestor but
    could be developed further to replace "share_same_repeat_parent".
- wastes cpu time and memory, but likely only noticeable on large forms
- presumably being done to avoid changing input dictionaries. Not sure
  if that is a necessary guarantee to make. Should be possible to avoid
  the need to copy dicts at all, by only reading from them instead of
  making changes during processing e.g. replace dict.pop() with
  dict.get() and skip using that key later.
- builder.py
  - in _create_section_from_dict (applies to Survey, Group, Repeat) the
    input dict data was being copied by copy then deepcopy for every
    child element e.g. firstly the entire Survey, then each Group or
    Repeat recursively in the data.
  - to avoid this while maintaining backwards compatibility, now the
    public method create_survey_element_from_dict will make a deepcopy
    of the input data (the entire Survey), and all private (recursive)
    methods will use that copy rather than making additional copies.
    - an exception is _name_and_label_substitutions which seems to still
      need copies in order to generate language-specific copies of data.
- question_type_dictionary.py
  - this reference dict was being copied for every SurveyElement
  - never modified so no need to make a copy
  - to avoid unintentional modifications in the future, replace the
    original object with a MappingProxy which is a read-only view.
- avoid repeatedly traversing the object tree to find the Survey, and
  instead pass it to where it is needed. Maybe a looks more verbose but
  it is faster, and clearer where the Survey is being used.
- entities.entity_declaration.py
  - use provided survey object instead of lookup with self.get_root()
- parsing.expression.py
  - use slotted class since it seems somewhat faster and lower memory
- parsing.instance_expression.py
  - inline function to allow left-to-right condition optimisation
- external_instance.py, question.py, section.py
  - use provided survey object instead of lookup with self.get_root()
- survey.py
  - filter iter_descendants types during iteration instead of after, to
    avoid processing them twice
  - convert most of model_children generator functions to return
    generators and pass them to node() as such to reduce intermediate
    lists held in memory
  - move "xml_descendent_bindings" and "xml_actions" from survey_element
    since they are only called by the Survey class
- survey_element.py
  - add option to filter iter_descendants and iter_ancestors by
    providng a condition callable
  - use provided survey object instead of lookup with self.get_root()
  - remove unused functions: get_lineage, get_root, get_media_keys
  - fix unnecessary value checks in xml_bindings (the key can only be
    one of the "if" options).
  - simplify xml_action
- utils.py
  - allow node() args to be a Generator, and if so append each element
- since py3.7 the order of keys is retained by the dict. A dict uses
  64 bytes whereas an OrderedDict uses 128 bytes.
- tidied up xls2json_backends imports by using only specific imports
  instead of all of openpyxl and xlrd, some of which overlapped.
- simplified _list_to_dict_list.
- pass choices info to new function so that the variables needed are in
  a different function scope:
  - peak memory usage: fewer live objects until workbook_to_json returns
  - debugging: many variables in workbook_to_json already
- also fix xpath_count test failure message (copypasta from xpath_match)
- previous solution loaded all 8138 options into a list by default then
  ran a membership check (O(n)) to search for each language.
- optimisations:
  - if the language name is "default", or is too short to match the
    language code regex, then skip executing the regex.
  - read the code list data at most once, only if needed.
  - put the code strings into a set for faster membership check (O(1)).
  - split the code list into the shorter list of 2-character codes
    (e.g. en, fr, de), and check that first. Assuming these shorter
    codes are more likely to be used, it is faster to check and uses
    less memory than loading the full list (8KB short vs 525KB).
- best case: default, invalid, or short language codes get faster
  lookup with less memory used than before.
- worst case: longer language codes get faster lookup with the same
  memory used as before.
- avoids creation of an intermediate list per repeat section, and calls
  to list.append() for each node
- avoids iterating through "children" elements that are not part of a
  section, e.g. Options of a Question
- the order does not seem to be significant in any usages - rather these
  variables are used for membership checks e.g. to validate input. In
  which case the set lookup is O(1) vs. O(N) which can be significant
  for large forms, when each question is checked against one or more of
  these collections, potentially more than once.
- in writexml(), use `any()` to avoid evaluating the whole sequence
  since it only matters if there is 1 or more NODE_TYPE_TEXT.
- in node()
  - use a set for faster lookup of blocked_attributes (it's 1 element
    but in case more are added it could stay O(1))
  - use a tuple for `unicode_args` for slightly less memory
  - remove unnecessary `iter()` on `kwargs.items()`
  - use one f-string for the parse string rather than concat 7 items
  - skip appendChild for `None` elements
- test_xform_conversion.py: remove unnecessary warnings arg
- test_builder.py:
  - remove self.maxDiff=None since this is done at the class level
  - swap AssertEqual arg order since first arg referred to as the
    "expected" value in test failure messages.
- test_choices_sheet.py: remove unnecessary position() assertion
- test_fieldlist_labels.py: add appearance assertion since this does
  not otherwise seem to be tested for groups
- test_fields.py: remove debug setting
- test_groups.py: add test for group relevance (no other tests for this)
- test_image_app_parameter.py: add assertion for appearance, since this
  does not otherwise seem to be tested for questions
- test_survey.py: add test for autoplay setting
- test_translations.py: fix typo in label translation header
- usage of lru_cache on `parse_expression` helps performance but it
  seems to be a diminishing return for cache sizes > 128, and the
  memory used by the cached strings and ExpressionLexerTokens can become
  significant if there are lots of long strings being parsed
- added option to get the parsed token type only, since calls through
  `is_single_token_expression` only care about the token type
- for token type checks, ignore empty strings or strings that would be
  to short to be that token type
- the main problem being addressed is that SurveyElement had a lot of
  fields which are not relevant to every subtype, and many of these
  fields are containers, which wastes time and memory.
  - For example an Option (which represents a choice) would get about
    60 attributes (now ~8) - most of which empty strings but many are
    dicts or lists. For large forms this can really stack up and consume
    a lot of memory. A lot of code was not tolerant of None as a default
    value so all the fields were initialised.
  - The dynamic class approach would have made more sense in the earlier
    days of XLSForm when the columns and behaviour were changing rapidly
    and there were not many fields. But now there are more features, and
    specs, and extensive test suites that make it useful to use a more
    clearly typed class approach.
  - Having concrete types also makes development easier, by having
    missing attribute or type warnings. When looking at the git history,
    clearly it was also confusing about what is a field - for example
    "jr:count" (repeat_count) is part of the bind dict, not a top-level
    attribute of a Section.
  - The changes in this commit could be taken further, for example:
    - some SurveyElement funcs relate to certain types and could be
      moved to those types, or a mixin / intermediate subclass. i.e.
      a fair bit of awkward "hasattr" or "ininstance" remains.
    - the builder.py module externalises the initialisation of the
      SurveyElements but perhaps this could be done by SurveyElements
- other issues addressed:
  - avoid copying dicts unnecessarily
  - use __slots__ on all SurveyElement types to reduce memory usage
  - do "or other" question/choice insertion in xls2xform instead of the
    builder module, so that row-specific errors can be shown, and the
    choice lists are finalised sooner
- potential compatibility issues:
  - minimal tests fixes were required for these changes, so any issues
    would be more for projects using pyxform internal classes or APIs.
  - removed some SurveyElement fields that seem to not be used at all,
    and SurveyElement classes will not have all 60+ fields anymore.
  - any extra data passed in as kwargs that is not a declared attribute
    is put into a new `extra_data` dict, and is excluded from
    `to_json_dict` output. The `extra_data` gets used when specifying
    extra choices columns, to generate choices output for that data.
@lindsay-stevens lindsay-stevens marked this pull request as ready for review December 2, 2024 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant