Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve complex forms performance #740

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Commits on Nov 14, 2024

  1. fix: poor performance with large and complex forms with many references

    - builder.py
      - replace "copy_json_dict" with copy.deepcopy since that's what it
        seems to be doing (probably less efficiently)
    - question.py
      - for each SurveyElement subclass that overrides __init__, ensure
        that it calls the parent / super().__init__ (and calls it last)
      - to allow the above, rearrange the choices/children/tags into just
        "children" so that the base __init__ can add them.
      - generally the goal is to avoid use cases where the dict info from
        the XLSForm spec is modified after __init__ to help with the below
    - survey_element.py
      - avoid recursive calls through __getattr__ by using the base class
        dict.get when those keys are known to exist and aren't in FIELDS
      - get the default info from the question_type_dict once at init time,
        instead of potentially looking it up on every call to __getattr__
      - avoid many calls to get_lineage by caching the xpath after it
        is calculated once, and invalidate the cache if the parent changes
        (although this doesn't seem to ever happen during test runs).
      - avoid many calls to iter_descendants by caching the result of
        "any_repeat" (dict can be quite large since it's every element).
        - This does not seem like the optimal solution yet since there are
          many related recursive calls that could be avoided
        - Also this function is only used for the survey class so it would
          make sense to move it there.
    - for the form described in central/XLSForm#171, the form conversion total time
      is reduced from about 5 minutes to about 45 seconds. Resident memory
      usage about 43MB baseline and 150MB after conversion.
    lindsay-stevens committed Nov 14, 2024
    Configuration menu
    Copy the full SHA
    42bb2d6 View commit details
    Browse the repository at this point in the history
  2. fix: poor performance with large and complex forms with many references

    - instance.py
      - gather the survey xpaths since _xpath now tracks the objects
    - survey.py
      - when flattening the objects in _setup_xpath_dict, keep the object
        references instead of just the xpath, for future use
      - skip calling the heavily recursive share_same_repeat_parent if the
        target and context items have no common ancestor that is a repeat
      - incorporate "any_repeat" func body into is_parent_a_repeat since
        it is not re-used elsewhere and to avoid an extra lru_cache.
    - survey_element.py
      - add method to iterate ancestors, to find common items for relative
        references, rather than iterating down from the survey/root
      - add method to find nearest common ancestor repeat (if any).
        Currently only used to discern objects with no such ancestor but
        could be developed further to replace "share_same_repeat_parent".
    lindsay-stevens committed Nov 14, 2024
    Configuration menu
    Copy the full SHA
    3a9d0bb View commit details
    Browse the repository at this point in the history

Commits on Nov 16, 2024

  1. fix: avoid unnecessary dict.copy() or copy.deepcopy()

    - wastes cpu time and memory, but likely only noticeable on large forms
    - presumably being done to avoid changing input dictionaries. Not sure
      if that is a necessary guarantee to make. Should be possible to avoid
      the need to copy dicts at all, by only reading from them instead of
      making changes during processing e.g. replace dict.pop() with
      dict.get() and skip using that key later.
    - builder.py
      - in _create_section_from_dict (applies to Survey, Group, Repeat) the
        input dict data was being copied by copy then deepcopy for every
        child element e.g. firstly the entire Survey, then each Group or
        Repeat recursively in the data.
      - to avoid this while maintaining backwards compatibility, now the
        public method create_survey_element_from_dict will make a deepcopy
        of the input data (the entire Survey), and all private (recursive)
        methods will use that copy rather than making additional copies.
        - an exception is _name_and_label_substitutions which seems to still
          need copies in order to generate language-specific copies of data.
    - question_type_dictionary.py
      - this reference dict was being copied for every SurveyElement
      - never modified so no need to make a copy
      - to avoid unintentional modifications in the future, replace the
        original object with a MappingProxy which is a read-only view.
    lindsay-stevens committed Nov 16, 2024
    Configuration menu
    Copy the full SHA
    a84e75f View commit details
    Browse the repository at this point in the history
  2. fix: avoid repeatedly traversing SurveyElements to find the Survey

    - avoid repeatedly traversing the object tree to find the Survey, and
      instead pass it to where it is needed. Maybe a looks more verbose but
      it is faster, and clearer where the Survey is being used.
    - entities.entity_declaration.py
      - use provided survey object instead of lookup with self.get_root()
    - parsing.expression.py
      - use slotted class since it seems somewhat faster and lower memory
    - parsing.instance_expression.py
      - inline function to allow left-to-right condition optimisation
    - external_instance.py, question.py, section.py
      - use provided survey object instead of lookup with self.get_root()
    - survey.py
      - filter iter_descendants types during iteration instead of after, to
        avoid processing them twice
      - convert most of model_children generator functions to return
        generators and pass them to node() as such to reduce intermediate
        lists held in memory
      - move "xml_descendent_bindings" and "xml_actions" from survey_element
        since they are only called by the Survey class
    - survey_element.py
      - add option to filter iter_descendants and iter_ancestors by
        providng a condition callable
      - use provided survey object instead of lookup with self.get_root()
      - remove unused functions: get_lineage, get_root, get_media_keys
      - fix unnecessary value checks in xml_bindings (the key can only be
        one of the "if" options).
      - simplify xml_action
    - utils.py
      - allow node() args to be a Generator, and if so append each element
    lindsay-stevens committed Nov 16, 2024
    Configuration menu
    Copy the full SHA
    188cb51 View commit details
    Browse the repository at this point in the history

Commits on Nov 23, 2024

  1. fix: reduce memory usage by switching from OrderedDict to dict

    - since py3.7 the order of keys is retained by the dict. A dict uses
      64 bytes whereas an OrderedDict uses 128 bytes.
    - tidied up xls2json_backends imports by using only specific imports
      instead of all of openpyxl and xlrd, some of which overlapped.
    - simplified _list_to_dict_list.
    lindsay-stevens committed Nov 23, 2024
    Configuration menu
    Copy the full SHA
    a33a4a8 View commit details
    Browse the repository at this point in the history
  2. fix: simplify choices validation in workbook_to_json

    - pass choices info to new function so that the variables needed are in
      a different function scope:
      - peak memory usage: fewer live objects until workbook_to_json returns
      - debugging: many variables in workbook_to_json already
    lindsay-stevens committed Nov 23, 2024
    Configuration menu
    Copy the full SHA
    54c7d28 View commit details
    Browse the repository at this point in the history

Commits on Nov 25, 2024

  1. fix: when testing use converted xform xml instead of converting again

    - also fix xpath_count test failure message (copypasta from xpath_match)
    lindsay-stevens committed Nov 25, 2024
    Configuration menu
    Copy the full SHA
    24e2ecc View commit details
    Browse the repository at this point in the history
  2. fix: iana subtag language lookup optimisations

    - previous solution loaded all 8138 options into a list by default then
      ran a membership check (O(n)) to search for each language.
    - optimisations:
      - if the language name is "default", or is too short to match the
        language code regex, then skip executing the regex.
      - read the code list data at most once, only if needed.
      - put the code strings into a set for faster membership check (O(1)).
      - split the code list into the shorter list of 2-character codes
        (e.g. en, fr, de), and check that first. Assuming these shorter
        codes are more likely to be used, it is faster to check and uses
        less memory than loading the full list (8KB short vs 525KB).
    - best case: default, invalid, or short language codes get faster
      lookup with less memory used than before.
    - worst case: longer language codes get faster lookup with the same
      memory used as before.
    lindsay-stevens committed Nov 25, 2024
    Configuration menu
    Copy the full SHA
    2a4ba80 View commit details
    Browse the repository at this point in the history
  3. fix: generate dynamic default setvalues instead of accumulating to list

    - avoids creation of an intermediate list per repeat section, and calls
      to list.append() for each node
    - avoids iterating through "children" elements that are not part of a
      section, e.g. Options of a Question
    lindsay-stevens committed Nov 25, 2024
    Configuration menu
    Copy the full SHA
    48b3969 View commit details
    Browse the repository at this point in the history

Commits on Nov 30, 2024

  1. fix: change constants sequences to sets for faster lookup

    - the order does not seem to be significant in any usages - rather these
      variables are used for membership checks e.g. to validate input. In
      which case the set lookup is O(1) vs. O(N) which can be significant
      for large forms, when each question is checked against one or more of
      these collections, potentially more than once.
    lindsay-stevens committed Nov 30, 2024
    Configuration menu
    Copy the full SHA
    18c5649 View commit details
    Browse the repository at this point in the history
  2. fix: utils performance improvements

    - in writexml(), use `any()` to avoid evaluating the whole sequence
      since it only matters if there is 1 or more NODE_TYPE_TEXT.
    - in node()
      - use a set for faster lookup of blocked_attributes (it's 1 element
        but in case more are added it could stay O(1))
      - use a tuple for `unicode_args` for slightly less memory
      - remove unnecessary `iter()` on `kwargs.items()`
      - use one f-string for the parse string rather than concat 7 items
      - skip appendChild for `None` elements
    lindsay-stevens committed Nov 30, 2024
    Configuration menu
    Copy the full SHA
    f1c47f5 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    028f26e View commit details
    Browse the repository at this point in the history
  4. add: add missing test cases, other minor tests improvements

    - test_xform_conversion.py: remove unnecessary warnings arg
    - test_builder.py:
      - remove self.maxDiff=None since this is done at the class level
      - swap AssertEqual arg order since first arg referred to as the
        "expected" value in test failure messages.
    - test_choices_sheet.py: remove unnecessary position() assertion
    - test_fieldlist_labels.py: add appearance assertion since this does
      not otherwise seem to be tested for groups
    - test_fields.py: remove debug setting
    - test_groups.py: add test for group relevance (no other tests for this)
    - test_image_app_parameter.py: add assertion for appearance, since this
      does not otherwise seem to be tested for questions
    - test_survey.py: add test for autoplay setting
    - test_translations.py: fix typo in label translation header
    lindsay-stevens committed Nov 30, 2024
    Configuration menu
    Copy the full SHA
    ff99279 View commit details
    Browse the repository at this point in the history
  5. chg: expression parsing optimisation

    - usage of lru_cache on `parse_expression` helps performance but it
      seems to be a diminishing return for cache sizes > 128, and the
      memory used by the cached strings and ExpressionLexerTokens can become
      significant if there are lots of long strings being parsed
    - added option to get the parsed token type only, since calls through
      `is_single_token_expression` only care about the token type
    - for token type checks, ignore empty strings or strings that would be
      to short to be that token type
    lindsay-stevens committed Nov 30, 2024
    Configuration menu
    Copy the full SHA
    61c4df1 View commit details
    Browse the repository at this point in the history
  6. chg: performance and memory usage improvements

    - the main problem being addressed is that SurveyElement had a lot of
      fields which are not relevant to every subtype, and many of these
      fields are containers, which wastes time and memory.
      - For example an Option (which represents a choice) would get about
        60 attributes (now ~8) - most of which empty strings but many are
        dicts or lists. For large forms this can really stack up and consume
        a lot of memory. A lot of code was not tolerant of None as a default
        value so all the fields were initialised.
      - The dynamic class approach would have made more sense in the earlier
        days of XLSForm when the columns and behaviour were changing rapidly
        and there were not many fields. But now there are more features, and
        specs, and extensive test suites that make it useful to use a more
        clearly typed class approach.
      - Having concrete types also makes development easier, by having
        missing attribute or type warnings. When looking at the git history,
        clearly it was also confusing about what is a field - for example
        "jr:count" (repeat_count) is part of the bind dict, not a top-level
        attribute of a Section.
      - The changes in this commit could be taken further, for example:
        - some SurveyElement funcs relate to certain types and could be
          moved to those types, or a mixin / intermediate subclass. i.e.
          a fair bit of awkward "hasattr" or "ininstance" remains.
        - the builder.py module externalises the initialisation of the
          SurveyElements but perhaps this could be done by SurveyElements
    - other issues addressed:
      - avoid copying dicts unnecessarily
      - use __slots__ on all SurveyElement types to reduce memory usage
      - do "or other" question/choice insertion in xls2xform instead of the
        builder module, so that row-specific errors can be shown, and the
        choice lists are finalised sooner
    - potential compatibility issues:
      - minimal tests fixes were required for these changes, so any issues
        would be more for projects using pyxform internal classes or APIs.
      - removed some SurveyElement fields that seem to not be used at all,
        and SurveyElement classes will not have all 60+ fields anymore.
      - any extra data passed in as kwargs that is not a declared attribute
        is put into a new `extra_data` dict, and is excluded from
        `to_json_dict` output. The `extra_data` gets used when specifying
        extra choices columns, to generate choices output for that data.
    lindsay-stevens committed Nov 30, 2024
    Configuration menu
    Copy the full SHA
    6918b40 View commit details
    Browse the repository at this point in the history