Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More performance improvements #743

Merged
merged 9 commits into from
Dec 24, 2024

Conversation

lindsay-stevens
Copy link
Contributor

A few more general ideas following #740

Why is this the best possible solution? Were any other approaches considered?

See commit messages for details of each change. Overall a modest but worthwhile improvement. I think to get much more will require addressing inefficiencies in XML generation, which may require overriding more of of minidom's Element class and perhaps Attr as well. Also the relative repeats recursion issue remains.

What are the regression risks?

Should be minimal risk, the only tests changes were to update the performance test stats.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyxform tests and ruff check pyxform tests to lint code
  • verified that any code or assets from external sources are properly credited in comments

- iter_descendants changes:
  - in the base class, handle the generic case of no children
  - in section.py, always iterate into children (questions/groups)
  - in question.py, iterate into children only if new kwarg calls for
    it (actually not used right now but left it if it's needed).
- survey.py
  - in setup_media(), invert `condition` as the positive list is shorter
  - in setup_xpath_dict(), store the element name to avoid extra lookup
    in tight loop (all questions/sections)
    - update self._xpath annotation to match condition filter
- xls2json.py
  - more efficient merge_dicts()
- detect and re-use tuple[Option] rather than creating a new tuple
  containing those same Options by default, otherwise create Options
  from an Iterable[dict] as was done before.
- fewer string objects created = less memory used + faster
- also in utils.py, use _attrs dict directly instead of looking up the
  key repeatedly (_get_attributes returns a NamedNodeMap object which
  is pretty much the same thing).
- instance_expression.py
  - not possible to match "instance(" if the string is too short
  - skip `find_boundaries` func body if no tokens found
  - skip `replace_with_output` func body if no boundaries found
- utils.py
  - single-pass substitution of XML tokens with translation table
- survey.py
  - skip `insert_output_values` func body if placeholder "-" found
  - instead of creating XML node just to do string replacement, add
    func to do the same. Checks for replacement characters first since
    CPU cost of that seems less a problem vs CPU + memory of creating
    new strings (due to replacement), particularly if the majority of
    strings don't have any matches.
- updated perf test stats, so far since 2c878a3 about 10-20% faster,
  up to 10% less memory
- classes have attributes defined now so no need to use them like dicts
- provides better warnings / navigation / refactoring via IDE
- variable impact on performance (some slightly better, some worse)
- also converted a few more string concatenations to f-strings
@lindsay-stevens lindsay-stevens marked this pull request as ready for review December 10, 2024 00:53
- when a "compact_tag" is specified in a XLSForm, this is translated by
  xls2json.py processing into "instance::odk:tag" (per aliases.py) which
  is then stored as Question.instance e.g. `{"odk:tag": "value"}`. This
  info then used by Question.xml_instance() to output these data as
  attributes on the instance element e.g. `<name @ODK:tag="value"/>`
- expression.py
  - remove is_single_token_expression and instead look for the token
    type of interest only. This requires making the lexer rules
    available outside of get_expression_lexer() and unfortunately
    compiling an extra time since the Scanner combines all regexes into
    one before compiling (so you can't give it pre-compiled regex).
  - change PYXFORM_REF regex to specifically look for optional
    `last-saved#` prefix instead of any ncname before a `#`. Not sure
    why it was like that but `last-saved` is quicker than ncname regex.
- pyxform_reference.py
  - add some sanity checks on input `value` to avoid more expensive
    parsing in the remaining function body
- survey.py
  - in _generate_last_saved_instance, combine the check performed on
    default/choice_filter/bind(items) into a func (in expression.py)
  - refactor check of bind items to more efficient method (per comment).
- previously just took RSS value at end of last "convert" but this can
  be affected by garbage collection mid-run, so now the RSS is polled
  after the run and the largest value is printed with average timings.
- expression.py: seems that `match` a bit faster than `fullmatch`
- test provides internal dict structure directly but provides a
  non-string input for "default" whereas normally these values are
  provided from XLS/X or MD as strings.
- the failure is at the function `has_last_saved` which looks for the
  length of a value but that is a TypeError for `int` - probably there
  are other places that would expect strings as well.
@@ -391,6 +391,22 @@ def validate(self):
for child in self.children:
child.validate()

def iter_descendants(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this implementation could go in some sort of shared place? Everyone's favorite, a util file? I'm worried about these identical implementations drifting apart...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree 3x copy/pastas = probably time for a shared func. I didn't do that here because of the points noted in #745. In which case the only copy of this method override for iter_descendants would be MultipleChoice.

Copy link
Contributor

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 🏃‍♀️

@lindsay-stevens lindsay-stevens merged commit 854ffe6 into XLSForm:master Dec 24, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants