More performance improvements #743

lindsay-stevens · 2024-12-10T00:53:49Z

A few more general ideas following #740

Why is this the best possible solution? Were any other approaches considered?

See commit messages for details of each change. Overall a modest but worthwhile improvement. I think getting much more will require addressing inefficiencies in XML generation, which may require overriding more of minidom's Element class and perhaps Attr as well. Also, the relative repeats recursion issue remains.

What are the regression risks?

Should be minimal risk; the only test changes were to update the performance test stats.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

included test cases for core behavior and edge cases in tests
run python -m unittest and verified all tests pass
run ruff format pyxform tests and ruff check pyxform tests to lint code
verified that any code or assets from external sources are properly credited in comments

- iter_descendants changes:
  - in the base class, handle the generic case of no children
  - in section.py, always iterate into children (questions/groups)
  - in question.py, iterate into children only if new kwarg calls for it (actually not used right now but left it if it's needed).
- survey.py
  - in setup_media(), invert `condition` as the positive list is shorter
  - in setup_xpath_dict(), store the element name to avoid extra lookup in tight loop (all questions/sections)
  - update self._xpath annotation to match condition filter
- xls2json.py
  - more efficient merge_dicts()
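For readers skimming the thread, here is a rough sketch of the iter_descendants split described above. It is not the pyxform code: the class bodies are simplified stand-ins, and the kwarg name `iter_into_choices` is a hypothetical placeholder for the "new kwarg" mentioned in the commit message.

```python
from collections.abc import Iterator


class SurveyElement:
    """Stand-in base class: an element with no children."""

    def __init__(self, name: str) -> None:
        self.name = name

    def iter_descendants(
        self, iter_into_choices: bool = False
    ) -> Iterator["SurveyElement"]:
        # Generic case: nothing below this element, so yield only itself.
        yield self


class Section(SurveyElement):
    """Stand-in for a group/repeat: always walks into its children."""

    def __init__(self, name: str, children=()) -> None:
        super().__init__(name)
        self.children = list(children)

    def iter_descendants(self, iter_into_choices: bool = False):
        yield self
        for child in self.children:
            yield from child.iter_descendants(iter_into_choices)


class MultipleChoice(SurveyElement):
    """Stand-in for a select question: walks into choices only on request."""

    def __init__(self, name: str, choices=()) -> None:
        super().__init__(name)
        self.choices = list(choices)

    def iter_descendants(self, iter_into_choices: bool = False):
        yield self
        if iter_into_choices:  # hypothetical kwarg name
            for choice in self.choices:
                yield from choice.iter_descendants(iter_into_choices)
```

With this shape, the common traversal (all questions and sections) never touches choice lists unless the caller opts in.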

- detect and re-use tuple[Option] rather than creating a new tuple containing those same Options by default, otherwise create Options from an Iterable[dict] as was done before.
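A minimal sketch of the re-use idea, assuming a simplified `Option` dataclass and a hypothetical helper name `coerce_options`; the actual detection logic in the commit may differ (for example, it may not check every element).

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """Simplified stand-in for pyxform's Option class."""

    name: str
    label: str


def coerce_options(choices: Iterable) -> tuple[Option, ...]:
    """Hypothetical helper: return a tuple of Option objects."""
    # Fast path: already a tuple of Options, so hand back the same object
    # instead of allocating a new tuple with identical contents.
    if isinstance(choices, tuple) and all(isinstance(c, Option) for c in choices):
        return choices
    # Fallback: build Options from dicts, as was done before.
    return tuple(c if isinstance(c, Option) else Option(**c) for c in choices)
```

Passing an existing `tuple[Option]` returns the identical object, so questions sharing a choice list can also share one tuple.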

- fewer string objects created = less memory used + faster
- also in utils.py, use _attrs dict directly instead of looking up the key repeatedly (_get_attributes returns a NamedNodeMap object which is pretty much the same thing).

- instance_expression.py
  - not possible to match "instance(" if the string is too short
  - skip `find_boundaries` func body if no tokens found
  - skip `replace_with_output` func body if no boundaries found
- utils.py
  - single-pass substitution of XML tokens with translation table
- survey.py
  - skip `insert_output_values` func body if placeholder "-" found
  - instead of creating XML node just to do string replacement, add func to do the same. Checks for replacement characters first since CPU cost of that seems less a problem vs CPU + memory of creating new strings (due to replacement), particularly if the majority of strings don't have any matches.
- updated perf test stats, so far since 2c878a3 about 10-20% faster, up to 10% less memory
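The translation-table and check-first ideas above can be sketched roughly as follows. The names `XML_ESCAPES` and `escape_text`, and the exact set of escaped characters, are assumptions for illustration rather than what utils.py or survey.py actually contain.

```python
# Hypothetical table: maps each special character to its XML entity.
XML_ESCAPES = str.maketrans(
    {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}
)
_SPECIAL = frozenset('&<>"')


def escape_text(value: str) -> str:
    """Escape XML special characters in a single pass."""
    # Check first: most strings contain nothing to replace, and returning
    # the original object avoids allocating a new string.
    if _SPECIAL.isdisjoint(value):
        return value
    # One pass over the string instead of chained .replace() calls, each of
    # which would create an intermediate string.
    return value.translate(XML_ESCAPES)
```

The membership check costs a scan of the string, but as noted in the commit message that tends to be cheaper than unconditionally building new strings when most inputs have no matches.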

- classes have attributes defined now so no need to use them like dicts
- provides better warnings / navigation / refactoring via IDE
- variable impact on performance (some slightly better, some worse)
- also converted a few more string concatenations to f-strings


- when a "compact_tag" is specified in an XLSForm, this is translated by xls2json.py processing into "instance::odk:tag" (per aliases.py), which is then stored as Question.instance e.g. `{"odk:tag": "value"}`. This info is then used by Question.xml_instance() to output these data as attributes on the instance element e.g. `<name @ODK:tag="value"/>`

- expression.py
  - remove is_single_token_expression and instead look for the token type of interest only. This requires making the lexer rules available outside of get_expression_lexer() and unfortunately compiling an extra time since the Scanner combines all regexes into one before compiling (so you can't give it pre-compiled regex).
  - change PYXFORM_REF regex to specifically look for optional `last-saved#` prefix instead of any ncname before a `#`. Not sure why it was like that but `last-saved` is quicker than ncname regex.
- pyxform_reference.py
  - add some sanity checks on input `value` to avoid more expensive parsing in the remaining function body
- survey.py
  - in _generate_last_saved_instance, combine the check performed on default/choice_filter/bind(items) into a func (in expression.py)
  - refactor check of bind items to more efficient method (per comment).
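As a rough illustration of the regex change and the sanity checks, here is a sketch assuming a simplified ncname pattern and a hypothetical helper `find_refs`; the real PYXFORM_REF pattern and the checks in pyxform_reference.py are not reproduced here.

```python
import re

# Simplified ncname pattern, for illustration only.
NCNAME = r"[a-zA-Z_][a-zA-Z0-9_.\-]*"

# Literal optional prefix instead of "any ncname followed by #".
PYXFORM_REF = re.compile(r"\$\{(last-saved#)?(" + NCNAME + r")\}")


def find_refs(value: str) -> list[tuple[bool, str]]:
    """Hypothetical helper: return (is_last_saved, name) per reference."""
    # Cheap sanity checks: the shortest possible reference is "${a}", and
    # without "${" there is nothing for the regex to find.
    if len(value) < 4 or "${" not in value:
        return []
    return [(bool(m.group(1)), m.group(2)) for m in PYXFORM_REF.finditer(value)]
```

For example, `find_refs("${last-saved#age} + ${b}")` reports one last-saved reference to `age` and one plain reference to `b`, while a string with no `${` returns early without running the regex.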

- previously just took RSS value at end of last "convert" but this can be affected by garbage collection mid-run, so now the RSS is polled after the run and the largest value is printed with average timings.
- expression.py: `match` seems a bit faster than `fullmatch`

- test provides internal dict structure directly but provides a non-string input for "default" whereas normally these values are provided from XLS/X or MD as strings.
- the failure is at the function `has_last_saved` which looks for the length of a value but that is a TypeError for `int`
- probably there are other places that would expect strings as well.
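To make the failure mode concrete, here is a small sketch. Both function bodies are hypothetical (the real `has_last_saved` is not shown in this thread), and the guarded variant is only one possible way to tolerate non-string input, not necessarily what this PR does.

```python
PREFIX = "last-saved#"


def has_last_saved_unguarded(value) -> bool:
    # Hypothetical body: a length check before the substring search.
    # len() raises TypeError when value is an int such as 5.
    return len(value) > len(PREFIX) and PREFIX in value


def has_last_saved_guarded(value) -> bool:
    # One possible guard: non-strings cannot contain a reference.
    return isinstance(value, str) and has_last_saved_unguarded(value)
```

Calling `has_last_saved_unguarded(5)` raises `TypeError: object of type 'int' has no len()`, which matches the failure described above.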

lognaturel · 2024-12-19T00:07:37Z

pyxform/question.py

@@ -391,6 +391,22 @@ def validate(self):
            for child in self.children:
                child.validate()

+    def iter_descendants(


Maybe this implementation could go in some sort of shared place? Everyone's favorite, a util file? I'm worried about these identical implementations drifting apart...

Agree 3x copy/pastas = probably time for a shared func. I didn't do that here because of the points noted in #745. In which case the only copy of this method override for iter_descendants would be MultipleChoice.

lognaturel

Nice! 🏃‍♀️

lindsay-stevens added 5 commits December 10, 2024 11:42

fix: improve __init__ performance for MultipleChoice and OSM Tag

6eba8dd


lindsay-stevens marked this pull request as ready for review December 10, 2024 00:53

lindsay-stevens requested a review from lognaturel December 10, 2024 00:56

lindsay-stevens added 4 commits December 18, 2024 23:58

lognaturel reviewed Dec 19, 2024


lognaturel approved these changes Dec 19, 2024


lindsay-stevens merged commit 854ffe6 into XLSForm:master Dec 24, 2024
10 checks passed
