Fix type value for string variables #737

sandcha · 2018-10-09T14:47:21Z

Relates to #701

Bug detected by sending the https://fr.openfisca.org/legislation/swagger example for the /trace endpoint to a local web API. The web API's answer was:
{"error":"Internal server error: Object of type bytes is not JSON serializable"}
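The error can be reproduced outside the web API. The following is a minimal sketch, assuming Python 3 with NumPy installed; the sample value is illustrative and this is not the actual /trace code path. It suggests that values read from a byte-string (`|S`) array cannot be passed to `json.dumps`, while values read from a unicode (`|U`) array can.

```python
import json

import numpy as np

# Illustrative value only; the real payload comes from the /trace endpoint.
bytes_values = np.array(['75012'], dtype='|S5')
unicode_values = np.array(['75012'], dtype='|U5')

try:
    json.dumps(bytes_values.tolist())  # tolist() yields Python bytes objects
    error_message = None
except TypeError as error:
    error_message = str(error)

# Unicode values become plain Python str objects, which JSON accepts.
serialized = json.dumps(unicode_values.tolist())

print(error_message)
print(serialized)
```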

Technical changes

Fix an encoding bug of variables with value type str
Details:
- str is a byte string in Python 2.7 and a unicode string in Python 3.7
- Set encoding with numpy.unicode_ to dynamically cast to unicode both in Python 2.7 and Python 3.7

Note:

A numpy array containing values for a variable with value_type = str automatically converts the values to:

numpy.str_ in Python 2.7
numpy.bytes_ in Python 3.7

In this PR, we update the dtype of such variables in openfisca_core/variables.py, and this dtype gives:

numpy.bytes_ for |S dtype value
numpy.unicode_ for |U or unicode_ dtype value
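As a quick check of the mapping above, here is a sketch assuming Python 3 and a NumPy version where `numpy.str_` is the unicode scalar type (the `numpy.unicode_` name used in this PR was an alias of it and was removed in NumPy 2.0). The sample value is illustrative.

```python
import numpy as np

bytes_values = np.array(['75012'], dtype='|S5')
unicode_values = np.array(['75012'], dtype='|U5')

# dtype.kind is 'S' for byte strings and 'U' for unicode strings.
kinds = (bytes_values.dtype.kind, unicode_values.dtype.kind)

# The scalar type of each element follows the dtype of its array.
scalar_types = (type(bytes_values[0]), type(unicode_values[0]))

print(kinds)
print(scalar_types)
```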

Morendil · 2018-10-09T21:15:50Z

LOL at this line.

Morendil · 2018-10-09T21:20:00Z

Breaking in Python 2 with

AssertionError: <type 'numpy.string_'> != <type 'numpy.unicode_'>

Awwww. I suggest foo.type in [numpy.str_, numpy.unicode_] or something like that. All we care about is getting a string, doesn't matter what kind of string.
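A sketch of the kind of check suggested here, written for Python 3 where the two NumPy string scalar types are `numpy.bytes_` and `numpy.str_` (on Python 2 the names in the assertion above would apply instead). `is_string_scalar` is a hypothetical helper, not part of OpenFisca. Note that accepting both types would not catch the bytes value behind the JSON error reported in this PR.

```python
import numpy as np

# Either NumPy string scalar type is accepted.
STRING_SCALAR_TYPES = (np.bytes_, np.str_)


def is_string_scalar(value):
    """Hypothetical helper: True for any NumPy string scalar, bytes or unicode."""
    return isinstance(value, STRING_SCALAR_TYPES)


bytes_item = np.array(['a'], dtype='|S1')[0]
unicode_item = np.array(['a'], dtype='|U1')[0]
print(is_string_scalar(bytes_item), is_string_scalar(unicode_item))
```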

sandcha · 2018-10-10T17:18:46Z

openfisca_core/variables.py

        if self.value_type == str:
            self.max_length = self.set(attr, 'max_length', allowed_type = int)
-            if self.max_length:
-                self.dtype = '|S{}'.format(self.max_length)


@fpagnoux For a value exceeding max_length, this line reduced the string to max_length. If it was done on purpose, let me know as this behaviour is removed here.
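To make the removed behaviour concrete, here is a minimal sketch assuming Python 3 and NumPy. It builds the dtype with the same format string as the deleted line; the `max_length` value and the inputs are made up for illustration.

```python
import numpy as np

max_length = 5  # illustrative value
dtype = '|S{}'.format(max_length)  # same format string as the removed line

# The first input is longer than max_length, the second is shorter.
values = np.array(['75012-extra', '75'], dtype=dtype)

# NumPy silently keeps only the first max_length bytes of each item.
print(values.tolist())
```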

I believe the point was to make sure a given variable would be limited to n characters, to reduce the memory usage of the variable (typically, a zip code).

Whether this was a good idea or not can be discussed, but removing this along the way in a bug fix PR is problematic IMO. After this PR, the max_length variable attribute is totally useless, yet it hasn't been deprecated, and is still documented. This is not a stable situation.

We're reversing this, as it turns out to result in a regression undetected by existing unit tests.

As I recall, @sandcha and I were pairing at the time and we had a miscommunication: I assumed truncation to max_length would still be happening elsewhere as a result of line 168, and had not understood we were eliminating the behaviour.

The truncation behaviour, however, was (still is) untested and implicit. Fixing this properly entails at the very least documenting the truncation behaviour.

@sandcha suggests emitting a truncation warning at execution; I think that does not achieve the purpose of avoiding nasty surprises, because whoever is coding an input variable won't exert control on the actual inputs specified, and whoever actually specifies an overlong input is unlikely to see debug logs from the computation engine.

My preferred solution would be to rename max_length to something like truncate_to_length, to make the behaviour explicit; and a unit test for the truncation behaviour would be ideal.
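For reference, the truncation warning discussed above could look roughly like the sketch below, together with the kind of check a unit test for the truncation behaviour might make. `truncate_with_warning` is a hypothetical helper, not an OpenFisca function, and it uses a unicode dtype as an assumption.

```python
import warnings

import numpy as np


def truncate_with_warning(values, max_length):
    """Hypothetical helper: truncate to max_length, warning when data is lost."""
    array = np.asarray(values, dtype=np.str_)
    longest = int(np.char.str_len(array).max()) if array.size else 0
    if longest > max_length:
        warnings.warn(
            'Some values exceed max_length={} and were truncated'.format(max_length)
        )
    # Casting to a shorter unicode dtype drops the extra characters.
    return array.astype('<U{}'.format(max_length))


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = truncate_with_warning(['75012-extra', '75'], 5)

print(result.tolist(), len(caught))
```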

Well, I'm just in favour of giving the information to the user somewhere like in a unit test explaining the expected behaviour. A PR is coming.

The truncation behaviour, however, was (still is) untested and implicit. Fixing this properly entails at the very least documenting the truncation behaviour.

👍

My preferred solution would be to rename max_length to something like truncate_to_length

👎 Attribute names of a model are preferably nouns. variable.truncate_to_length sounds like a function, because truncate is a verb

make the behaviour explicit; and a unit test for the truncation behaviour would be ideal

👍 👍 👍

I think the name's not problematic, just that there is no test for something that's supposed to be a feature. Adding a test would be enough IMHO.

bonjourmauko · 2018-10-11T19:03:04Z

CHANGELOG.md

+
+- Fix a bug on encoding for OpenFisca variables of string type
+- Details:
+  * Set variable value encoding to `numpy.unicode_` in python 2.7 and python 3.7


I think an explanation of the situation would be very useful for other contributors.

bonjourmauko · 2018-10-11T19:03:10Z

tests/core/test_variables.py

+
+    simulation = new_simulation(tax_benefit_system, month)
+    variable_value = simulation.calculate('variable__str_with_max', month)[0]
+    assert_equal(unicode_, type(variable_value))


Usually, asserts work like this:

    assert actual_value == desired_value
    assert_equal(actual_value, desired_value)

I'll change it quickly to merge this PR

fpagnoux

While this has already been merged, I don't think this PR brings the codebase to a consistent and acceptable state.

fpagnoux · 2018-10-11T20:53:36Z

CHANGELOG.md

-## 24.5.3 [#734](https://github.com/openfisca/openfisca-core/pull/734)
+### 24.5.3 [#737](https://github.com/openfisca/openfisca-core/pull/737)
+
+- Fix an encoding bug of variables with value type str


I'm coming late to the party, but I think a more human-readable and accurate description for this PR would be:

Use unicode for all string variables

Deprecate max_length variable attribute

which I think would have deserved a little more thought and time.

For information, there is one drawback to using unicode strings: they take 4 times more space in memory than bytes. What are we getting in exchange? So far, we only use string variables to store zip codes...

I'm not against quick-merge for bug fix PR, but that's only okay if the PR has a limited scope and targets the bug, which is not the case here.

Agreed, the fix went beyond what was strictly necessary.

I'm not sure the "thoughts and time" remark is quite justified though, we have systemic contributors to the incident (lack of unit tests or documentation covering this behaviour, Python 2/3 continuing hassles), and both @maukoquiroga and I paired with @sandcha independently so I don't see this as a lack of individual attention or rushing.

I'm not sure I understand your remark on Unicode, the Python 3 default encoding is UTF8 (and this carries over to Numpy via the "U" dtypes according to the docs) and while I admit to low expertise in string encodings my understanding is that this "variable-width" encoding does not necessarily use four bytes per character.

Hi @fpagnoux.

Use unicode for all string variables

That's definitely better, although I didn't know "string" was a Python concept so that's why I didn't use it here.

Deprecate max_length variable attribute

Didn't know we were deprecating it, nor that there was a truncation feature.

Which I think would have deserved a little more thought and time.

There were three of us working on this specific issue, so more brains weren't available, and the time invested was substantial.

Doing the math, we invested ballpark 4h * 3 = 12h of 75% of the team's brains, which is basically around ~10% of our sprint allocated time. I have to agree with @Morendil here:

we have systemic contributors to the incident (lack of unit tests or documentation covering this behaviour, Python 2/3 continuing hassles)

However, I assume the responsibility as I reviewed, cleaned up, rebased, renamed, fixed up and merged.

That being said, I don't know exactly what I could have done to avoid the situation.

What I meant by time was more absolute time than brain time:

A review is requested at D0, 2 days ago. A comment (where I'm mentioned) clearly seems to indicate that the PR openers don't fully understand the purpose of the code they are removing.

At D0+1, the PR is merged, with this comment still unanswered.

Later that day, I check the PR, following the mention. I'm surprised by the wide scope of the PR, and then realize France broke.

At D0+2, we revert everything

The pity is that we were very close to avoiding the incident. Delaying the merge a day, until @sandcha's comment was answered, would have been enough 😬.

Having said that, misevaluating the scope of a PR happens. Now we can just:

extend unit tests to make it less likely

keep in mind that reverting/unpublishing is the best option, if done quickly, when it happens 🙂

I'm not sure I understand your remark on Unicode, the Python 3 default encoding is UTF8 (and this carries over to Numpy via the "U" dtypes according to the docs) and while I admit to low expertise in string encodings my understanding is that this "variable-width" encoding does not necessarily use four bytes per character.

To illustrate my remark:

    import numpy as np

    postal_codes = ['75012', '75015', '75007']
    bytes_array = np.asarray(postal_codes, dtype=np.bytes_)
    unicode_array = np.asarray(postal_codes, dtype=np.unicode_)

    print(unicode_array.nbytes)  # 60: 3 items x 20 bytes
    print(bytes_array.nbytes)    # 15: 3 items x 5 bytes

We should probably open a separate issue to discuss all the questions that we have identified.

Should all string variables be unicode, for convenience?

Should all string variables be bytes, for memory optimisation?

Should we allow both? Sounds nice, but is there really a demand for unicode variables?

Is the "max_length" field really useful? Should it throw errors instead of silently shortening user inputs?
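To make the last question concrete, here is a sketch of the "throw errors" alternative, assuming Python 3 and NumPy. `check_max_length` is a hypothetical validator, not an existing OpenFisca function; it rejects overlong inputs instead of shortening them.

```python
import numpy as np


def check_max_length(values, max_length):
    """Hypothetical validator: raise on overlong inputs instead of truncating."""
    array = np.asarray(values, dtype=np.str_)
    too_long = [str(value) for value in array if len(value) > max_length]
    if too_long:
        raise ValueError(
            'Values longer than {} characters: {}'.format(max_length, too_long)
        )
    return array


accepted = check_max_length(['75012', '75'], 5)

try:
    check_max_length(['75012-extra'], 5)
    rejected = False
except ValueError:
    rejected = True

print(accepted.tolist(), rejected)
```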

Ah, the plot thickens. This is apparently not documented but here's an interesting discussion pointing out that Numpy's "U" dtype uses UTF-32 internally, rather than UTF-8 as I'd inferred from the Numpy docs.
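The four-bytes-per-character figure can be checked directly. This is a small sketch assuming Python 3 and NumPy; it compares the item size of the "U" and "S" dtypes with the lengths of the UTF-32 and UTF-8 encodings of a sample zip code.

```python
import numpy as np

code = '75012'  # sample value, 5 ASCII characters

unicode_itemsize = np.dtype('U5').itemsize  # bytes per item for 5 characters
bytes_itemsize = np.dtype('S5').itemsize

# 'utf-32-le' is used so that no byte-order mark is added to the length.
utf32_length = len(code.encode('utf-32-le'))
utf8_length = len(code.encode('utf-8'))

print(unicode_itemsize, utf32_length)
print(bytes_itemsize, utf8_length)
```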

The thread points to an additional option:

    >>> object_array = np.asarray(postal_codes, dtype="O")
    >>> object_array.nbytes
    24
    >>> type(object_array[0])
    <class 'str'>

Still some overhead, but limited.
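One caveat worth checking before relying on that figure: for an object array, `nbytes` counts only the references held in the array, not the Python `str` objects they point to. The sketch below assumes CPython, where `sys.getsizeof` reports the size of each string object; exact sizes vary by platform and version.

```python
import sys

import numpy as np

postal_codes = ['75012', '75015', '75007']
object_array = np.asarray(postal_codes, dtype='O')

# Size of the references stored in the array itself.
pointer_bytes = object_array.nbytes

# Size of the str objects referenced by the array (CPython-specific measure).
string_bytes = sum(sys.getsizeof(code) for code in object_array)

print(pointer_bytes, string_bytes)
```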

Should all string variables be unicode, for convenience?

The alternative is that as soon as someone tries to use value_type = str for something - anything - that can conceivably contain even one accented character, they're getting set up for a nasty surprise. I think that goes beyond convenience, it's an egregious violation of the principle of least surprise.

If it's important for performance reasons, then one solution is to encode strings internally as byte arrays before storing them into Numpy arrays and decode them back into proper strings whenever their values are needed. To do that, we'd need to hide Numpy arrays (an implementation detail) behind the Variable/Holder (I'm not sure which) interface protocol. This is possibly a good idea (on information hiding principles) irrespective of the current issue, even if it would admittedly involve a lot of busywork.
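A rough sketch of that encode/decode idea, assuming Python 3 and NumPy. `EncodedStringStore` is a hypothetical class invented for illustration; it is not the Variable or Holder interface, and the sample values are made up.

```python
import numpy as np


class EncodedStringStore:
    """Hypothetical wrapper: keeps UTF-8 bytes internally, exposes str values."""

    def __init__(self, values):
        unicode_array = np.asarray(values, dtype=np.str_)
        # Stored as a byte-string array, one UTF-8 encoded item per value.
        self._encoded = np.char.encode(unicode_array, 'utf-8')

    def get(self):
        # Decoded back to a unicode array whenever values are needed.
        return np.char.decode(self._encoded, 'utf-8')

    @property
    def nbytes(self):
        return self._encoded.nbytes


names = ['Évry', 'Paris']
store = EncodedStringStore(names)
unicode_nbytes = np.asarray(names, dtype=np.str_).nbytes

print(store.get().tolist(), store.nbytes, unicode_nbytes)
```

Accented characters survive the round trip, and the stored array is smaller than its unicode equivalent for this sample.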

Delaying the merge a day, until @sandcha's comment was answered, would have been enough 😬.

Can't disagree, but we are analysing this now with the benefit of hindsight. In Allspaw's words "the action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place".

There are any number of additional counterfactual scenarios where merging the PR wouldn't have resulted in a meltdown, including "if we'd had more tests on string-valued variables" and "if we'd limited the fix to the least necessary change" or "if the people pairing on the fix had been more sensitive about memory usage".

fpagnoux · 2018-10-11T21:06:30Z

openfisca_core/variables.py

        if self.value_type == str:
            self.max_length = self.set(attr, 'max_length', allowed_type = int)
-            if self.max_length:
-                self.dtype = '|S{}'.format(self.max_length)


I believe the point was to make sure a given variable would be limited to n characters, to reduce the memory usage of the variable (typically, a zip code).

Whether this was a good idea or not can be discussed, but removing this along the way in a bug fix PR is problematic IMO. After this PR, the max_length variable attribute is totally useless, yet it hasn't been deprecated, and is still documented. This is not a stable situation.

fpagnoux · 2018-10-11T21:49:30Z

I'm also under the impression that this PR makes France calculations crash.
When I try to run nosetests tests/test_basics.py in france, I get:
TypeError: startswith first arg must be str or a tuple of str, not numpy.bytes_

https://fr.openfisca.org/legislation/swagger currently works (both /calculate and /trace), but it's based on version 24.5.2, i.e. before this PR. I'm afraid it would crash if we upgraded Core to the current master.

I'd therefore recommend to revert this PR.
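The TypeError quoted above is the kind of failure raised when a `str` method receives a `numpy.bytes_` argument. The sketch below assumes Python 3 and NumPy; it is not the actual France code, which is not shown in this thread, and the value is illustrative.

```python
import numpy as np

# A numpy.bytes_ scalar, as produced by a byte-string ('|S') array.
postal_code = np.array(['75012'], dtype='|S5')[0]

try:
    '75012'.startswith(postal_code)  # str method given a bytes argument
    raised = False
except TypeError:
    raised = True

# Decoding first gives a str, which startswith accepts.
decoded = postal_code.decode('utf-8')
matches = '75012'.startswith(decoded)

print(raised, matches)
```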

Morendil · 2018-10-12T10:21:08Z

I can see France builds passing based on Core 24.5.4 (with the max-length condition restored).

(ETA: ugh, I spoke too soon. A side-effect of #737 is that openfisca/openfisca-france#1160 is required to make France tests go back to green. The builds that are passing are precisely the builds on that PR. The others are not passing.)

bonjourmauko · 2018-10-12T11:17:02Z

I'd therefore recommend to revert this PR.

Fix #738 restores max_length, but does not revert this PR's fix (it actually fixes the fix).

Test type value for str variable with max length

2291718

sandcha added the flow:team:doing label Oct 10, 2018

sandcha self-assigned this Oct 10, 2018

sandcha changed the title ~~[WIP] Fix type value for string variables~~ Fix type value for string variables Oct 10, 2018

sandcha requested a review from fpagnoux October 10, 2018 15:33

sandcha commented Oct 10, 2018

bonjourmauko self-assigned this Oct 11, 2018

bonjourmauko added kind:fix Bugs are defects and failure demand. flow:team:reviewing and removed flow:team:doing labels Oct 11, 2018

bonjourmauko reviewed Oct 11, 2018

bonjourmauko requested changes Oct 11, 2018

bonjourmauko approved these changes Oct 11, 2018

bonjourmauko force-pushed the fix-trace-str-var branch 4 times, most recently from 4d97de0 to f893c91 Compare October 11, 2018 19:36

sandcha and others added 11 commits October 11, 2018 21:36

Fix type value for str variable with max length in python 3

7826840

Update test_trace to check unicode in py 2 and py 3

ffda05e

Explicit test trace name on strvariables

3d82b45

Fix init on str variables with max length

52fccdc

Group init steps for enum variables

e45b8f2

Explicit dtype for str variables init

4af2928

Simplify simulation init in test trace

e7fabb2

Move test trace on str variables to variables test file

0b51a70

Fix import order

0a3ae9a

Fix CHANGELOG chapters format

b91f964

Bump version to 24.5.3

f893c91

bonjourmauko merged commit ed462ec into master Oct 11, 2018

bonjourmauko removed the flow:team:reviewing label Oct 11, 2018

bonjourmauko deleted the fix-trace-str-var branch October 11, 2018 19:39

fpagnoux reviewed Oct 11, 2018

sandcha mentioned this pull request Oct 12, 2018

Restore max_length shortening effect on variable of type str #738

Merged

sandcha mentioned this pull request Oct 12, 2018

Traite les codes postaux comme chaînes et non tableaux d'octets openfisca/openfisca-france#1160

Merged

7 tasks

sandcha mentioned this pull request Oct 12, 2018

Fix str variable test and default_value encoding #739

Closed

Morendil mentioned this pull request Oct 19, 2018

Use PyTest #746

Merged
