-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix type value for string variables #737
Conversation
LOL at this line. |
Breaking in Python 2 with
Awwww. I suggest |
if self.value_type == str: | ||
self.max_length = self.set(attr, 'max_length', allowed_type = int) | ||
if self.max_length: | ||
self.dtype = '|S{}'.format(self.max_length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fpagnoux For a value exceeding
max_length
, this line reduced the string tomax_length
. If it was done on purpose, let me know as this behaviour is removed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the point was to make sure a given variable would be limited to n characters, to reduce the memory usage of the variable(typically, a zip code).
Whether this was a good idea or not can be discussed, but removing this along the way in a bug fix PR is problematic IMO. After this PR, the max_length
variable attribute is totally useless, yet it hasn't been deprecated, and is still documented. This is not a stable situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're reversing this, as it turns out to result in a regression undetected by existing unit tests.
As I recall, @sandcha and I were pairing at the time and we had a miscommunication, I assumed truncation to max_length would still be happening elsewhere as a result of line 168, I had not understood we were eliminating the behaviour.
The truncation behaviour, however, was (still is) untested and implicit. Fixing this properly entails at the very least documenting the truncation behaviour.
@sandcha suggests emitting a truncation warning at execution; I think that does not achieve the purpose of avoiding nasty surprises, because whoever is coding an input variable won't exert control on the actual inputs specified, and whoever actually specifies an overlong input is unlikely to see debug logs from the computation engine.
My preferred solution would be to rename max_length to something like truncate_to_length, to make the behaviour explicit; and a unit test for the truncation behaviour would be ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I'm just in favour of giving the information to the user somewhere like in a unit test explaining the expected behaviour. A PR is coming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The truncation behaviour, however, was (still is) untested and implicit. Fixing this properly entails at the very least documenting the truncation behaviour.
👍
My preferred solution would be to rename max_length to something like truncate_to_length
👎 Attribute names of a model are preferably nouns. variable.truncate_to_length
sounds like a function
, because truncate
is a verb
make the behaviour explicit; and a unit test for the truncation behaviour would be ideal
👍 👍 👍
I think the name's not problematic, just that there is no test for something that's supposed to be a feature. Adding a test would be enough IMHO.
CHANGELOG.md
Outdated
|
||
- Fix a bug on encoding for OpenFisca variables of string type | ||
- Details: | ||
* Set variable value encoding to `numpy.unicode_` in python 2.7 and python 3.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an explanation of the situation would be very useful for other contributors.
tests/core/test_variables.py
Outdated
|
||
simulation = new_simulation(tax_benefit_system, month) | ||
variable_value = simulation.calculate('variable__str_with_max', month)[0] | ||
assert_equal(unicode_, type(variable_value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, asserts works like this:
assert actual_value == desired_value
assert_equal(actual_value, desired_value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change it quickly to merge this PR
4d97de0
to
f893c91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this has already been merged, I don't think this PR brings the codebase to a consistent and acceptable state.
## 24.5.3 [#734](https://github.com/openfisca/openfisca-core/pull/734) | ||
### 24.5.3 [#737](https://github.com/openfisca/openfisca-core/pull/737) | ||
|
||
- Fix an encoding bug of variables with value type str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm coming late to the party, but I think a more human-readable and accurate description for this PR would be:
- Use unicode for all string variables
- Deprecate
max_length
variable attribute
which I think would have deserve maybe a little more thoughts and time.
For information, there is one drawback to using unicodes: they takes 4 times for space in memories than bytes. What are we getting in exchange? So far, we only use string variables to store zip codes...
I'm not against quick-merge for bug fix PR, but that's only okay if the PR has a limited scope and targets the bug, which is not the case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, the fix went beyond what was strictly necessary.
I'm not sure the "thoughts and time" remark is quite justified though, we have systemic contributors to the incident (lack of unit tests or documentation covering this behaviour, Python 2/3 continuing hassles), and both @maukoquiroga and I paired with @sandcha independently so I don't see this as a lack of individual attention or rushing.
I'm not sure I understand your remark on Unicode, the Python 3 default encoding is UTF8 (and this carries over to Numpy via the "U" dtypes according to the docs) and while I admit to low expertise in string encodings my understanding is that this "variable-width" encoding does not necessarily use four bytes per character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @fpagnoux.
Use unicode for all string variables
That's definitely better, although I didn't know "string" was a Python concept so that's why I didn't use it here.
Deprecate max_length variable attribute
Didn't know we were deprecating it, nor that there was a truncation feature.
Which I think would have deserve maybe a little more thoughts and time.
We were three to work on this specific issue, so more brains weren't available, and invested time was consequent.
Doing the math, we invested ballpark 4h * 3 = 12h of 75% of the team's brains, which is basically around ~10% of our sprint allocated time. I have to agree with @Morendil here:
we have systemic contributors to the incident (lack of unit tests or documentation covering this behaviour, Python 2/3 continuing hassles)
However, I assume the responsibility as I reviewed, cleaned up, rebased, renamed, fixed up and merged.
That being said, I don't know exactly what could I have done to avoid the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant by time was more absolute time than brain time:
- A review is requested at D0, 2 days ago. A comment (where I'm mentioned) clearly seem to indicate that the PR openers don't fully understand the purpose of the code they are removing.
- At D0+1, the PR is merged, with this comment still unanswered.
- Later that day, I check the PR, following the mention. I'm surprised by the wide scope of the PR, and then realize France broke.
- At D0+2, we revert everything
The pity is that we were very close from avoiding the incident. Delaying the merge a day, until @sandcha's comment was answered, would have been enough 😬.
Having said that, misevaluating the scope of a PR happens. Now we can just:
- extend unit tests to making it less likely
- keep in mind than reverting/unpublishing is the best option it done quickly when it happens 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand your remark on Unicode, the Python 3 default encoding is UTF8 (and this carries over to Numpy via the "U" dtypes according to the docs) and while I admit to low expertise in string encodings my understanding is that this "variable-width" encoding does not necessarily use four bytes per character.
To illustrate my remark:
import numpy as np
postal_codes = ['75012', '75015', '75007']
bytes_array = np.asarray(postal_codes, dtype=np.bytes_)
unicode_array = np.asarray(postal_codes, dtype=np.unicode_)
print(unicode_array.nbytes)
>>> 60 # 3 items x 20 bytes
print(bytes_array.nbytes)
>>> 15 # 3 items x 5 bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably open a separate issues to discuss all the issues that we have identified.
- Should all string variables be unicode, for convenience?
- Should all string variables be bytes, for memory optimisation?
- Should we allow both? Sounds nice, but is there really a demand for unicode variables?
- Is the "max_length" field really useful? Should it throw errors instead of silently shortening user inputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the plot thickens. This is apparently not documented but here's an interesting discussion pointing out that Numpy's "U" dtype uses UTF-32 internally, rather than UTF-8 as I'd inferred from the Numpy docs.
The thread points to an additional option:
>>> object_array = np.asarray(postal_codes, dtype="O")
>>> object_array.nbytes
24
>>> type(object_array[0])
<class 'str'>
Still some overhead, but limited.
Should all string variables be unicode, for convenience?
The alternative is that as soon as someone tries to use value_type = str
for something - anything - that can conceivably contain even one accented character, they're getting set up for a nasty surprise. I think that goes beyond convenience, it's an egregious violation of the principle of least surprise.
If it's important for performance reasons, then one solution is to encode strings internally as byte arrays before storing them into Numpy arrays and decode them back into proper strings whenever their values is needed. To do that, we'd need to hide Numpy arrays (an implementation detail) behind the Variable/Holder (I'm not sure which) interface protocol. This is possibly a good idea (on information hiding principles) irrespective of the current issue, even if it would admittedly involve a lot of busywork.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delaying the merge a day, until @sandcha's comment was answered, would have been enough 😬.
Can't disagree, but we are analysing this now with the benefit of hindsight. In Allspaw's words "the action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place".
There are any number of additional counterfactual scenarios where merging the PR wouldn't have resulted in a meltdown, including "if we'd had more tests on string-valued variables" and "if we'd limited the fix to the least necessary change" or "if the people pairing on the fix had been more sensitive about memory usage".
if self.value_type == str: | ||
self.max_length = self.set(attr, 'max_length', allowed_type = int) | ||
if self.max_length: | ||
self.dtype = '|S{}'.format(self.max_length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the point was to make sure a given variable would be limited to n characters, to reduce the memory usage of the variable(typically, a zip code).
Whether this was a good idea or not can be discussed, but removing this along the way in a bug fix PR is problematic IMO. After this PR, the max_length
variable attribute is totally useless, yet it hasn't been deprecated, and is still documented. This is not a stable situation.
I'm also under the impression that this PR make France calculations crash. https://fr.openfisca.org/legislation/swagger currently works (both I'd therefore recommend to revert this PR. |
I can see France builds passing based on Core 24.5.4 (with the max-length condition restored). (ETA: ugh, I spoke too soon. A side-effect of #737 is that openfisca/openfisca-france#1160 is required to make France tests go back to green. The builds that are passing are precisely the builds on that PR. The others are not passing.) |
Fix #738 restores |
Relates to #701
Bug detected by sending
https://fr.openfisca.org/legislation/swagger
example for/trace
endpoint to local web api. The web api answer was:{"error":"Internal server error: Object of type bytes is not JSON serializable"}
Technical changes
str
corresponds tounicode
in Python 2.7 andbytes
in Python 3.7numpy.unicode_
to dynamically cast tounicode
both in Python 2.7 and Python 3.7Note:
A numpy array containing values for a variable with
value_type = str
, automatically converts the values to:numpy.str_
in Python 2.7numpy.bytes_
in Python 3.7In this PR, we update such variables
dtype
inopenfisca_core/variables.py
and this dtype gives:numpy.bytes_
for|S
dtype valuenumpy.unicode_
for|U
orunicode_
dtype value