Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix type value for string variables #737

Merged
merged 12 commits into from
Oct 11, 2018
Merged

Fix type value for string variables #737

merged 12 commits into from
Oct 11, 2018

Conversation

sandcha
Copy link
Collaborator

@sandcha sandcha commented Oct 9, 2018

Relates to #701

Bug detected by sending https://fr.openfisca.org/legislation/swagger example for /trace endpoint to local web api. The web api answer was:
{"error":"Internal server error: Object of type bytes is not JSON serializable"}


Technical changes

  • Fix an encoding bug of variables with value type str
  • Details:
    • str corresponds to unicode in Python 2.7 and bytes in Python 3.7
    • Set encoding with numpy.unicode_ to dynamically cast to unicode both in Python 2.7 and Python 3.7

Note:

A numpy array containing values for a variable with value_type = str, automatically converts the values to:

  • numpy.str_ in Python 2.7
  • numpy.bytes_ in Python 3.7

In this PR, we update such variables dtype in openfisca_core/variables.py and this dtype gives:

  • numpy.bytes_ for |S dtype value
  • numpy.unicode_ for |U or unicode_ dtype value

@Morendil
Copy link
Contributor

Morendil commented Oct 9, 2018

LOL at this line.

@Morendil
Copy link
Contributor

Morendil commented Oct 9, 2018

Breaking in Python 2 with

AssertionError: <type 'numpy.string_'> != <type 'numpy.unicode_'>

Awwww. I suggest foo.type in [numpy.str_, numpy.unicode_] or something like that. All we care about is getting a string, doesn't matter what kind of string.

@sandcha sandcha self-assigned this Oct 10, 2018
@sandcha sandcha changed the title [WIP] Fix type value for string variables Fix type value for string variables Oct 10, 2018
@sandcha sandcha requested a review from fpagnoux October 10, 2018 15:33
if self.value_type == str:
self.max_length = self.set(attr, 'max_length', allowed_type = int)
if self.max_length:
self.dtype = '|S{}'.format(self.max_length)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fpagnoux For a value exceeding max_length, this line reduced the string to max_length. If it was done on purpose, let me know as this behaviour is removed here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the point was to make sure a given variable would be limited to n characters, to reduce the memory usage of the variable(typically, a zip code).

Whether this was a good idea or not can be discussed, but removing this along the way in a bug fix PR is problematic IMO. After this PR, the max_length variable attribute is totally useless, yet it hasn't been deprecated, and is still documented. This is not a stable situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're reversing this, as it turns out to result in a regression undetected by existing unit tests.

As I recall, @sandcha and I were pairing at the time and we had a miscommunication, I assumed truncation to max_length would still be happening elsewhere as a result of line 168, I had not understood we were eliminating the behaviour.

The truncation behaviour, however, was (still is) untested and implicit. Fixing this properly entails at the very least documenting the truncation behaviour.

@sandcha suggests emitting a truncation warning at execution; I think that does not achieve the purpose of avoiding nasty surprises, because whoever is coding an input variable won't exert control on the actual inputs specified, and whoever actually specifies an overlong input is unlikely to see debug logs from the computation engine.

My preferred solution would be to rename max_length to something like truncate_to_length, to make the behaviour explicit; and a unit test for the truncation behaviour would be ideal.

Copy link
Collaborator Author

@sandcha sandcha Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'm just in favour of giving the information to the user somewhere like in a unit test explaining the expected behaviour. A PR is coming.

Copy link
Member

@bonjourmauko bonjourmauko Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The truncation behaviour, however, was (still is) untested and implicit. Fixing this properly entails at the very least documenting the truncation behaviour.

👍

My preferred solution would be to rename max_length to something like truncate_to_length

👎 Attribute names of a model are preferably nouns. variable.truncate_to_length sounds like a function, because truncate is a verb

make the behaviour explicit; and a unit test for the truncation behaviour would be ideal

👍 👍 👍

I think the name's not problematic, just that there is no test for something that's supposed to be a feature. Adding a test would be enough IMHO.

@bonjourmauko bonjourmauko self-assigned this Oct 11, 2018
@bonjourmauko bonjourmauko added kind:fix Bugs are defects and failure demand. flow:team:reviewing and removed flow:team:doing labels Oct 11, 2018
CHANGELOG.md Outdated

- Fix a bug on encoding for OpenFisca variables of string type
- Details:
* Set variable value encoding to `numpy.unicode_` in python 2.7 and python 3.7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an explanation of the situation would be very useful for other contributors.


simulation = new_simulation(tax_benefit_system, month)
variable_value = simulation.calculate('variable__str_with_max', month)[0]
assert_equal(unicode_, type(variable_value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, asserts works like this:

assert actual_value == desired_value
assert_equal(actual_value, desired_value)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it quickly to merge this PR

@bonjourmauko bonjourmauko force-pushed the fix-trace-str-var branch 4 times, most recently from 4d97de0 to f893c91 Compare October 11, 2018 19:36
@bonjourmauko bonjourmauko merged commit ed462ec into master Oct 11, 2018
@bonjourmauko bonjourmauko deleted the fix-trace-str-var branch October 11, 2018 19:39
Copy link
Member

@fpagnoux fpagnoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this has already been merged, I don't think this PR brings the codebase to a consistent and acceptable state.

## 24.5.3 [#734](https://github.com/openfisca/openfisca-core/pull/734)
### 24.5.3 [#737](https://github.com/openfisca/openfisca-core/pull/737)

- Fix an encoding bug of variables with value type str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm coming late to the party, but I think a more human-readable and accurate description for this PR would be:

  • Use unicode for all string variables
  • Deprecate max_length variable attribute

which I think would have deserve maybe a little more thoughts and time.

For information, there is one drawback to using unicodes: they takes 4 times for space in memories than bytes. What are we getting in exchange? So far, we only use string variables to store zip codes...

I'm not against quick-merge for bug fix PR, but that's only okay if the PR has a limited scope and targets the bug, which is not the case here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the fix went beyond what was strictly necessary.

I'm not sure the "thoughts and time" remark is quite justified though, we have systemic contributors to the incident (lack of unit tests or documentation covering this behaviour, Python 2/3 continuing hassles), and both @maukoquiroga and I paired with @sandcha independently so I don't see this as a lack of individual attention or rushing.

I'm not sure I understand your remark on Unicode, the Python 3 default encoding is UTF8 (and this carries over to Numpy via the "U" dtypes according to the docs) and while I admit to low expertise in string encodings my understanding is that this "variable-width" encoding does not necessarily use four bytes per character.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fpagnoux.

Use unicode for all string variables

That's definitely better, although I didn't know "string" was a Python concept so that's why I didn't use it here.

Deprecate max_length variable attribute

Didn't know we were deprecating it, nor that there was a truncation feature.

Which I think would have deserve maybe a little more thoughts and time.

We were three to work on this specific issue, so more brains weren't available, and invested time was consequent.

Doing the math, we invested ballpark 4h * 3 = 12h of 75% of the team's brains, which is basically around ~10% of our sprint allocated time. I have to agree with @Morendil here:

we have systemic contributors to the incident (lack of unit tests or documentation covering this behaviour, Python 2/3 continuing hassles)

However, I assume the responsibility as I reviewed, cleaned up, rebased, renamed, fixed up and merged.

That being said, I don't know exactly what could I have done to avoid the situation.

Copy link
Member

@fpagnoux fpagnoux Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant by time was more absolute time than brain time:

  • A review is requested at D0, 2 days ago. A comment (where I'm mentioned) clearly seem to indicate that the PR openers don't fully understand the purpose of the code they are removing.
  • At D0+1, the PR is merged, with this comment still unanswered.
  • Later that day, I check the PR, following the mention. I'm surprised by the wide scope of the PR, and then realize France broke.
  • At D0+2, we revert everything

The pity is that we were very close from avoiding the incident. Delaying the merge a day, until @sandcha's comment was answered, would have been enough 😬.

Having said that, misevaluating the scope of a PR happens. Now we can just:

  • extend unit tests to making it less likely
  • keep in mind than reverting/unpublishing is the best option it done quickly when it happens 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand your remark on Unicode, the Python 3 default encoding is UTF8 (and this carries over to Numpy via the "U" dtypes according to the docs) and while I admit to low expertise in string encodings my understanding is that this "variable-width" encoding does not necessarily use four bytes per character.

To illustrate my remark:

import numpy as np

postal_codes = ['75012', '75015', '75007']
bytes_array = np.asarray(postal_codes, dtype=np.bytes_)
unicode_array = np.asarray(postal_codes, dtype=np.unicode_)

print(unicode_array.nbytes)
>>> 60  # 3 items x  20 bytes

print(bytes_array.nbytes)
>>> 15 # 3 items x 5 bytes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably open a separate issues to discuss all the issues that we have identified.

  • Should all string variables be unicode, for convenience?
  • Should all string variables be bytes, for memory optimisation?
  • Should we allow both? Sounds nice, but is there really a demand for unicode variables?
  • Is the "max_length" field really useful? Should it throw errors instead of silently shortening user inputs?

Copy link
Contributor

@Morendil Morendil Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the plot thickens. This is apparently not documented but here's an interesting discussion pointing out that Numpy's "U" dtype uses UTF-32 internally, rather than UTF-8 as I'd inferred from the Numpy docs.

The thread points to an additional option:

>>> object_array = np.asarray(postal_codes, dtype="O")
>>> object_array.nbytes
24
>>> type(object_array[0])
<class 'str'>

Still some overhead, but limited.

Should all string variables be unicode, for convenience?

The alternative is that as soon as someone tries to use value_type = str for something - anything - that can conceivably contain even one accented character, they're getting set up for a nasty surprise. I think that goes beyond convenience, it's an egregious violation of the principle of least surprise.

If it's important for performance reasons, then one solution is to encode strings internally as byte arrays before storing them into Numpy arrays and decode them back into proper strings whenever their values is needed. To do that, we'd need to hide Numpy arrays (an implementation detail) behind the Variable/Holder (I'm not sure which) interface protocol. This is possibly a good idea (on information hiding principles) irrespective of the current issue, even if it would admittedly involve a lot of busywork.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delaying the merge a day, until @sandcha's comment was answered, would have been enough 😬.

Can't disagree, but we are analysing this now with the benefit of hindsight. In Allspaw's words "the action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place".

There are any number of additional counterfactual scenarios where merging the PR wouldn't have resulted in a meltdown, including "if we'd had more tests on string-valued variables" and "if we'd limited the fix to the least necessary change" or "if the people pairing on the fix had been more sensitive about memory usage".

if self.value_type == str:
self.max_length = self.set(attr, 'max_length', allowed_type = int)
if self.max_length:
self.dtype = '|S{}'.format(self.max_length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the point was to make sure a given variable would be limited to n characters, to reduce the memory usage of the variable(typically, a zip code).

Whether this was a good idea or not can be discussed, but removing this along the way in a bug fix PR is problematic IMO. After this PR, the max_length variable attribute is totally useless, yet it hasn't been deprecated, and is still documented. This is not a stable situation.

@fpagnoux
Copy link
Member

fpagnoux commented Oct 11, 2018

I'm also under the impression that this PR make France calculations crash.
When I try to run nosetests tests/test_basics.py in france, I get:
TypeError: startswith first arg must be str or a tuple of str, not numpy.bytes_

https://fr.openfisca.org/legislation/swagger currently works (both /calculate and /trace), but it's based on version 24.5.2, i.e. before this PR. I'm afraid it would crash if we upgraded Core to the current master.

I'd therefore recommend to revert this PR.

@Morendil
Copy link
Contributor

Morendil commented Oct 12, 2018

I can see France builds passing based on Core 24.5.4 (with the max-length condition restored).

(ETA: ugh, I spoke too soon. A side-effect of #737 is that openfisca/openfisca-france#1160 is required to make France tests go back to green. The builds that are passing are precisely the builds on that PR. The others are not passing.)

@bonjourmauko
Copy link
Member

I'd therefore recommend to revert this PR.

Fix #738 restores max_length, but does not reverts this PR's fix (it actually fixes the fix).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:fix Bugs are defects and failure demand.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants