headers management for multiparts in tabular resource #257
Sorry, my local tests were not working with python3...
That's better. @roll this PR is ready for your review.
Hi, thanks! I'm going to start testing it across the stack this week.
Awesome! Actually, I got an idea about this implementation. For a tabular multipart resource with headers, the implementation could check that the first line of each subsequent part is identical (at the byte level) to the first line of the first part before discarding it. If it is not, we could raise a no-header error rather than silently discarding data rows that were assumed to be header rows. What do you think?
@paulgirard I'm going to think on Monday about how to release it properly with regard to backward compatibility...
\o/
Hi @paulgirard, here are my thoughts. Could you please:
- compare the first row of the remaining chunks to the first row of the main one
  - if it matches
    - drop it from the data stream (as you have already implemented)
  - if it doesn't match
    - preserve it in the data stream
    - call warnings.warn with a UserWarning saying that it's a deprecated legacy multi-part mode for tabular data and that headers will be required starting with the next major version of datapackage
- probably add a TODO saying that in the next major version we need to remove this warning and compare headers to raise a non-matching header error
What do you think? Does it make sense?
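The steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: `iter_rows_checking_headers` is a hypothetical name, and `chunks` stands in for whatever per-chunk row iterators the library builds.

```python
import warnings


def iter_rows_checking_headers(chunks):
    """Yield rows from all chunks, dropping repeated header rows.

    Hypothetical helper: `chunks` is any iterable of row iterables.
    The first row of the first non-empty chunk is treated as the header.
    """
    header = None
    for chunk in chunks:
        rows = iter(chunk)
        first = next(rows, None)
        if first is None:
            continue  # empty chunk, nothing to yield
        if header is None:
            # First non-empty chunk: its first row is the header.
            header = first
            yield first
        elif first != header:
            # Legacy mode: the chunk has no header row, so keep the row.
            # TODO: in the next major version, raise an error instead.
            warnings.warn(
                'Multipart tabular chunks without a header row are deprecated',
                UserWarning,
            )
            yield first
        # When first == header the repeated header row is simply dropped.
        yield from rows
```

Note that a data row which happens to be identical to the header would also be dropped by this comparison; the sketch does not try to handle that case.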
Yeah, that's pretty close to what I had in mind in my last comment.
Great! Thanks a lot!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@roll I finally had time to add the new behaviour for the deprecated situation where chunks have no header, as you asked.
Hi @paulgirard, sure
Thanks. It looks good
Nice! Sorry this took so long and so many small commits for such a straightforward feature...
Great, thanks!
Overview
A multipart resource concatenates its chunks as-is, which is what we want for generic binary files.
But for a tabular resource, this default behaviour implies that if the first chunk has a header row, the other chunks must not.
We discussed this issue here: https://github.com/frictionlessdata/forum/issues/1
The proposal is to change this behaviour for tabular-data-package: tabular chunks should either all have headers or all have none, depending on dialect.header.
The current situation, with a header row only in the first chunk, is handled as before, but it raises a UserWarning because this situation will soon be deprecated.
Implementation
Multipart chunks are handled by the _MultipartSource class, which builds an iterator over the chunks' row iterators. My approach is simply to discard the first row of every chunk iterator except the first when the resource is tabular with a header.
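A minimal sketch of that approach, for illustration only: `iter_multipart_rows` and its `has_header` flag are hypothetical names, not the actual _MultipartSource code, and `chunks` is assumed to be an iterable of per-chunk row iterables.

```python
def iter_multipart_rows(chunks, has_header=True):
    """Yield rows from all chunks, keeping only the first chunk's header.

    Hypothetical stand-in for the chained iterator described above.
    """
    for index, chunk in enumerate(chunks):
        rows = iter(chunk)
        if has_header and index > 0:
            # Discard the repeated header row of every chunk but the first.
            next(rows, None)
        yield from rows
```

With `has_header=False` the chunks are concatenated unchanged, which corresponds to the generic multipart behaviour.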
@roll pointed out possible issues with:
Previously posted at #256
Please preserve this line to notify @roll (lead of this repository)