headers management for multiparts in tabular resource #257

paulgirard · 2020-03-04T14:49:16Z

fixes Improve header row handling in multipart tabular resource chunks #256

Overview

A multipart resource concatenates its chunks as-is, which is what we want for generic binary files.
But for a tabular resource this default behaviour implies that if the first chunk has a header row, the other chunks must not.
We discussed this issue here: https://github.com/frictionlessdata/forum/issues/1
The proposal is to change this behaviour for tabular-data-package: tabular chunks should either all have a header row or none, depending on dialect.header.

The current situation, with a header row only in the first chunk, is handled as before, but it raises a UserWarning because this situation will soon be deprecated.
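For illustration, here is a minimal sketch of a descriptor for such a multipart tabular resource, written as a Python dict. The resource name, file names and fields are hypothetical; only the `path` array and `dialect.header` matter here. Under the proposal, with `header` set to true, each of the three chunk files would start with the same header row.

```python
import json

# Hypothetical descriptor: the resource name, chunk file names and fields
# are made up for illustration and do not come from a real package.
descriptor = {
    "profile": "tabular-data-package",
    "resources": [
        {
            "name": "example-table",
            "profile": "tabular-data-resource",
            # A multipart resource lists its chunks as an array of paths.
            "path": ["chunk1.csv", "chunk2.csv", "chunk3.csv"],
            # With header set to True, every chunk is expected to start
            # with a header row; with False, none of them should.
            "dialect": {"header": True},
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "value", "type": "number"},
                ]
            },
        }
    ],
}

print(json.dumps(descriptor, indent=2))
```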

Implementation

Multipart chunks are handled by the _MultipartSource class, which builds an iterator over the chunks' row iterators. My approach is simply to discard the first row of every chunk iterator except the first when the resource is tabular with a header.
@roll pointed out possible issues with:

resource.raw_read: I've checked that this method uses the _MultipartSource iterator, so it should be safe
datapackage.save: I am not sure I understand the risk here, since data are only read from resources. Saving only writes the datapackage, not the data itself, right?
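The row-discarding idea described above can be sketched roughly as follows. This is not the actual _MultipartSource code: `iter_multipart_rows` and its `has_header` parameter are hypothetical names, and chunks are modelled as plain lists of rows rather than file streams.

```python
from itertools import chain, islice


def iter_multipart_rows(chunks, has_header=True):
    """Chain the rows of all chunks, keeping only the first chunk's header.

    Hypothetical helper for illustration; each chunk is any iterable of rows.
    """
    def rows_of(index, chunk):
        rows = iter(chunk)
        if has_header and index > 0:
            # Skip the first row of every chunk except the first one.
            rows = islice(rows, 1, None)
        return rows

    return chain.from_iterable(
        rows_of(index, chunk) for index, chunk in enumerate(chunks)
    )


chunks = [
    ["id,name", "1,a", "2,b"],
    ["id,name", "3,c"],
]
rows = list(iter_multipart_rows(chunks))
```

Note that this variant drops the first row of later chunks unconditionally, so a legacy chunk without a header would silently lose a data row.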

Previously posted at #256

Please preserve this line to notify @roll (lead of this repository)

relates frictionlessdata#256

paulgirard · 2020-03-04T15:53:56Z

Sorry, my local tests were not working with Python 3...
I have to solve those issues.

paulgirard · 2020-03-06T17:11:25Z

That's better.
I changed the way the header row is discarded from multipart streams to make it compatible with the Python 3.x buffer implementation.

@roll this PR is ready for your review

roll · 2020-03-09T16:12:00Z

Hi, thanks! I'm going to start testing it across the stack this week.

paulgirard · 2020-03-10T15:24:48Z

Awesome !

Actually I got an idea about this implementation.
This implementation requires all the parts of a tabular datapackage resource to have the same header configuration: either all have headers or none do.
This is a breaking change compared to the previous implementation.
If you'd like to limit the pain for users of complying with this breaking change, we could add some magic. I use "magic" as a signal of "maybe not such a good idea IMHO".
But I'll put it out there anyway.

In the case of a with-header tabular multipart resource, the implementation could test whether the first line of each part is the same as the first line of the first part (at the byte level) before discarding it. If not, we could raise a no-header error rather than silently discard data rows that were assumed to be header rows.
This feature would avoid breaking the loading of older, inconsistently headered tabular multipart resources.
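A rough sketch of that byte-level check, assuming each part is available as a binary stream. `iter_lines_strict` and `InconsistentHeaderError` are hypothetical names for illustration, not part of datapackage-py.

```python
import io


class InconsistentHeaderError(ValueError):
    """Hypothetical error: a later part does not start with the header."""


def iter_lines_strict(streams):
    """Yield lines from all parts, checking headers at the byte level."""
    header = None
    for index, stream in enumerate(streams):
        first = stream.readline()
        if index == 0:
            # The first part's first line is the reference header.
            header = first
            yield first
        elif first.rstrip(b"\r\n") != header.rstrip(b"\r\n"):
            # Refuse to discard a line that is not the header.
            raise InconsistentHeaderError(
                "part %d does not start with the header row" % index
            )
        # Matching header lines in later parts are dropped; the rest is kept.
        for line in stream:
            yield line


parts = [io.BytesIO(b"id,name\n1,a\n"), io.BytesIO(b"id,name\n2,b\n")]
lines = list(iter_lines_strict(parts))
```

Because this is a generator, the error is only raised when iteration reaches the offending part, not when the function is called.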

What do you think ?

roll · 2020-03-13T13:30:49Z

@paulgirard
It seems the implementation is great 👍
I was afraid that it would be much harder.

I'm going to think on Monday how to release it properly regarding backward-compatibility...

paulgirard · 2020-03-13T13:55:30Z

\o/
Glad there were no hidden issues!
Very happy to help the datapackage community (starting with my own projects :)

roll

Hi @paulgirard, here are my thoughts. Could you please:

compare the first row of each remaining chunk to the header row of the first chunk
if it matches:
- drop it from the data stream (as you have already implemented)
if it doesn't match:
- preserve it in the data stream
- warnings.warn a UserWarning saying that this is a deprecated legacy multipart mode for tabular data and that headers will be required in every chunk from the next major version of datapackage
- probably add a TODO saying that in the next major version we need to remove this warning and compare headers in order to raise a non-matching header error

What do you think? Does it make sense?
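The steps listed above could look roughly like this. It is a sketch under assumptions, not the merged implementation: `iter_lines_lenient` is a hypothetical name, parts are modelled as binary streams, and the warning text is made up.

```python
import io
import warnings


def iter_lines_lenient(streams):
    """Yield lines from all parts, dropping repeated headers only."""
    header = None
    for index, stream in enumerate(streams):
        first = stream.readline()
        if index == 0:
            header = first
            yield first
        elif first.rstrip(b"\r\n") == header.rstrip(b"\r\n"):
            # Repeated header row: drop it from the data stream.
            pass
        else:
            # TODO (next major version): remove this warning and raise a
            # non-matching header error instead.
            warnings.warn(
                "Multipart tabular chunks without a header row are a "
                "deprecated legacy mode; headers will be required in "
                "every chunk in the next major version.",
                UserWarning,
            )
            if first:
                # Legacy chunk: its first line is data, so keep it.
                yield first
        for line in stream:
            yield line


legacy_parts = [io.BytesIO(b"id,name\n1,a\n"), io.BytesIO(b"2,b\n")]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_lines = list(iter_lines_lenient(legacy_parts))
```

One limitation of any comparison-based approach: a legacy chunk whose first data row happens to equal the header cannot be told apart from a repeated header and would be dropped.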

paulgirard · 2020-03-18T08:43:36Z

Yeah that's pretty close to what I had in mind in my last comment.
This will ease adoption of the breaking change.
I'll do that in a few days.

roll · 2020-03-18T11:15:43Z

Great! Thanks a lot!

stale · 2020-06-23T17:36:08Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

paulgirard · 2020-08-13T09:52:19Z

@roll I finally had time to add the new behaviour for the deprecated situation where chunks have no header row, as you asked.
Please consider reopening this PR.
Sorry for the delay.

roll · 2020-08-14T06:29:10Z

Hi @paulgirard, sure

roll

Thanks. It looks good

paulgirard · 2020-08-14T12:48:52Z

Nice! Sorry this took so long and so many small commits for such a straightforward feature...
Glad to have contributed to a better datapackage-py ;)
If anyone needs a use case for this feature, see: http://github.com/medialab/ricardo_data or http://github.com/medialab/toflit18_data (datapackage-py not yet completely integrated)

roll · 2020-08-14T12:56:33Z

Great, thanks!

paulgirard added 2 commits February 5, 2020 16:28

discarding header row of tabular multipart chunks

203693f

relates frictionlessdata#256

Merge remote-tracking branch 'upstream/master' into multipart_headers

030593a

using streams correctly in python 3.x

aa98d30

roll requested changes Mar 17, 2020

stale bot added the wontfix label Jun 23, 2020

stale bot closed this Jul 24, 2020

paulgirard added 2 commits August 13, 2020 10:59

Merge branch 'master' into multipart_headers

98bb6e4

Remove first row in chunk only if != header\n see https://github.com/…

b549d1d

…frictionlessdata/datapackage-py/pull/257#pullrequestreview-375804017\n and https://github.com/frictionlessdata/forum/issues/1

paulgirard added 4 commits August 13, 2020 16:07

bug in row iteration

274adc8

better warning message

9724117

use iterator for streams in multipart fixes frictionlessdata#246

c8d3ed9

stupid merge ... ?

513a316

roll reopened this Aug 14, 2020

stale bot removed the wontfix label Aug 14, 2020

paulgirard added 2 commits August 14, 2020 12:18

Merge branch 'master' into multipart_headers

b4bd78f

linter compatible

028df97

roll approved these changes Aug 14, 2020

roll merged commit ab08bd3 into frictionlessdata:master Aug 14, 2020
