Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Breaking] Return missing if the field is set but null. #238

Merged
merged 14 commits into from
Oct 8, 2021

Conversation

yeesian
Copy link
Owner

@yeesian yeesian commented Aug 31, 2021

Fixes #177 .

I think this aligns the behavior a lot better across the different file formats (see the test sets in test/test_tables.jl)

This behavior is implemented through getfield(feature, i) which makes use of getfieldtype(feature, i). That way, we are aligned in behavior for field subtypes even for e.g. displaying the fields.

Because a lot hinges on distinguishing between whether a field is set versus whether it is null (see Toblerity/Fiona#460 (comment)), I have also added support for isfieldnull(), isfieldsetandnotnull(), and setfieldnull!().

Fixes #177 .

I think this aligns the behavior a lot better across the different file formats (see the test sets in test/test_tables.jl)

This behavior is implemented through `getfield(feature, i)` which makes use of `getfieldtype(feature, i)`. That way, we are aligned in behavior for field subtypes even for e.g. displaying the fields.

Because a lot hinges on distinguishing between whether a field is set versus whether it is null (see Toblerity/Fiona#460 (comment)), I have also added support for `isfieldnull()`, `isfieldsetandnotnull()`, and `setfieldnull!()`.
@yeesian yeesian requested review from visr and evetion August 31, 2021 03:05
@yeesian yeesian changed the title Return missing if the field is set but null. [Breaking] Return missing if the field is set but null. Aug 31, 2021
Copy link
Collaborator

@evetion evetion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/ogr/feature.jl Outdated Show resolved Hide resolved
@yeesian yeesian marked this pull request as draft September 1, 2021 02:11
@mathieu17g mathieu17g mentioned this pull request Sep 8, 2021
10 tasks
mathieu17g added a commit to mathieu17g/ArchGDAL.jl that referenced this pull request Sep 14, 2021
…tries, mixed float/int

Test with `nothing` skipped until PR yeesian#238 [Breaking] Return missing if the field is set but null. is merged
As mathieu has observed, `OGRUnsetMarker` and `OGRNullMarkerare` are mutually exclusive. We implement that case to return nothing if it ever comes up, but it is not possible for us to the corresponding code path.
@yeesian yeesian self-assigned this Sep 16, 2021
@yeesian yeesian marked this pull request as ready for review September 16, 2021 06:22
Because getdefault() is meant to be used only when setting fields for notnullable columns with missing values, we make it return `nothing` instead of `missing` to unset the field.
Copy link
Collaborator

@mathieu17g mathieu17g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made some suggestions on the src part and put on hold the test part's review, waiting for your feedback

docs/src/considerations.md Outdated Show resolved Hide resolved
src/ogr/feature.jl Outdated Show resolved Hide resolved
src/ogr/feature.jl Outdated Show resolved Hide resolved
src/ogr/feature.jl Show resolved Hide resolved
@yeesian yeesian removed the request for review from visr September 26, 2021 20:31
isfieldsetandnotnull() is just a convenience function.
…fieldtype when reading the field

(in the future, we will switch to dispatch on the fieldtype rather than the current dictionary approach, but that is out of the scope of this commit.)
@visr
Copy link
Collaborator

visr commented Sep 27, 2021

Nice work here! I wasn't aware of unset vs null in GDAL before. I believe null is much more common in practice?

I made this GeoJSONSeq file with both unset and null values:

{ "type": "Feature", "properties": { "a": 1, "b": null }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } }
{ "type": "Feature", "properties": {}, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } } 

If I run this through ogr with ogr2ogr -f GeoJSONSeq unset-gdal-translate.geojson unset.geojson, the result is the same, e.g. unset and null are both kept. If I import the file in QGIS however, it shows this:

image

And indeed if I export it from QGIS I get nulls back in field that were unset before:

{ "type": "Feature", "properties": { "a": 1, "b": null }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } }
{ "type": "Feature", "properties": { "a": null, "b": null }, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } } 

Probably it makes more sense to keep the distinction here just like ogr2ogr does, this being a GDAL wrapper. Though I do understand @evetion's point about making everything missing possibly being more user friendly.

@evetion
Copy link
Collaborator

evetion commented Sep 27, 2021

@visr This is all about displaying a non-tabular (sparse) dataset like your geojson in a tabular interface. In this case, QGIS doesn't have a representation of unset, just NULL. Julia can be more expressive, although I wonder whether our Tables interface likes the mixing of missing and nothing.

@visr
Copy link
Collaborator

visr commented Sep 27, 2021

Yeah exactly. I can imagine it being less ideal in the Tables interface. Though these functions are also used outside of that, in the sparse setting, where it does make sense to keep the distinction.

I think the interface can technically handle both just fine, it is just not very conventional. Though I don't know how common unset is in practice, and it's relatively easy to convert nothing to missing, if it causes problems.

@mathieu17g
Copy link
Collaborator

@evetion and @visr, in the draft PR #243 , I'm about to try to handle it for a conversion from a table source to an IFeatureLayer.
I was just waiting for this PR #238 to be merged.

In the other way, since we rely on Tables.jl/src/fallbacks.jl buildcolumns(::Nothing, rowitr::T) where {T} which iterates through all features building column types by "widening" them with promote_type each time it encounters a new type in field values, it should work as follow if we have an FieldDefn of OGRFieldType OFTInteger64 and features with UNSET or NULL fields for this FieldDefn:

julia> T = Int64; @show T
       T = promote_type(T, typeof(nothing)); @show T
       T = promote_type(T, typeof(missing)); @show T
T = Int64
T = Union{Nothing, Int64}
T = Union{Missing, Nothing, Int64}

@visr
Copy link
Collaborator

visr commented Sep 27, 2021

@mathieu17g thanks for showing that indeed technically it works (#243 is a great addition by the way).

I think for the tables interface specifically, the concerns about nothing are:

  • In SQL / JuliaData / Tables.jl packages generally only use missing and not nothing, potentially leading to confusion or errors if our tables are used in functions that only expect missing or data.
  • Union{Missing, Nothing, Int64} may not enable small union optimizations that Union{Missing, Int64} will have (would need to be checked).

Hence for tables I'm less enthousiastic about nothing than for just getfield. Though I'm not sure what's best either. If we go with nothing at worst we force people to explicitly convert to missing, which is not too bad of a price to pay to make the data perfectly round trippable and consistent with GDAL itself. Though if you know a way to optionally do the nothing to missing conversion during table construction instead of afterwards, potentially avoiding an expensive copy, it would be interesting as well.

@yeesian
Copy link
Owner Author

yeesian commented Sep 28, 2021

@evetion @visr is this PR good to go from your perspective? I'm okay with iterating further on feedback or for you to have more time to review if you'd like -- else I'll merge it to keep things moving for Mathieu per #238 (comment)

@mathieu17g
Copy link
Collaborator

mathieu17g commented Sep 28, 2021

Hence for tables I'm less enthousiastic about nothing than for just getfield. Though I'm not sure what's best either. If we go with nothing at worst we force people to explicitly convert to missing, which is not too bad of a price to pay to make the data perfectly round trippable and consistent with GDAL itself.

Yes, and if we need to convert all nothing values to missing later, it won't be difficult

Though if you know a way to optionally do the nothing to missing conversion during table construction instead of afterwards, potentially avoiding an expensive copy, it would be interesting as well.

This should be easily done in getcolumn function which is called for each field and geometry value by Tables.jl/src/fallback.jl Tables.columns function
For the Tables.rows, it can be done by implementing a slightly modified version of Tables.rows fallback function

function Tables.rows(x::AbstractFeatureLayer)
    cols = Tables.columns(x)
    return Tables.RowIterator(cols, Int(Tables.rowcount(cols)))
end

potentially avoiding an expensive copy

I didn't get what you have in mind. Tables.columns fallback constructor always make a copy.
I had not considered the option of not making a copy. Maybe it would allow to modify a dataset from a table (e.g. DataFrame). Brilliant idea!
It would probably require to convert field values in the table to something like Field{OGRFieldType, OGRFieldSubtype} and fully map GDAL objects interdependence in ArchGDAL composite types

Copy link
Collaborator

@visr visr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with merging this. Left two comments.

docs/src/considerations.md Outdated Show resolved Hide resolved
src/ogr/feature.jl Show resolved Hide resolved
@visr
Copy link
Collaborator

visr commented Sep 28, 2021

@mathieu17g we can discuss copies and such in a different issue. I still wanted to create one, because I saw that converting a layer to DataFrame for a large vector file took minutes.

@yeesian
Copy link
Owner Author

yeesian commented Sep 30, 2021

@mathieu17g we can discuss copies and such in a different issue. I still wanted to create one, because I saw that converting a layer to DataFrame for a large vector file took minutes.

I'm interested in this too! Feel free to open an issue for it or have the discussion in #243

Copy link
Collaborator

@mathieu17g mathieu17g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorough! :-)

@yeesian
Copy link
Owner Author

yeesian commented Oct 5, 2021

I'll be merging this PR after #245 is merged if I don't see any further objections by then.

@yeesian
Copy link
Owner Author

yeesian commented Oct 8, 2021

Seems like #245 is turning out to be more complicated than expected, and we're still figuring it out in JuliaGeo/GDAL.jl#124.

In the meantime, there are no remaining issues to resolve here, so I'll merge this PR to unblock #243.

@yeesian yeesian merged commit f27d7df into master Oct 8, 2021
@yeesian yeesian deleted the ogr-field-nullity branch October 8, 2021 04:10
mathieu17g added a commit to mathieu17g/ArchGDAL.jl that referenced this pull request Oct 9, 2021
…tries, mixed float/int

Test with `nothing` skipped until PR yeesian#238 [Breaking] Return missing if the field is set but null. is merged
mathieu17g added a commit to mathieu17g/ArchGDAL.jl that referenced this pull request Oct 9, 2021
…tries, mixed float/int

Test with `nothing` skipped until PR yeesian#238 [Breaking] Return missing if the field is set but null. is merged
mathieu17g added a commit to mathieu17g/ArchGDAL.jl that referenced this pull request Oct 10, 2021
- no difference for geometry columns. Both `nothing` and `missing` values map to an UNSET geometry field (null pointer)
- field set to NULL for `missing` values and not set for `nothing` values
mathieu17g added a commit to mathieu17g/ArchGDAL.jl that referenced this pull request Oct 12, 2021
…tries, mixed float/int

Test with `nothing` skipped until PR yeesian#238 [Breaking] Return missing if the field is set but null. is merged
mathieu17g added a commit to mathieu17g/ArchGDAL.jl that referenced this pull request Oct 12, 2021
- no difference for geometry columns. Both `nothing` and `missing` values map to an UNSET geometry field (null pointer)
- field set to NULL for `missing` values and not set for `nothing` values
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle NULL values in features correctly
4 participants