Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How are property values that are objects/arrays encoded in vector tiles? #75

Open
mapsam opened this issue May 3, 2016 · 39 comments
Open

Comments

@mapsam
Copy link
Contributor

mapsam commented May 3, 2016

It would be great to add an example of encoding an Array or Object from the original GeoJSON to the PBF value (from the 4.5 example) to make it clear how the v2 spec should encode these values. Are the values encoded as string values and required to be parsed by the user?

i.e. for an array value:

values: {
  string_value: "[\"one\",\"two\",3]"
}

Related to https://github.com/mapbox/vector-tiles/issues/12

cc @flippmoke @danswick

@jfirebaugh
Copy link
Contributor

My understanding is that array and object values are not part of the specification, and if you want to use them you need to invent your own serialization protocol and ensure that the encoders and decoders in your part of the ecosystem understand it. In other words, there is no "should" here at the level of the VT spec.

@mapsam
Copy link
Contributor Author

mapsam commented May 16, 2016

Thanks @jfirebaugh - this is the case.

@mapsam mapsam closed this as completed May 16, 2016
@e-n-f
Copy link
Contributor

e-n-f commented May 17, 2016

Mapbox Studio Classic and recent versions of Tippecanoe write them as stringified JSON, with no way to tell whether the value originally came from a string or an object.

@mapsam
Copy link
Contributor Author

mapsam commented May 17, 2016

with no way to tell whether the value originally came from a string or an object

Interesting @ericfischer. Does that mean parsers must try to parse as JSON first, then assume it's a string if the JSON cannot be parsed?

@e-n-f
Copy link
Contributor

e-n-f commented May 17, 2016

@mapsam I think most parsers just treat them as strings and don't make any attempt to reconstitute objects. But @ingalls may have more details on what ultimately happens with the values that prompted mapbox/tippecanoe#155.

@mapsam
Copy link
Contributor Author

mapsam commented May 17, 2016

Ah, so it's up the user then to know their string/object-string values. Makes sense!

@mcwhittemore
Copy link

Here is one place this conversation is happening in mapbox-gl-js.

@ingalls
Copy link

ingalls commented May 17, 2016

@mapsam our use case is such that we expect either an array or a stringified array given the property name so for us parsing is relatively easy.

@mcwhittemore
Copy link

@mapsam mentioned reopening this to have a convo about adding how to handle objects and arrays into the spec.

@flippmoke what are your thoughts on adding this in?

@mcwhittemore mcwhittemore reopened this May 17, 2016
@pnorman
Copy link

pnorman commented Aug 12, 2016

+1 to adding it in - multiple people are adding ways to represent lists (e.g. tilezen/mapbox-vector-tile#64)

@nyurik
Copy link

nyurik commented Sep 26, 2016

I have been talking to a large number of people about this at SOTM 2016. The biggest hurdle to store the unlimited key-values (object/dictionary/hstore) and the array support is that Mapnik is fairly rigid in terms of non-simple-type columns, and the number of those columns, so adding this support to Mapnik would require substantial efforts. I would like to propose both the Mapnik workaround as well as the spec change.

To generate an object or an array in vtile using Mapnik, one could generate the data as a single JSON-encoded string value, and add a custom server-side vector post-processing step that decodes JSON and converts it into values. This way, "name": {"en":"London","hu":"London","fr":"Londres"} would be stored as a long string, with multiple identical strings repeating, and the post-processor would convert it to 3 individual tags - "name:en":"London", "name:hu":"London", "name:fr":"Londres" (thus deduping the values).

There are two ways to actually store it:

  • Simple way that does not alter this spec -- simply add all keys and values as the layer message's keys and values, and update the tags fields. This might break some software's expectation that the list of tag columns is predefined, but otherwise no spec change is needed. One drawback of this approach is that the keys would be much longer - "name:en" instead of "en". Once implemented, a collate() function needs to be implemented to handle multiple languages - collate(tags,"name:fr","name:hu","name:en") would get the first existing value.
  • A more complex way is to extend the schema with two additional value types: dictionary and list.

The Value message should have two additional fields:

    // A list of key-value pairs. Just like Feature's "tags", the repeated pairs
    // of integers point to Layer's "keys" (1st integer) and "values" (2nd integer).
    repeated uint32 object_value = 8 [ packed = true ];

    // An array of values. Similar to Feature's "tags", each integer represents
    // an index of the Layer's "values" array.
    repeated uint32 list_value = 9 [ packed = true ];

So if the value contains object_value, the odd integers are the indexes of the Layer's keys array, and the even indexes refer to the Layer's values array. Arrays are implemented the same way, only this time list_value indexes only refer to the Layer's values array.

One of the drawback is that it is now possible to have circular dependencies, so to avoid this and simplify client's data validation, data may only be two levels deep (root+one sublevel). Another unenforced but possibly valuable requirement might be to restrict all values in both object and list to have the same type.

@flippmoke
Copy link
Member

Sorry for my lack of responses on this ticket: Let me summarize some thoughts I have on this currently.

Mapnik Free Vector Tiles

I would like to propose both the Mapnik workaround as well as the spec change.

Eventually we do want to pull out all Mapbox vector tile encoding and decoding such that it is not required to be a part of Mapnik. In general mapnik-vector-tile issues should probably be discussed there. Here on the spec trying to specifically just cover ideas specific to the specification.

Lists

List support could be added by changing section 4.4 - which currently reads as follows:

Feature attributes are encoded as pairs of integers in the tag field of a feature. The first integer in each pair represents the zero-based index of the key in the keys set of the layer to which the feature belongs. The second integer in each pair represents the zero-based index of the value in the values set of the layer to which the feature belongs. Every key index MUST be unique within that feature such that no other attribute pair within that feature has the same key index. A feature MUST have an even number of tag fields. A feature tag field MUST NOT contain a key index or value index greater than or equal to the number of elements in the layer's keys or values set, respectively.

In theory we could allow keys not to be unique per a feature and if more then one occurs we assume the item to be a list. This is not ideal for compression necessarily but could solve the problem.

I do not think this change could qualify as anything but a major version incrementation (though we could debate on this).

Objects

Objects scare me quite a bit more because we suddenly are adding depth "values". This could be something that was very bad for performance overall if not implemented with thought and with some testing.

Example Implementation:

       message Object {
               // Same as feature tags currently
               repeated uint32 tags = 2 [ packed = true ];
       }

       message Value {
                // Exactly one of these values must be present in a valid message
                optional string string_value = 1;
                optional float float_value = 2;
                optional double double_value = 3;
                optional int64 int_value = 4;
                optional uint64 uint_value = 5;
                optional sint64 sint_value = 6;
                optional bool bool_value = 7;
                optional Object object_value = 8; 
                extensions 9 to max;
        }

Final Thoughts

I am going to leave this ticket open for continued discussion. Thanks for the feedback everyone!

@flippmoke
Copy link
Member

@nyurik part of the challenge here with both of our ideas is that that means that all non object values MUST be encoded first. This would put a variety of different requirements on both the encoders and decoders.

@nyurik
Copy link

nyurik commented Sep 27, 2016

@flippmoke could you elaborate on your last comment? I'm not sure I understood.

@flippmoke
Copy link
Member

flippmoke commented Sep 27, 2016

@nyurik

Lets just assume the following structure:

{
    "key1": value1(int)
    "key2": value2(object) {
           "key3": value3(object) {
                 "key4": value4(int),
                 "key5": value5(int)
           },
           "key6": value6(int)
     },
     "key7": value7(int)
}

Because Value now has an "Object" that must know the index of keys and values. The order of encoding becomes critical.

  • value4 and value5 would have to be in the before value3.
  • value3 and value6 would have to be encoded before value2.

Edit: by encoded here I mean placed into values on the Layer.

                // Dictionary encoding for keys
                repeated string keys = 3;

                // Dictionary encoding for values
                repeated Value values = 4;

@nyurik
Copy link

nyurik commented Sep 27, 2016

@flippmoke ah, yes, we (@pnorman and I) were discussing this. I don't think we actually want to allow unlimited nest-ness. It has very limited practical value, whereas having just one level of unlimited key-value storage has a huge practical benefit - primarily for storing all possible translations of a given object (key=name, value={language-code => string}. The moment we put this restriction in place, the whole system becomes much more manageable and easy to implement:

  • No need to validate invalid recursive tags (value3 being the same as value2 in your example)
  • Encoding all the object's keys & values is a linear process - you add them to the Layer's keys & values, record their location, and save it to tags. This is identical to how the tags field is being created now.

@e-n-f
Copy link
Contributor

e-n-f commented Sep 27, 2016

If we're going to allow nesting at all, and it requires a version bump, why not allow arbitrary JSON objects?

Values are already tagged with an integer for their type, and only 1 through 7 are defined. So type 8 could denote that the value is an array (encoded as an integer array of sub-values) and type 9 denote that it is a hash (encoded as an integer array of sub-keys alternating with sub-values).

So a feature with this attribute:

"nested" : { "one" : "two", "three" : [ 4, 5, 6 ] }

would have these keys in the layer:

0. nested
1. one
2. three

and these values:

0. two (string)
1. 4 (int)
2. 5 (int)
3. 6 (int)
4. 1,2,3 (list)
5. 1,0,2,4 (hash)

with an outer key-value list of

0,5

@e-n-f
Copy link
Contributor

e-n-f commented Sep 27, 2016

I just realized that this was what @nyurik already said. Sorry for the needless duplication.

@nyurik
Copy link

nyurik commented Sep 27, 2016

@ericfischer I'm not sure it's the same - I actually think that we should NOT allow arbitrary nested levels, as it would create too many encoding and decoding problems, while not giving us much practical benefits. After all, vector tiles is a very domain-specific data storage format. My proposal is either to not change the spec at all (except that now a vector tile would have any number of non-predefined keys), or, if we change the spec, we only allow two new types (list and object) at the top level. Neither list nor object will allow storing other lists or objects inside of them for the reasons outlined above.

@nyurik
Copy link

nyurik commented Sep 28, 2016

Encoding example

{ "Features": [
  { "str": "aaa",
    "dic": { "en": "aaa", "fr": "bbb"},
    "arr": [ 100, 200 ]
  }, {
    "dic": { "en": "ccc", "gr": "bbb" },
    "arr": [ 100, 300 ]
  }
]}

VTile structure:

      # 0      1     2     3      4      5
keys: [ "str", "en", "fr", "dic", "arr", "gr" ]
values: [
  "aaa"     (string) # 0
  "bbb"     (string) # 1
                     #    key value pairs, value MUST NOT be object or array,
  [1,0,2,1] (object) # 2  all value indexes MUST be less than the index of this element.
  100       (int)    # 3
  200       (int)    # 4
  [3,4]     (array)  # 5  list of value indexes. Same restrictions as for Object
  "ccc"     (string) # 6
  [1,6,5,1] (object) # 7
  300       (int)    # 8
  [3,8]     (array)  # 9
]
features: [
  { tags: [0,0, 3,2, 4,5] },
  { tags: [3,7, 4,9] }
]

@flippmoke, I'm not so sure that repeating keys to make a list is better than adding a new value type. The decoder would need to do an additional key lookup for every new key, to make sure it hasn't seen it before (performance), and because the key needs to be repeated (compression).
UPDATE: @flippmoke yes, the values contained in the object have to be encoded before the object itself (same for arrays). But because nesting is not allowed, this would not be a problem.

@e-n-f
Copy link
Contributor

e-n-f commented Sep 28, 2016

Maybe to make validation easier, there could be a rule that a value can only refer to values that are numbered lower than itself? This would prevent reference loops.

@nyurik
Copy link

nyurik commented Sep 28, 2016

@ericfischer, while it would make the validation easier, it would also make encoding slightly slower and harder. Are there any practical reasons to allow infinite nesting? @pnorman mentioned that allowing nesting inside arrays would make multidimensional (Jagged) arrays possible, but I haven't seen a usage example yet.

@e-n-f
Copy link
Contributor

e-n-f commented Sep 28, 2016

I have seen one case where someone embedded an entirely different GeoJSON feature as an attribute of a GeoJSON feature. I don't know how common this need is in reality though.

From my perspective, guaranteeing that sub-values will be numbered below the value that refers to them seems easy, because you can't build your list of references until you have already assigned IDs to the sub-values, so the ID that then gets assigned to the list of references, once it is complete, will have to be greater than the IDs that were assigned earlier in the process to the things that went into it. But there might be reasons to want to write it the other way around instead. (I should try writing an implementation of this to make sure I'm not missing something important.)

@e-n-f
Copy link
Contributor

e-n-f commented Sep 28, 2016

@nyurik Whether or not it is a good idea, here is my experimental implementation of it in Tippecanoe: mapbox/tippecanoe#303

@nyurik
Copy link

nyurik commented Sep 29, 2016

@ericfischer, thanks. I will attempt to do the same with the https://github.com/mapbox/vector-tile-js .

While we are introducing a major change, question for proto-buf gurus: does it make sense to make keys and values to have lower ID number than features? Would it force protobuf to store features at the end, ensuring that a parser could read content sequentially, instead of skipping features and then coming back? I just looked at the https://github.com/mapbox/pbf/blob/master/index.js#L183 - sees skipping values is a linear traversal.

@flippmoke
Copy link
Member

@ericfischer @nyurik could you both drop sample .proto files to make it easier to understand exactly what you are doing

@pnorman
Copy link

pnorman commented Oct 1, 2016

I also added a null type as 10 so that all the JSON types are covered.

I recommend not adding null


@flippmoke Can you create a dev branch in the repo which is the same as the master, except with 2.1 copied to 3.0? This will let us make PRs against it

@flippmoke
Copy link
Member

@pnorman yes, I will setup a 3.0 dev branch. I have a feeling that a 3.0 branch will be much harder to find a consensus though compared to the 2.0. So don't expect any quick merges

@flippmoke
Copy link
Member

@nyurik
Copy link

nyurik commented Oct 3, 2016

I just re-posted my proposed change as #89. In the future, we should switch to 4 space tabs - easier to view diffs

@e-n-f
Copy link
Contributor

e-n-f commented Oct 3, 2016

@pnorman I have no objection to dropping key-value pairs where the value is null, which Tippecanoe (and, I think, Mapnik) does now when it encounters them. What do you think the right thing is to do when someone specifies null as an element of an array, though?

@kkaefer
Copy link
Member

kkaefer commented Jun 2, 2017

I recommend not adding null

@pnorman why is that?

@ericfischer Our current JS implementation interprets a missing value (e.g. a pbf message with no field set) as null, and I think that'd be preferable to having an explicit null value.

@e-n-f
Copy link
Contributor

e-n-f commented Jun 5, 2017

Thanks @kkaefer. I've updated mapbox/tippecanoe#303 to encode nulls as empty messages rather than as their own type.

@e-n-f
Copy link
Contributor

e-n-f commented Aug 14, 2018

The problem with this that I just discovered is that empty packed arrays in protocol buffers are indistinguishable from an empty message, so empty lists and hashes are mistaken for null values. I think we need magic empty-list and empty-hash messages (or a magic value index that means the same).

@e-n-f
Copy link
Contributor

e-n-f commented Aug 14, 2018

In mapbox/tippecanoe@9319f2e1b2022 I propose:

                // 0 signifies an empty hash, 1 signifies an empty list
                optional uint64 empty_value = 10;

as a way to represent empty lists and hashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants