INDEX is currently not collision-free #1444
Comments
I would argue that the current behavior is not intended, or at least that it shouldn't be. |
Here is a collision-free version of INDEX that is also efficient (i.e., it does NOT call tojson on strings). The idea is to use a dictionary with paths of the form [type, str] rather than just [str]. To recover the values in the dictionary, one would therefore use .[][] rather than just .[].
Example:
yields:
|
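The [type, str] dictionary described in the comment above can be modelled outside jq. The following is only a sketch in Python, not the jq definition that was posted: the names `jq_type`, `encode`, `type_index` and `index_values` are hypothetical, and `json.dumps` stands in for jq's tojson.

```python
import json


def jq_type(value):
    """Return the jq-style type name of a JSON value (hypothetical helper)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before number: bool is an int in Python
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def encode(value):
    """Stand-in for tojson: compact JSON text with sorted object keys."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True)


def type_index(items, key=lambda item: item):
    """Build a two-level dictionary addressed by [type, str].

    Strings are used as keys directly (no JSON encoding); other values
    are encoded. The outer type level keeps 1 and "1" apart.
    """
    index = {}
    for item in items:
        k = key(item)
        name = k if isinstance(k, str) else encode(k)
        index.setdefault(jq_type(k), {})[name] = item
    return index


def index_values(index):
    """Recover the indexed values: the analogue of .[][] instead of .[]."""
    return [v for per_type in index.values() for v in per_type.values()]


idx = type_index([1, "1"])
print(idx)                # {'number': {'1': 1}, 'string': {'1': '1'}}
print(index_values(idx))  # [1, '1']
```

The extra nesting level is the price of skipping the encoding step for strings, which is why a plain `.[]` no longer yields the indexed values.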
@pkoppstein |
The index it builds should have arrays of values as values. That will have knock-on effects on |
A |
Yes, but is it too late for radical surgery for INDEX, given that it has been used quite extensively? If radical surgery it is to be, then I would offer this collision-free version of UNIQUE_INDEX:
Test cases:
|
@pkoppstein Was it in 1.5? If not... |
I think you know full well that you introduced them in Jan this year!
Please note that the "collision" issue that I raised refers to the collision between different JSON values -- e.g. the number 1 and the string "1". Apart from collisions in this sense, the current definition of INDEX is quite faithful to the jq philosophy as exhibited by
In my own work, I use the term "buckets" for array-valued dictionaries that gather items with the same "index". In my view, jq should support both buckets and the "INDEX" family of functions.
Since we're all agreed that INDEX as currently conceived should be collision-free (in the sense that we don't want different values to collide), and since TYPEINDEX defined above breaks the "contract" that INDEX provides -- namely that INDEX()[] should provide the indexed values -- I would propose the following small revisions to the current INDEX functions:
|
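For readers unfamiliar with the "buckets" terminology used above, here is a rough Python model of an array-valued dictionary that gathers items sharing the same index. It is an illustration only, not any of the jq definitions proposed in this thread; `encode` and `bucketize` are hypothetical names, and keys are JSON-encoded unconditionally so that 1 and "1" land in different buckets.

```python
import json


def encode(value):
    """Stand-in for tojson: compact JSON text with sorted object keys."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True)


def bucketize(items, key=lambda item: item):
    """Group items into lists keyed by the encoded index value.

    Unlike an INDEX-style dictionary, nothing is overwritten: every item
    with the same index is appended to that index's bucket.
    """
    buckets = {}
    for item in items:
        buckets.setdefault(encode(key(item)), []).append(item)
    return buckets


rows = [
    {"id": 1, "v": "a"},
    {"id": "1", "v": "b"},
    {"id": 1, "v": "c"},
]
buckets = bucketize(rows, key=lambda row: row["id"])
# The number 1 and the string "1" get separate buckets:
#   '1'   -> rows with id 1 (two of them)
#   '"1"' -> the row with id "1"
print(buckets)
```

An INDEX-style dictionary over the same rows would keep only the last row per key, which is the difference between the two families of functions discussed here.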
@pkoppstein But these are meant to be SQLish. A non- |
@pkoppstein I do like your method of avoiding applying |
Which one - using I suspect the former would be slightly better overall in terms of performance, but have no empirical evidence. Apart from that, are we going to have both INDEX as a bucketizer (*), and INDEX_UNIQUE as discussed before? (*) That is, on the assumption that
|
@pkoppstein I'm thinking of something like:
|
The EDIT: But the docs should make this clear. |
(1) I'm afraid your JOIN baffles me. I'm obviously missing something that seems obvious to you, so would you please show how you'd use JOIN to solve the problem posed by the OP at #1475?
(2) While your attention is focused on these SQLish operators, could I please ask you to look at
The first of these is sorely needed because the prospects for fixing Tx! |
@pkoppstein Er, some of the
|
With the
The SQLish defs I'm using:
|
@pkoppstein I left commentary on #1440. |
:) |
Consider for example:
1
Currently, INDEX uses tojson except when the key is a string, presumably for efficiency.
If INDEX should be collision-free, is there a better alternative than using tojson unconditionally?
If the current implementation is intended, then the manual should provide a prominent warning about collisions.
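The collision can be reproduced in a small Python model of the behavior described here. This is a sketch under assumptions, not jq's source: `index_current` mimics "tojson except when the key is a string", `index_tojson` mimics applying tojson unconditionally, and `json.dumps` stands in for tojson. All three function names are hypothetical.

```python
import json


def encode(value):
    """Stand-in for tojson: compact JSON text with sorted object keys."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True)


def index_current(items, key=lambda item: item):
    """Model of the behavior described above: strings are used as keys
    as-is, every other value is encoded first."""
    index = {}
    for item in items:
        k = key(item)
        name = k if isinstance(k, str) else encode(k)
        index[name] = item  # a later item silently replaces an earlier one
    return index


def index_tojson(items, key=lambda item: item):
    """Model of the unconditional alternative: every key is encoded."""
    index = {}
    for item in items:
        index[encode(key(item))] = item
    return index


# The number 1 encodes to the text 1, which is also the raw string "1",
# so both items map to the same key and the first one is lost.
print(index_current([1, "1"]))  # {'1': '1'}

# Encoding the string too gives it the key "1" with quotes, so both survive.
print(index_tojson([1, "1"]))   # {'1': 1, '"1"': '1'}
```

The unconditional variant avoids the collision at the cost of encoding every string key, which is the efficiency trade-off the question is about.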