INDEX is currently not collision-free #1444

pkoppstein · 2017-07-07T20:25:07Z

Consider for example:

jq -n '[{id:1, x:1}, {id:"1", x:2}] | INDEX(.id) | length'

1

Currently, INDEX usestojson except when the key is a string, presumably for efficiency.

If INDEX should be collision-free, is there a better alternative than using tojson unconditionally?

If the current implementation is intended, then the manual should provide a prominent warning about collisions.

The text was updated successfully, but these errors were encountered:

ghost · 2017-08-13T15:21:10Z

I would argue that the current behavior is not intended, or at least that it shouldn't be.

pkoppstein · 2017-08-18T19:36:18Z

Here is a collision-free version of INDEX that is also efficient (i.e., it does NOT call tojson on strings). The idea is to use a dictionary with paths of the form [type, str] rather than just [str]. To recover the values in the dictionary, therefore, one would use .[][] rather than just .[]

def TYPEINDEXOF(f):
   f as $x | ($x|type) as $xt | [$xt, if $xt == "string" then $x else $x|tojson end];

def TYPEINDEX(stream; f):
  reduce stream as $s ({}; setpath($s|TYPEINDEXOF(f); $s));

def TYPEINDEX(f): TYPEINDEX(.[]; f);

Example:

[1, {a:2}, "1"] | TYPEINDEX(.) | .[][]

yields:

 1
 {"a":2}
 "1"

nicowilliams · 2017-12-11T19:55:27Z

@pkoppstein to_json is used only because the index is kept as an object (which accepts only strings as keys). Not applying to_json to strings is an optimization, but wrong.

nicowilliams · 2017-12-11T19:57:04Z

INDEX right now does look like something of a unique index, minus the errors one would expect on collision.

The index it builds should have arrays of values as values. That will have knock-on effects on JOIN.

nicowilliams · 2017-12-11T19:57:34Z

A UNIQUE_INDEX would also be nice -- it should raise an error when uniqueness is violated.

pkoppstein · 2017-12-11T21:46:14Z

A UNIQUE_INDEX would also be nice

Yes, but is it too late for radical surgery for INDEX, given that it has been used quite extensively?

If radical surgery it is to be, then I would offer this collision-free version of UNIQUE_INDEX:

def UNIQUE_INDEX(stream; idx_expr):
  def it: type[0:1] + tostring;
  reduce stream as $row ({};
    ($row|idx_expr|it) as $k
    | if has($k) then error("UNIQUE_INDEX constraint violation: key = \($row|idx_expr)")
      else .[$k] = $row end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);

Test cases:

UNIQUE_INDEX(.) | [.[]]
[1,{a:2},"1"]
[1,{"a":2},"1"]

try UNIQUE_INDEX(.) catch 0
[1,{a:2},"1"]
0

nicowilliams · 2017-12-11T21:48:37Z

@pkoppstein Was it in 1.5? If not...

pkoppstein · 2017-12-11T22:34:56Z

Was it in 1.5? If not...

I think you know full well that you introduced them in Jan this year!

Please note that the "collision" issue that I raised refers to the collision between different JSON values -- e.g. the number 1 and the string "1". Apart from collisions in this sense, the current definition of INDEX is quite faithful to the jq philosophy as exhibited by add on JSON objects -- the last object gets the last laugh.

In my own work, I use the term "buckets" for array-valued dictionaries that gather items with the same "index". In my view, jq should support both buckets and the "INDEX" family of functions.

Since we're all agreed that INDEX as currently conceived should be collision-free (in the sense that we don't want different values to collide), and since TYPEINDEX defined above breaks the "contract" that INDEX provides -- namely that INDEX()[] should provide the indexed values -- I would propose the following small revisions to the current INDEX functions:

def INDEX(stream; idx_expr):
  def it: type[0:1] + tostring;
  reduce stream as $row ({}; .[$row|idx_expr|it] = $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);

nicowilliams · 2017-12-11T23:23:10Z

@pkoppstein But these are meant to be SQLish. A non-UNIQUE index in a typical RDBMS allows multiple rows to have the same index key. So should the INDEX functions in jq -- "last wins" (or "first wins") is no fun if you're trying to implement relational semantics :)

nicowilliams · 2017-12-11T23:26:42Z

@pkoppstein I do like your method of avoiding applying tojson to strings.

pkoppstein · 2017-12-11T23:54:42Z

your method of avoiding applying tojson to strings

Which one - using type[0:1] + tostring or TYPEINDEXOF(f)?

I suspect the former would be slightly better overall in terms of performance, but have no empirical evidence.

Apart from that, are we going to have both INDEX as a bucketizer (*), and INDEX_UNIQUE as discussed before?

(*) That is, on the assumption that def it is used to avoid applying tojson on strings:

# buckets
def INDEX(stream; f):
  def it: type[0:1] + tostring;
  reduce stream as $row ({}; .[$row|f|it] += $row);
def INDEX(f): INDEX(.[]; f);

nicowilliams · 2017-12-12T00:01:22Z

@pkoppstein I'm thinking of something like:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|(type[0:1] + tostring)] |= .+[$row]);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def UNIQUE_INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row |(type[0:1] + tostring)] |=
      if .!= null then "duplicate key error; existing row \(.) conflicts with row \($row)"|error
      else [$row] end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);
def LOOKUP($idx; idx_expr): $idx[idx_expr|(type[0:1] + tostring)]|if .==null then null else .[] end;
def JOIN($idx; idx_expr):
  [.[] | (LOOKUP($idx; idx_expr)) as $row | [., $row]];
def JOIN($idx; stream; idx_expr):
  stream | ($idx[idx_expr][]) as $row | [., $row];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | ($idx[idx_expr][]) as $row | [., $row] | join_expr;

nicowilliams · 2017-12-12T00:05:21Z

The JOIN is an outer join, so maybe it should be named a bit better, but I think this is probably fine since it's trivial to filter it so that it is an inner join.

EDIT: But the docs should make this clear.

pkoppstein · 2017-12-12T04:03:51Z

@nicowilliams -

(1) I'm afraid your JOIN baffles me. I'm obviously missing something that seems obvious to you, so would you please show how you'd use JOIN to solve the problem posed by the OP at #1475 ?

(2) While your attention is focused on these SQLish operators, could I please ask you to look at IN in the light of PR #1440 ? I've been using the proposed update for IN given there for months:

def IN(s): . as $in | first( if (s == $in) then true else empty end ) // false;
def IN(src; s): any( src|IN(s); .);

The first of these is sorely needed because the prospects for fixing index/1 in the near future seem so dim.

Tx!

nicowilliams · 2017-12-12T04:49:23Z

@pkoppstein Er, some of the JOIN need to change to use LOOKUP, but this works:

$ ./jq -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | .assets as $assets | .descriptions | JOIN(INDEX($assets[]; .assetid); .classid)'
[
  [
    {
      "classid": 1,
      "instanceid": 5,
      "name": "alpha"
    },
    {
      "assetid": 1,
      "classid": 1,
      "instanceid": 5
    }
  ],
  [
    {
      "classid": 2,
      "instanceid": 3,
      "name": "beta"
    },
    {
      "assetid": 2,
      "classid": 1,
      "instanceid": 5
    }
  ],
  [
    {
      "classid": 2,
      "instanceid": 1,
      "name": "gamma"
    },
    {
      "assetid": 2,
      "classid": 1,
      "instanceid": 5
    }
  ]
]

nicowilliams · 2017-12-12T04:56:40Z

With the JOIN functions fixed as below this works:

./jq -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})'
{
  "assetid": 1,
  "name": "alpha"
}
{
  "assetid": 2,
  "name": "beta"
}
{
  "assetid": 2,
  "name": "gamma"
}

The SQLish defs I'm using:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|(type[0:1] + tostring)] |= .+[$row]);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def UNIQUE_INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row |(type[0:1] + tostring)] |=
      if .!= null then "duplicate key error; existing row \(.) conflicts with row \($row)"|error
      else [$row] end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);
def LOOKUP($idx; idx_expr): $idx[idx_expr|(type[0:1] + tostring)]|if .==null then null else .[] end;
def JOIN($idx; idx_expr):
  [.[] | LOOKUP($idx; idx_expr) as $row | [., $row]];
def JOIN($idx; stream; idx_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row] | join_expr;

nicowilliams · 2017-12-12T05:06:36Z

@pkoppstein I left commentary on #1440.

nicowilliams · 2017-12-12T05:07:38Z

Was it in 1.5? If not...

I think you know full well that you introduced them in Jan this year!

:)

nicowilliams added the bug label Nov 28, 2017

nicowilliams added this to the 1.6 release milestone Dec 11, 2017

andrew-schweitzer-analog mentioned this issue Nov 13, 2020

Join keys of 2 arrays #1475

Open

itchyny removed this from the 1.6 release milestone Jun 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

INDEX is currently not collision-free #1444

INDEX is currently not collision-free #1444

pkoppstein commented Jul 7, 2017

ghost commented Aug 13, 2017

pkoppstein commented Aug 18, 2017 •

edited

Loading

nicowilliams commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

pkoppstein commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

pkoppstein commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

pkoppstein commented Dec 11, 2017 •

edited

Loading

nicowilliams commented Dec 12, 2017 •

edited

Loading

nicowilliams commented Dec 12, 2017 •

edited

Loading

pkoppstein commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

INDEX is currently not collision-free #1444

INDEX is currently not collision-free #1444

Comments

pkoppstein commented Jul 7, 2017

ghost commented Aug 13, 2017

pkoppstein commented Aug 18, 2017 • edited Loading

nicowilliams commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

pkoppstein commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

pkoppstein commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

nicowilliams commented Dec 11, 2017

pkoppstein commented Dec 11, 2017 • edited Loading

nicowilliams commented Dec 12, 2017 • edited Loading

nicowilliams commented Dec 12, 2017 • edited Loading

pkoppstein commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

nicowilliams commented Dec 12, 2017

pkoppstein commented Aug 18, 2017 •

edited

Loading

pkoppstein commented Dec 11, 2017 •

edited

Loading

nicowilliams commented Dec 12, 2017 •

edited

Loading

nicowilliams commented Dec 12, 2017 •

edited

Loading