Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INDEX is currently not collision-free #1444

Open
pkoppstein opened this issue Jul 7, 2017 · 18 comments
Open

INDEX is currently not collision-free #1444

pkoppstein opened this issue Jul 7, 2017 · 18 comments
Labels

Comments

@pkoppstein
Copy link
Contributor

Consider for example:

jq -n '[{id:1, x:1}, {id:"1", x:2}] | INDEX(.id) | length'

1

Currently, INDEX usestojson except when the key is a string, presumably for efficiency.

If INDEX should be collision-free, is there a better alternative than using tojson unconditionally?

If the current implementation is intended, then the manual should provide a prominent warning about collisions.

@ghost
Copy link

ghost commented Aug 13, 2017

I would argue that the current behavior is not intended, or at least that it shouldn't be.

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Aug 18, 2017

Here is a collision-free version of INDEX that is also efficient (i.e., it does NOT call tojson on strings). The idea is to use a dictionary with paths of the form [type, str] rather than just [str]. To recover the values in the dictionary, therefore, one would use .[][] rather than just .[]

def TYPEINDEXOF(f):
   f as $x | ($x|type) as $xt | [$xt, if $xt == "string" then $x else $x|tojson end];

def TYPEINDEX(stream; f):
  reduce stream as $s ({}; setpath($s|TYPEINDEXOF(f); $s));

def TYPEINDEX(f): TYPEINDEX(.[]; f);

Example:

[1, {a:2}, "1"] | TYPEINDEX(.) | .[][]

yields:

 1
 {"a":2}
 "1"

@nicowilliams nicowilliams added this to the 1.6 release milestone Dec 11, 2017
@nicowilliams
Copy link
Contributor

@pkoppstein to_json is used only because the index is kept as an object (which accepts only strings as keys). Not applying to_json to strings is an optimization, but wrong.

@nicowilliams
Copy link
Contributor

INDEX right now does look like something of a unique index, minus the errors one would expect on collision.

The index it builds should have arrays of values as values. That will have knock-on effects on JOIN.

@nicowilliams
Copy link
Contributor

A UNIQUE_INDEX would also be nice -- it should raise an error when uniqueness is violated.

@pkoppstein
Copy link
Contributor Author

A UNIQUE_INDEX would also be nice

Yes, but is it too late for radical surgery for INDEX, given that it has been used quite extensively?

If radical surgery it is to be, then I would offer this collision-free version of UNIQUE_INDEX:

def UNIQUE_INDEX(stream; idx_expr):
  def it: type[0:1] + tostring;
  reduce stream as $row ({};
    ($row|idx_expr|it) as $k
    | if has($k) then error("UNIQUE_INDEX constraint violation: key = \($row|idx_expr)")
      else .[$k] = $row end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);

Test cases:

UNIQUE_INDEX(.) | [.[]]
[1,{a:2},"1"]
[1,{"a":2},"1"]

try UNIQUE_INDEX(.) catch 0
[1,{a:2},"1"]
0

@nicowilliams
Copy link
Contributor

@pkoppstein Was it in 1.5? If not...

@pkoppstein
Copy link
Contributor Author

Was it in 1.5? If not...

I think you know full well that you introduced them in Jan this year!

Please note that the "collision" issue that I raised refers to the collision between different JSON values -- e.g. the number 1 and the string "1". Apart from collisions in this sense, the current definition of INDEX is quite faithful to the jq philosophy as exhibited by add on JSON objects -- the last object gets the last laugh.

In my own work, I use the term "buckets" for array-valued dictionaries that gather items with the same "index". In my view, jq should support both buckets and the "INDEX" family of functions.

Since we're all agreed that INDEX as currently conceived should be collision-free (in the sense that we don't want different values to collide), and since TYPEINDEX defined above breaks the "contract" that INDEX provides -- namely that INDEX()[] should provide the indexed values -- I would propose the following small revisions to the current INDEX functions:

def INDEX(stream; idx_expr):
  def it: type[0:1] + tostring;
  reduce stream as $row ({}; .[$row|idx_expr|it] = $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);

@nicowilliams
Copy link
Contributor

@pkoppstein But these are meant to be SQLish. A non-UNIQUE index in a typical RDBMS allows multiple rows to have the same index key. So should the INDEX functions in jq -- "last wins" (or "first wins") is no fun if you're trying to implement relational semantics :)

@nicowilliams
Copy link
Contributor

@pkoppstein I do like your method of avoiding applying tojson to strings.

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Dec 11, 2017

your method of avoiding applying tojson to strings

Which one - using type[0:1] + tostring or TYPEINDEXOF(f)?

I suspect the former would be slightly better overall in terms of performance, but have no empirical evidence.

Apart from that, are we going to have both INDEX as a bucketizer (*), and INDEX_UNIQUE as discussed before?

(*) That is, on the assumption that def it is used to avoid applying tojson on strings:

# buckets
def INDEX(stream; f):
  def it: type[0:1] + tostring;
  reduce stream as $row ({}; .[$row|f|it] += $row);
def INDEX(f): INDEX(.[]; f);

@nicowilliams
Copy link
Contributor

nicowilliams commented Dec 12, 2017

@pkoppstein I'm thinking of something like:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|(type[0:1] + tostring)] |= .+[$row]);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def UNIQUE_INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row |(type[0:1] + tostring)] |=
      if .!= null then "duplicate key error; existing row \(.) conflicts with row \($row)"|error
      else [$row] end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);
def LOOKUP($idx; idx_expr): $idx[idx_expr|(type[0:1] + tostring)]|if .==null then null else .[] end;
def JOIN($idx; idx_expr):
  [.[] | (LOOKUP($idx; idx_expr)) as $row | [., $row]];
def JOIN($idx; stream; idx_expr):
  stream | ($idx[idx_expr][]) as $row | [., $row];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | ($idx[idx_expr][]) as $row | [., $row] | join_expr;

@nicowilliams
Copy link
Contributor

nicowilliams commented Dec 12, 2017

The JOIN is an outer join, so maybe it should be named a bit better, but I think this is probably fine since it's trivial to filter it so that it is an inner join.

EDIT: But the docs should make this clear.

@pkoppstein
Copy link
Contributor Author

@nicowilliams -

(1) I'm afraid your JOIN baffles me. I'm obviously missing something that seems obvious to you, so would you please show how you'd use JOIN to solve the problem posed by the OP at #1475 ?

(2) While your attention is focused on these SQLish operators, could I please ask you to look at IN in the light of PR #1440 ? I've been using the proposed update for IN given there for months:

def IN(s): . as $in | first( if (s == $in) then true else empty end ) // false;
def IN(src; s): any( src|IN(s); .);

The first of these is sorely needed because the prospects for fixing index/1 in the near future seem so dim.

Tx!

@nicowilliams
Copy link
Contributor

@pkoppstein Er, some of the JOIN need to change to use LOOKUP, but this works:

$ ./jq -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | .assets as $assets | .descriptions | JOIN(INDEX($assets[]; .assetid); .classid)'
[
  [
    {
      "classid": 1,
      "instanceid": 5,
      "name": "alpha"
    },
    {
      "assetid": 1,
      "classid": 1,
      "instanceid": 5
    }
  ],
  [
    {
      "classid": 2,
      "instanceid": 3,
      "name": "beta"
    },
    {
      "assetid": 2,
      "classid": 1,
      "instanceid": 5
    }
  ],
  [
    {
      "classid": 2,
      "instanceid": 1,
      "name": "gamma"
    },
    {
      "assetid": 2,
      "classid": 1,
      "instanceid": 5
    }
  ]
]

@nicowilliams
Copy link
Contributor

With the JOIN functions fixed as below this works:

./jq -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})'
{
  "assetid": 1,
  "name": "alpha"
}
{
  "assetid": 2,
  "name": "beta"
}
{
  "assetid": 2,
  "name": "gamma"
}

The SQLish defs I'm using:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|(type[0:1] + tostring)] |= .+[$row]);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def UNIQUE_INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row |(type[0:1] + tostring)] |=
      if .!= null then "duplicate key error; existing row \(.) conflicts with row \($row)"|error
      else [$row] end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);
def LOOKUP($idx; idx_expr): $idx[idx_expr|(type[0:1] + tostring)]|if .==null then null else .[] end;
def JOIN($idx; idx_expr):
  [.[] | LOOKUP($idx; idx_expr) as $row | [., $row]];
def JOIN($idx; stream; idx_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row] | join_expr;

@nicowilliams
Copy link
Contributor

@pkoppstein I left commentary on #1440.

@nicowilliams
Copy link
Contributor

Was it in 1.5? If not...

I think you know full well that you introduced them in Jan this year!

:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants