Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join keys of 2 arrays #1475

Open
tsakistan opened this issue Sep 11, 2017 · 30 comments
Open

Join keys of 2 arrays #1475

tsakistan opened this issue Sep 11, 2017 · 30 comments
Labels

Comments

@tsakistan
Copy link

Hello, I have a json with 2 arrays with common id's that I'd like to join but I couldn't find a way to do it:

{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
}

Having the above json as input, I want to get a "name" from "descriptions" for each asset in "assets" by joining classid and instanceid from both arrays.

So the desired output would be:

{"assetid":1, "name":"alpha"}
{"assetid":2, "name":"alpha"}
{"assetid":3, "name":"beta"}
{"assetid":4, "name":"gamma"}

The equivalent in SQL would be something like:
select assets.assetsid, descriptions.name from descriptions, assets where descriptions.classid=assets.classid and descriptions.instanceid=assets.instanceid;

@ghost
Copy link

ghost commented Sep 11, 2017

This is an interesting question! In the future, please submit any usage questions to StackOverflow's jq tag.

First, we want to go through each asset:

.assets[] as $asset

Then, for each asset, we want to go through the descriptions, and find the matching one(s) for the asset; the argument passed to select here would be akin to the where clause in your SQL statement:

(.descriptions[] | select($asset.classid == .classid and $asset.instanceid == .instanceid)) as $description

Finally, we build our desired objects out of our $asset and $description pair:

{"assetid": $asset.assetid, "name": $description.name}

All together now:

.assets[] as $asset | (
  .descriptions[] | select(
    $asset.classid == .classid and
    $asset.instanceid == .instanceid
  )
) as $description | {
  "assetid": $asset.assetid,
  "name": $description.name
}

@ghost
Copy link

ghost commented Sep 11, 2017

A way to make it more readable would be to define a join filter that can be invoked like join(.assets.classid == .descriptions.classid and ...) over the original object, which would filter all possible combinations of assets and descriptions by the query and then return objects containing the merged keys of the asset and description pairs that matched the query. Something like:

def join(where): [
  keys[] as $key |
  [{($key): .[$key][]}]
] | combinations | add | select(where) | add;

You can then use this like:

join(
  .assets.classid == .descriptions.classid and
  .assets.instanceid == .descriptions.instanceid
) | {assetid, name}

@tsakistan
Copy link
Author

This is exactly what I was looking for.
Thank you for the detailed explanation!

@nicowilliams
Copy link
Contributor

There's a JOIN in master:

# SQL-ish operators here:
def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|
      if type != "string" then tojson
      else .
      end] |= $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def JOIN($idx; idx_expr):
  [.[] | [., $idx[idx_expr]]];
def JOIN($idx; stream; idx_expr):
  stream | [., $idx[idx_expr]];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | [., $idx[idx_expr]] | join_expr;
def IN(s): reduce (first(select(. == s)) | true) as $v (false; if . or $v then true else false end);
def IN(src; s): reduce (src|IN(s)) as $v (false; if . or $v then true else false end);

@nicowilliams
Copy link
Contributor

@pkoppstein The JOIN in master is relational.

@pkoppstein
Copy link
Contributor

pkoppstein commented Dec 11, 2017

Here are filters for performing relational joins that may be suitable for your "standard library" and/or builtin.jq. Suggestions for improvement are welcome!

These filters emit streams and are therefore named "joins"; this name will hopefully also avoid confusion and/or conflict with the existing filter: join(str).

With the OP data in data.json, the filter:

[.assets, .descriptions] | joins( .instanceid ) | [.assetid, .name]

produces:

[1,"alpha"]
[2,"alpha"]
[3,"beta"]
[4,"gamma"]

The same result is produced by:

joins(.assets[]; .descriptions[]; .instanceid; .instanceid) | [.assetid, .name]

joins/1, joins/2, joins/4, joins/6

Notes:

  • joins(f) is not simply joins(f;f); rather, it will join arbitrarily many "tables".

  • if the filter, f, produces null on a row, that row will be ignored.

  • f can evaluate to any JSON entity (that is, it is not confined to strings and/or integers)

  • for example, using the OP data, f could be [.instanceid, .classid]

  • joins/6 provides the core functionality, corresponding closely to the SQL INNER JOIN:

    SELECT (Table1|p1), (Table2|p2)
    FROM Table1
    INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)

def distinct(s):
  reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x)
  |.[];

# Relational Join
# joins/6 provides similar functionality to the SQL INNER JOIN statement:
#   SELECT (Table1|p1), (Table2|p2)
#     FROM Table1
#     INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
# where filter1, filter2, p1 and p2 are filters.

# joins(s1; s2; filter1; filter2; p1; p2)
# s1 and s2 are streams of objects corresponding to rows in Table1 and Table2;
# filter1 and filter2 determine the join criteria;
# p1 and p2 are filters determining the final results.
# Input: ignored
# Output: a stream of distinct pairs [p1, p2]
# Note: items in s1 for which filter1 == null are ignored, otherwise all rows are considered.
#
def joins(s1; s2; filter1; filter2; p1; p2):
  def it: type[0:1] + tostring;
  def ix(s;f):
    reduce s as $x ({};  ($x|f) as $y | if $y == null then . else .[$y|it] += [$x] end);
  # combine two dictionaries using the cartesian product of distinct elements
  def merge:
    .[0] as $d1 | .[1] as $d2
    | ($d1|keys_unsorted[]) as $k
    | if $d2[$k] then distinct($d1[$k][]|p1) as $a | distinct($d2[$k][]|p2) as $b | [$a,$b]
      else empty end;

   [ix(s1; filter1), ix(s2; filter2)] | merge;

def joins(s1; s2; filter1; filter2):
  joins(s1; s2; filter1; filter2; .; .) | add ;

# Input: an array of two arrays of objects
# Output: a stream of the joined objects
def joins(filter1; filter2):
  joins(.[0][]; .[1][]; filter1; filter2);

# Input: an array of arrays of objects.
# Output: a stream of the joined objects where f defines the join criterion.
def joins(f):
  # j/0 is defined so TCO is applicable
  def j:
    if length < 2 then .[][]
    else [[ joins(.[0][]; .[1][]; f; f)]] + .[2:] | j
    end;
   j ;

[EDIT (12/13/2017): the core functionality is now provided by joins/6]

@nicowilliams
Copy link
Contributor

These INDEX/JOIN functions help:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|(type[0:1] + tostring)] |= .+[$row]);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def UNIQUE_INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row |(type[0:1] + tostring)] |=
      if .!= null then "duplicate key error; existing row \(.) conflicts with row \($row)"|error
      else [$row] end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);
def LOOKUP($idx; idx_expr): $idx[idx_expr|(type[0:1] + tostring)]|if .==null then null else .[] end;
def JOIN($idx; idx_expr):
  [.[] | LOOKUP($idx; idx_expr) as $row | [., $row]];
def JOIN($idx; stream; idx_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row] | join_expr;

which you can use like this:

./jq -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})'
{
  "assetid": 1,
  "name": "alpha"
}
{
  "assetid": 2,
  "name": "beta"
}
{
  "assetid": 2,
  "name": "gamma"
}

To break this down:

  • INDEX(.assets[]; .assetid) produces an index of .assets by .assetid
  • JOIN/4
    • takes an index
    • a stream (here .descriptions[])
    • an expression (here .classid) to extract a key from that stream that will be compared for equality with the keys from the index
    • an expression to merge the two matching rows on the left and and right of the join

@nicowilliams
Copy link
Contributor

Incidentally, I've a branch that adds varargs support to jq. That will allow us to have multi-column group- and sort-by functions, and multi-column indices that let INDEX figure out how to encode the keys. And much more too (state machines, a proper switch statement, and so on).

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})'
{"assetid": 1, "name": "alpha" }
{"assetid": 2, "name": "beta" }
{"assetid": 2, "name": "gamma" }

And there's the main problem. As the OP mentioned, the output is supposed to have 4 results: two for "alpha" and one each for "beta" and "gamma".

The other problem, to me at least, is the hairiness of the human interface to JOIN. My joins can be called in the same way as in SQL, i.e. with only direct references to the "tables" in question and to the matching criterion, e.g.

 [.assets, .descriptions] | joins( .instanceid ) 

@nicowilliams
Copy link
Contributor

nicowilliams commented Dec 12, 2017

@pkoppstein If you look at the data, and compute the desired join by hand, you don't get the example result given. The assets object has 5 distinct keys (1..5), while descriptions had two repeated keys (2):

    [
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}
    ]

EDIT: I assume the example desired result was just an example of the form of desired result, not the one that would necessarily result from the given example input data.

@pkoppstein
Copy link
Contributor

pkoppstein commented Dec 12, 2017

@nicowilliams - With your JOIN and the following query (which corresponds with the one specified by the OP), the four "expected" results are obtained:

JOIN(INDEX(.assets[]; [.classid,.instanceid]); .descriptions[]; [.classid,.instanceid];
     {assetid:(.[1].assetid),name:(.[0].name)})
$ jq -c -f JOIN.jq  assets-descriptions.json
{"assetid":1,"name":"alpha"}
{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}

So at least we're on the same page now.

But I would like to see a simpler interface ...

p.s. You could change |= .+ to += in INDEX/2, no?

@nicowilliams
Copy link
Contributor

@pkoppstein No, += is kinda broken. There's an issue open for that.

I'll think about simplifying the interface.

@pkoppstein
Copy link
Contributor

pkoppstein commented Dec 12, 2017

+= is kinda broken. There's an issue open for that.

Yes, I opened it (#1508). I don't think it's relevant in this particular case.

@nicowilliams
Copy link
Contributor

Well, OK, but it hardly makes a difference which I use.

@pkoppstein
Copy link
Contributor

a proper switch statement

Can't wait!

FYPI, I've updated the posting that defines the family of joins functions. Core functionality is now provided by joins(s1; s2; filter1; filter2; p1; p2) corresponding very closely to:

SELECT (Table1|p1), (Table2|p2)
FROM Table1
INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)

@nicowilliams
Copy link
Contributor

@pkoppstein What I think I'll do is remove the SQL-ish functions in master, finish 1.6, then revisit.

One of the issues we have right now is that linking jq programs is quadratic. Every additional builtin significantly slows down jq start up. I've one thing I could commit soon that will help this, but the real thing we must do is make linking linear, and I only have a half-baked attempt at that.

@pkoppstein
Copy link
Contributor

remove the SQL-ish functions in master

I agree that JOIN can wait; I would like to see a decent IN (assuming there's no time to fix index/1); and certainly if set/1 makes it in, then INDEX can wait too :-)

One thought on Erik Brinkman's contributions: If you delete all the rand functions in builtin.jq but retain the core functionality, we can have our (random) cake and eat it (without quadratic concerns)!

@nicowilliams
Copy link
Contributor

@pkoppstein Everything can wait. What I'd like to do is:

  • release 1.6
  • take @erikbrinkman's PR
  • add your builtins and then some
  • fix linking
  • fix bugs not fixed in 1.6
  • release 1.7

@nicowilliams
Copy link
Contributor

I just don't have time to fix linking now, and we really can't be making startup time much worse.

@pkoppstein
Copy link
Contributor

pkoppstein commented Dec 13, 2017

@nicowilliams - The thing is that INDEX and especially IN have been quite widely "advertised" and used, so if you drop them both without a replacement, it will hurt. As already mentioned, IN is especially valuable in light of the performance issue with index/1.

Since "master" has been able to bear the weight of the SQLish additions without (to my knowledge) anyone raising performance as an issue, it seems to me that it would be reasonable to:

  • drop 5 of the SQLish functions (three JOIN and two INDEX)
  • add set/1, lookup/1, and distinct/1
  • improve IN
  • add the rand stuff with NO builtin.jq functions

That way, we'd be down a net of 2 builtins from where we are now in "master".

But, yes, it's time for 1.6 ... so no need to respond to this posting.

@caramdache
Copy link

caramdache commented Jan 19, 2018

@pkoppstein - I like the simple interface ! However, I have a need for LEFT JOIN. Could that be supported by your code easily?

Here is what I have come up so far, based on https://stackoverflow.com/questions/39830426/join-two-json-files-based-on-common-key-with-jq-utility-or-alternative-way-from:

def leftJoin(a1; a2; key):
  def akey: key | if type == "string" then . else tojson end;
  # hash phase:
  a1 as $a1 # content of a1 seems to be lost otherwise (memory issue?)
  | a2 | INDEX(key) as $h2
  # join phase:
  | reduce $a1[] as $a
      ( [];
        ($a|akey) as $key
        | . + [ $a + if $h2[$key] then $h2[$key] else {} end ]) ;

leftJoin( $file1; $file2; [.id, .ud, .data])

@pkoppstein
Copy link
Contributor

pkoppstein commented Jan 20, 2018

[EDIT: left_joins/6 was corrected on Jan 21, 2018.]

@caramdache - Here are lightly-tested “LEFT JOIN” filters. I've also revised distinct/1 so that it will handle strings more efficiently.

# Emit a stream of the distinct items in the stream
# (avoid tojson)
def distinct(stream):
   reduce stream as $s ({};
     ($s|type) as $t
     | (if $t == "string" then $s else ($s|tostring) end) as $y
     | .[$t][$y] = [$s] )
   | .[][];

# Left Join
# left_joins/6 provides similar functionality to the SQL LEFT JOIN statement:
#   SELECT (Table1|p1), (Table2|p2)
#     FROM Table1
#     LEFT JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
# where filter1, filter2, p1 and p2 are jq filters.

# left_joins(s1; s2; filter1; filter2; p1; p2)
# s1 and s2 are streams of objects corresponding to rows in Table1 and Table2;
# filter1 and filter2 determine the join criteria;
# p1 and p2 are projections determining the final results.
# Input: ignored
# Output: a stream of distinct pairs [p1, p2]
# Note: items in s1 for which filter1 == null are ignored,
# otherwise all rows are considered.
#
def left_joins(s1; s2; filter1; filter2; p1; p2):
  def it: type[0:1] + tostring;
  def ix(s;f):
    reduce s as $x ({};  ($x|f) as $y | if $y == null then . else .[$y|it] += [$x] end);
  # combine two dictionaries using the cartesian product of distinct elements
  def merge:
    .[0] as $d1 | .[1] as $d2
    | ($d1|keys_unsorted[]) as $k
    | distinct($d1[$k][]|p1) as $a
    | if $d2[$k] then distinct($d2[$k][]|p2) as $b | [$a,$b]
      else [$a, null]
      end ;

   [ix(s1; filter1), ix(s2; filter2)] | merge;

def left_joins(s1; s2; filter1; filter2):
  left_joins(s1; s2; filter1; filter2; .; .) | add ;

# Input: an array of two arrays of objects
# Output: a stream of the joined objects
def left_joins(filter1; filter2):
  left_joins(.[0][]; .[1][]; filter1; filter2);

# Input: an array of arrays of objects.
# Output: a stream of the joined objects where f defines the join criterion.
def left_joins(f):
  # j/0 is defined so TCO is applicable
  def j:
    if length < 2 then .[][]
    else [[ left_joins(.[0][]; .[1][]; f; f)]] + .[2:] | j
    end;
   j ;

@caramdache
Copy link

Thanks @pkoppstein, I'll test this first thing on Monday!

I have some questions in the meantime:

  • Why the need to use distinct always? In SQL, this is a separate keyword that you can add -or not- to the query and shouldn't it be optional here as well?
  • I am uncertain as to whether items in s1 for which filter1 == null should always be ignored.
  • In my use case, I do 2 consecutive left_joins and the objects I am joining to already contain all the fields that I want to obtain as a result. So p1 and p2 are the entire set of fields in my case. How can I specify that using your API? I don't exactly want to specify each field manually because there are over 20 in my case.

@pkoppstein
Copy link
Contributor

pkoppstein commented Jan 21, 2018

@caramdache -

Regarding distinct - first, it seems to me that including distinct is far more useful (not least because otherwise you can easily get counterproductive multiplicative behavior); second, you can omit it if you like :-)

Regarding filter1 == null - same kind of reasoning. Also, a row that is equal tonull (as opposed to a compound entity that contains null) really represents the absence of a "row".

Regarding p1 and p2 - in your case, you will probably want to use left_joins/4. Or you could just specify . for p1 and/or p2. That's the beauty of jq.

@caramdache
Copy link

@pkoppstein, I was thinking distinct and filter1 ==null could be made an option, if we are going to include your code in jq 1.6, which I hope. Regarding p1 and p2, I see. jq feels magical sometimes.

In join/6, isn't there an add missing, which is present in join/4, joing/2 and join/1, or am I missing something obvious?

@pkoppstein
Copy link
Contributor

pkoppstein commented Jan 21, 2018

@caramdache - please note that left_joins/6 has been corrected.

The arity-6 filters are intended to be “low-level building blocks”(*) whereas all the others are intended to be more familiar - that’s why only the arity-4 filters have add explicitly. More documentation and examples are obviously needed regarding this and your other points. Conversely, it would probably help if you could provide some compelling examples to illustrate your concerns.

(*) Notably, the user might wish to define a transformation of the outputs, [$a, $b], that cannot be specified as a function of ($a + $b).

@caramdache
Copy link

caramdache commented Jan 22, 2018

@pkoppstein, this seems to work correctly, as far as I can see on a dataset of 1861 items.

In actual usage, the interface feels a little awkward. I have 2 consecutive joins to perform. With my previous leftJoin/3 method, I could simply pipe them through:

leftJoin($t1; $t2; [.f1]) | leftJoin(.; $t3; [.f2, .f3])

Here, it seems that I have to map(add) / slurp the result of the first join, before I can perform the second join. In addition, I have to create the input arrays, which makes the code a little harder to read. But maybe I am missing something obvious.

BTW, I am not quite sure what changes you've made between your 2 edits. Based on what I can remember, the code looks the same.

@caramdache
Copy link

Thinking more about it, an even more natural syntax seems like:

$t1 | leftJoin($t2; [.f1]) | leftJoin($t3; [.f2, .f3])

@pkoppstein
Copy link
Contributor

@caramdache --

  1. Thanks for testing!

  2. As I mentioned, the arity-6 filters are intended to be "low-level" stream-oriented filters. If any of the others are applicable, it would make sense to use them; likewise, it might make sense to define intermediate-level building blocks on top of them.

  3. jq is fundamentally stream-oriented, so I think that the signatures of the arity-6 filters and the fact that they emit streams are basically appropriate. Please also note that I've included "joins" in their names, first to emphasize their stream-oriented nature, and second to free-up similar names (e.g. left_join).

  4. Assuming the tables $left and $right are presented as arrays, I believe your leftJoin/2 could be implemented as:

    def leftJoin($right; filter): [left_joins(.[]; $right[]; filter)];

@andrew-schweitzer-analog

Maybe I've misunderstood, but it seems that the JOIN example above on the original data / original expected output is not correct, or was correct at the time but is not now correct with jq-1.6. That example is (approximately)

jq -c -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(
	INDEX(.assets[]; [.classid,.instanceid]); 
	.descriptions[]; 
	[.classid,.instanceid]|tostring;
 	{assetid:(.[1].assetid),name:(.[0].name)}
)'

which with jq-1.6 produces:

{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}

but the original post expected:

{"assetid":1,"name":"alpha"}
{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"} 

I think the issue is that the INDEX expression 'loses' one of the two rows that have "classid":1 and "isntanceid":1, as discussed here.

I think if the index is built on descriptions rather than on assets we can produce the original expected results:

jq -c -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(
	INDEX(.descriptions[]; [.classid, .instanceid]); 
	.assets[];
	[.classid,.instanceid]|tostring;
	{assetid:(.[0].assetid),name:(.[1].name)}
)'

Am I missing something?

Thanks,

Andy

@itchyny itchyny added the support label Jun 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants